Please see my other blog for Oracle EBusiness Suite Posts - EBMentors

Search This Blog

Note: All the posts are based on practical approach avoiding lengthy theory. All have been tested on some development servers. Please don’t test any post on production servers until you are sure.

Friday, June 23, 2017

Building Teradata Presto Cluster


Prerequisites:
Before working on this post you should review below posts.



Installing/Configuring PrestoDB
Working with PrestoDB Connectors

In this post , I'll be covering below

1- Installing and configuring Presto Admin
2- Installing Presto Cluster on a single node
3- Using Presto ODBC Driver
4- Installing and configuring Presto Cluster with one coordinator and three workers

Download Teradata Presto Distribution
You need to download two packages from Teradata for this post

Server Package - contains Presto Server
http://www.teradata.com/products-and-services/Presto/Download-Get-Started

unzip it , you will see prestoadmin offline package installer, presto server RPM and presto client Jar

Client Package - contains client utilities eg; ODBC Drivers
http://teradata.github.io/presto/docs/127t/installation/odbc.html

Installing Presto Admin

Presto Admin is a tool for installing and managing the Presto query engine on a cluster. It provides easy-to-use commands to:
  • Install and uninstall Presto across your cluster
  • Configure your Presto cluster
  • Start and stop the Presto servers
  • Gather status and log information from your Presto cluster
Prerequisites:
Java 8
Python 2.6 or Python 2.7.

If you are using the online installer then make sure you’ve installed the Python development package for your system. For RedHat/Centos that package is python2-devel and for Debian/Ubuntu it is python-dev.


Presto Admin is packaged as an offline installer – prestoadmin-<version>-offline.tar.gz.

1- Copy the installer prestoadmin-<version>-offline.tar.gz to the location where you want presto-admin to run eg; /usr/hadoopsw. Note that presto-admin does not have to be on the same node(s) where Presto will run, though it does need to have SSH access to all of the nodes in the cluster.

2- Extract and run the installation script from within the prestoadmin directory.
[presto@dn04 hadoopsw]$ ll
total 648140
-rw-r--r--  1 root       root         9456552 Mar 29 21:57 prestoadmin-2.2-offline.el7.tar.gz

[root@dn04 hadoopsw]# tar -xvf prestoadmin-2.2-offline.el7.tar.gz
[root@dn04 hadoopsw]# cd prestoadmin/
[root@dn04 prestoadmin]# ll
total 100
-rwxr-xr-x 1 linuxadmin linuxadmin  2710 Mar 24 01:01 install-prestoadmin.sh
-rw-rw-r-- 1 linuxadmin linuxadmin 91414 Mar 24 01:01 prestoadmin-2.2-py2-none-any.whl
drwxrwxr-x 2 linuxadmin linuxadmin  4096 Jun 18 13:55 third-party

[root@dn04 hadoopsw]# cd prestoadmin/
[root@dn04 prestoadmin]# mkdir conf

[root@dn04 prestoadmin]# mkdir log
[root@dn04 prestoadmin]# ./install-prestoadmin.sh

The installation script will create a presto-admin-install directory and an executable presto-admin script. By default, the presto-admin config and log directory locations are
configured to be ~/.prestoadmin and ~/.prestoadmin/log, respectively. This can be changed by modifying the environment variables, PRESTO_ADMIN_CONFIG_DIR and PRESTO_ADMIN_LOG_DIR.

The installation script will also create the directories pointed to by PRESTO_ADMIN_CONFIG_DIR and PRESTO_ADMIN_LOG_DIR. If those directories already exist, the installation script will not erase their contents.

3- Verify that presto-admin was installed properly by running the following command:

[root@dn04 prestoadmin]# cd presto-admin-install/bin
[root@dn04 bin]# ./presto-admin --help
Usage: presto-admin [options] <command> [arg]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -d, --display         print detailed information about command
  --extended-help       print out all options, including advanced ones
  -I, --initial-password-prompt
                        Force password prompt up-front
  -p PASSWORD, --password=PASSWORD
                        password for use with authentication and/or sudo


Commands:
    catalog add
    catalog remove
    collect logs
    collect query_info
    collect system_info
    configuration deploy
    configuration show
    file copy
    file run
    package install
    package uninstall
    plugin add_jar
    server install
    server restart
    server start
    server status
    server stop
    server uninstall
    server upgrade
    topology show


Configuring Presto-Admin

In order to use presto-admin to manage software on a cluster of nodes, you must specify a configuration for presto-admin. This configuration indicates the nodes on which to install as well as other credentials.

Set the environment variables

[root@dn04 bin]# vi ~/.bash_profile
[root@dn04 bin]# cat ~/.bash_profile

## Teradata Presto Distribution Variables
export PRESTO_ADMIN_HOME=/usr/hadoopsw/prestoadmin
export PRESTO_ADMIN_CONFIG_DIR=$PRESTO_ADMIN_HOME/conf
export PRESTO_ADMIN_LOG_DIR=$PRESTO_ADMIN_HOME/log
export PATH=$PATH:$PRESTO_ADMIN_HOME/presto-admin-install/bin

[root@dn04 bin]# source ~/.bash_profile

Create a file $PRESTO_ADMIN_CONFIG_DIR/config.json , Use the following configuration as a template

{
"username": "<ssh_user_name>",
"port": "<ssh_port>",
"coordinator": "<host_name>",
"workers": ["<host_name_1>", "<host_name_2>", ... "<host_name_n>"],
"java8_home":"<path/to/java8/on/presto/nodes>"
}

All of the properties are optional, and if left out the following defaults will be used:

{
"username": "root",
"port": "22",
"coordinator": "localhost",
"workers": ["localhost"]
}

Installing the Presto Server (Single Node)                        

To install the Presto query engine on a cluster of nodes using presto-admin:
  • Download presto-server-rpm-VERSION.ARCH.rpm
  • Copy the RPM to a location accessible by presto-admin eg; /usr/hadoopsw/.
  • Run the following command to install Presto:
$ ./presto-admin server install <local_path_to_rpm>

[root@dn04 ~]# cd /usr/hadoopsw/
[root@dn04 hadoopsw]# ll
total 1060820
-rw-r--r--  1 root       root       422579061 Mar 31 05:19 presto-server-rpm-0.167-t.0.2.x86_64.rpm

[root@dn04 hadoopsw]# ssh-keygen -t rsa
[root@dn04 hadoopsw]# ssh-copy-id root@localhost
[root@dn04 hadoopsw]# presto-admin server install presto-server-rpm-0.167-t.0.2.x86_64.rpm
Using rpm_specifier as a local path
Fetching local presto rpm at path: presto-server-rpm-0.167-t.0.2.x86_64.rpm
Found existing rpm at: presto-server-rpm-0.167-t.0.2.x86_64.rpm
Deploying rpm on localhost...
Package deployed successfully on: localhost
Package installed successfully on: localhost
Deploying configuration on: localhost
Deploying tpch.properties catalog configurations on: localhost

Presto is now installed on the coordinator and workers ie; localhost

Now start the presto server

[root@dn04 hadoopsw]# presto-admin server start
[localhost] out: Starting presto
[localhost] out: Started as 31018
[localhost] out:
Waiting to make sure we can connect to the Presto server on localhost, please wait. This check will time out after 2 minutes if the server does not respond.
Server started successfully on: localhost

Note: If you make any change in config.properties file for coordinator or worker , you will need to redeploy the configuration as below

presto-admin configuration deploy
Running Presto Queries
Copy the presto-cli to the location you want to run it from. This location may be any node that has network access to the coordinator. Rename the artifact to prestocli and make it executable.

[hdpsysuser@nn01 ~]$ mv presto-cli-0.167-t.0.2-executable.jar prestocli
[root@nn01 hadoopsw]# chmod +x prestocli
[root@nn01 hadoopsw]# ./prestocli --server localhost:8080
presto> select count(*) from tpch.tiny.lineitem;
 _col0
-------
 60175
(1 row)

Query 20170618_140237_00002_tasy2, FINISHED, 1 node
Splits: 33 total, 33 done (100.00%)
0:03 [60.2K rows, 0B] [20.5K rows/s, 0B/s]

presto> exit;

Adding a Catalog
To add a catalog for the Hive connector:

1- Create a file hive.properties in PRESTO_ADMIN_CONFIG_DIR/catalog with the following content:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://192.168.49.134:9083

2- Distribute the configuration file
[root@dn04 ~]# presto-admin catalog add hive

Deploying hive.properties catalog configurations on: localhost

3- Restart Presto:
[root@dn04 ~]# presto-admin server restart
[localhost] out: Stopping presto
[localhost] out: Stopped 31018
[localhost] out:
[localhost] out: Starting presto
[localhost] out: Started as 1874
[localhost] out:
Waiting to make sure we can connect to the Presto server on localhost, please wait. This check will time out after 2 minutes if the server does not respond.
Server started successfully on: localhost

4- Run a  query again  with single node
presto> select count(*) from hive.scott.wordcount;
   _col0
-----------
 295379535
(1 row)

Query 20170618_143118_00007_d68us, FINISHED, 1 node
Splits: 384 total, 384 done (100.00%)
2:34 [295M rows, 16.6GB] [1.92M rows/s, 111MB/s]

Please observe the query output statistics , we will compare it once we will have Presto cluster with 3 workers later in this post.

Configuring ODBC Driver for Teradata Presto

1- Download the drivers from below location and unzip file
http://teradata.github.io/presto/docs/127t/installation/odbc.html

2- Install ODBC driver as per your OS, I used Windows 64bit, and configure ODBC data source


After this you can use this DSN with your desired tool eg; MS PowerBI, Excel, Tableau etc.


Installing the Presto Server (Clustered)               
 
Scenario: Need to install Presto server on a 4 node cluster, one node with coordinator and three nodes with workers using a specific OS user ie; presto.

Coordinater node ==> en01
Worker Nodes ===> dn01,dn02,dn03

Create OS user on all the nodes to be used for Presto cluster and establish the SSH between coordinator and workers nodes. Put the presto user in sudoer list also.

useradd presto
# passwd presto

Generate Keys
[presto@en01 ~]$ su - presto
[presto@en01 ~]$ ssh-keygen -t rsa

Copy Keys to worker nodes
[presto@en01 ~]$ ssh-copy-id presto@dn01
[presto@en01 ~]$ ssh-copy-id presto@dn02
[presto@en01 ~]$ ssh-copy-id presto@dn03

Test the ssh connectivity with worker nodes
[presto@en01 ~]$ ssh 'presto@dn01'
[presto@en01 ~]$ ssh 'presto@dn02'
[presto@en01 ~]$ ssh 'presto@dn03'

Add presto user in sudoer list on all nodes
[presto@en01 hadoopsw]$ su -
[root@en01 ~]# usermod -aG wheel presto
[root@dn01 ~]# usermod -aG wheel presto
[root@dn02 ~]# usermod -aG wheel presto
[root@dn03 ~]# usermod -aG wheel presto

Run visudo utility and add the user in file to run any command without passoword
search for the below line

## Allows people in group wheel to run all commands
and then add the below line 
presto          ALL=(ALL)       NOPASSWD: ALL

[root@en01 ~]# visudo
presto          ALL=(ALL)       NOPASSWD: ALL

Save the file (:wq) and check if it is working 

[presto@en01 hadoopsw]$ sudo whoami
root
[presto@dn01 ~]$ sudo whoami
root
[presto@dn02 ~]$ sudo whoami
root
[presto@dn03 ~]$ sudo whoami
root

Install and Configure Presto-Admin on Coordinator

Install and configure Presto-Admin on coordinator node as prescribed above in this post.

[presto@en01 hadoopsw]$ sudo tar -xvf prestoadmin-2.2-offline.el7.tar.gz
[presto@en01 hadoopsw]$ cd prestoadmin/

[presto@en01 prestoadmin]$ sudo mkdir conf
[presto@en01 prestoadmin]$ sudo mkdir log

[presto@en01 prestoadmin]$ sudo ./install-prestoadmin.sh
[presto@en01 prestoadmin]$ cd presto-admin-install/bin
[presto@en01 bin]$ sudo ./presto-admin --help

Create a file $PRESTO_ADMIN_CONFIG_DIR/config.json , Use the following configuration.

[presto@en01 prestoadmin]$ sudo vi $PRESTO_ADMIN_CONFIG_DIR/config.json

{
"username": "presto",
"port": "22",
"coordinator": "en01",
"workers": ["dn01","dn02","dn03"]
}

Now install presto server, I opted to use an interactive password (-I command line option) prompt, which prompts you for the initial value of your password before running any commands: 


[presto@en01 ~]$ sudo  /usr/hadoopsw/prestoadmin/presto-admin server install /usr/hadoopsw/presto-server-rpm-0.167-t.0.2.x86_64.rpm -I
Initial value for env.password:
Using rpm_specifier as a local path
Fetching local presto rpm at path: /usr/hadoopsw/presto-server-rpm-0.167-t.0.2.x86_64.rpm
Found existing rpm at: /usr/hadoopsw/presto-server-rpm-0.167-t.0.2.x86_64.rpm
Deploying rpm on dn03...
Deploying rpm on dn02...
Deploying rpm on dn01...
Deploying rpm on en01...
Package deployed successfully on: dn03
Package deployed successfully on: dn02
Package deployed successfully on: dn01
Package deployed successfully on: en01
Package installed successfully on: dn03
Deploying configuration on: dn03
Deploying tpch.properties catalog configurations on: dn03
Package installed successfully on: dn02
Deploying configuration on: dn02
Package installed successfully on: en01
Deploying configuration on: en01
Deploying tpch.properties catalog configurations on: dn02
Deploying tpch.properties catalog configurations on: en01
Package installed successfully on: dn01
Deploying configuration on: dn01
Deploying tpch.properties catalog configurations on: dn01

Presto is now installed on the coordinator and workers.

The installation script will create a presto-admin-install directory and an executable presto-admin script. By default, the presto-admin config and log directory locations are configured to be ~/.prestoadmin and ~/.prestoadmin/log, respectively. 

Start the presto server

[presto@en01 ~]$ sudo /usr/hadoopsw/prestoadmin/presto-admin server start -p presto
[dn03] out: Starting presto
[dn03] out: Started as 3251
[dn03] out:
Waiting to make sure we can connect to the Presto server on dn03, please wait. This check will time out after 2 minutes if the server does not respond.
[dn02] out: Starting presto
[en01] out: Starting presto
[dn02] out: Started as 15087
[dn02] out:
Waiting to make sure we can connect to the Presto server on dn02, please wait. This check will time out after 2 minutes if the server does not respond.
[en01] out: Started as 8452
[en01] out:
Waiting to make sure we can connect to the Presto server on en01, please wait. This check will time out after 2 minutes if the server does not respond.
[dn01] out: Starting presto
[dn01] out: Started as 11767
[dn01] out:
Waiting to make sure we can connect to the Presto server on dn01, please wait. This check will time out after 2 minutes if the server does not respond.
Server started successfully on: en01
Server started successfully on: dn02
Server started successfully on: dn03
Server started successfully on: dn01

Check Status command line/web

[presto@en01 ~]$ sudo /usr/hadoopsw/prestoadmin/presto-admin server status -p presto
Server Status:
        en01(IP: 192.168.49.134, Roles: coordinator): Running
        Node URI(http): http://192.168.49.134:8080
        Presto Version: 0.167-t.0.2
        Node status:    active
        Catalogs:     system, tpch
Server Status:
        dn01(IP: 192.168.49.135, Roles: worker): Running
        Node URI(http): http://192.168.49.135:8080
        Presto Version: 0.167-t.0.2
        Node status:    active
        Catalogs:     system, tpch
Server Status:
        dn02(IP: 192.168.49.136, Roles: worker): Running
        Node URI(http): http://192.168.49.136:8080
        Presto Version: 0.167-t.0.2
        Node status:    active
        Catalogs:     system, tpch
Server Status:
        dn03(IP: 192.168.49.137, Roles: worker): Running
        Node URI(http): http://192.168.49.137:8080
        Presto Version: 0.167-t.0.2
        Node status:    active
        Catalogs:     system, tpch





Show Topology

[presto@en01 ~]$  sudo /usr/hadoopsw/prestoadmin/presto-admin topology show
{'coordinator': u'en01',
 'port': 22,
 'username': u'presto',
 'workers': [u'dn01',
             u'dn02',
             u'dn03']}


Adding a Catalog and running query

To add a catalog for the Hive connector:

1- Create a file hive.properties in  /root/.prestoadmin/catalog with the following content:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://192.168.49.134:9083

2- Distribute the configuration file

[presto@en01 ~]$ sudo /usr/hadoopsw/prestoadmin/presto-admin catalog add hive -p presto
Deploying hive.properties catalog configurations on: dn03
Deploying hive.properties catalog configurations on: dn02
Deploying hive.properties catalog configurations on: dn01
Deploying hive.properties catalog configurations on: en01

3- Restart Presto:
[presto@en01 ~]$ sudo /usr/hadoopsw/prestoadmin/presto-admin server restart -p presto

4- Copy the presto-cli ()to the location you want to run it from. This location may be any node that has network access to the coordinator. Rename the artifact to prestocli and make it executable.

[presto@en01 hadoopsw]$ sudo mv /usr/hadoopsw/prestoadmin/presto-cli-0.167-t.0.2-executable.jar /usr/hadoopsw/prestoadmin/prestocli

[presto@en01 hadoopsw]$ sudo chmod +x /usr/hadoopsw/prestoadmin/prestocli

5- Run a  query with cluster 
[presto@en01 hadoopsw]$ /usr/hadoopsw/prestoadmin/prestocli


presto> select count(*) from hive.scott.wordcount;
   _col0
-----------
 295379535
(1 row)

Query 20170619_133900_00026_4hzf9, FINISHED, 3 nodes
Splits: 384 total, 384 done (100.00%)
0:22 [295M rows, 16.6GB] [13.7M rows/s, 789MB/s]

The same query took 2:34 with single worker and with 3 workers it took only 22 seconds.

Performance comparison with Hive

I've an external table in Hive with 117GB data with 2 Billion rows approx, just ran the simple count query both in Hive and Presto (with 3 nodes/workers) and observed the difference, its significant. I also observed CPU usage was much higher when running query in Hive.

External Table Location and size
[hdpsysuser@nn01 ~]$ hdfs dfs -du -h /userdata/wordcount
16.6 G  /userdata/wordcount/wc.txt
99.6 G  /userdata/wordcount/wc5-6-7-8-9-10.txt
1.7 M   /userdata/wordcount/wcs.txt


hive (scott)> select count(*) from wordcount;
...
...
2017-06-21 11:44:02,692 Stage-1 map = 0%,  reduce = 0%
2017-06-21 11:44:13,133 Stage-1 map = 1%,  reduce = 0%, Cumulative CPU 30.19 sec
...
...
2017-06-21 11:47:40,280 Stage-1 map = 98%,  reduce = 33%, Cumulative CPU 3769.99 sec
2017-06-21 11:47:41,306 Stage-1 map = 99%,  reduce = 33%, Cumulative CPU 3794.59 sec
2017-06-21 11:47:47,463 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 3841.08 sec
2017-06-21 11:47:48,488 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3843.48 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 4 minutes 3 seconds 480 msec
Ended Job = job_1494154575009_0334
MapReduce Jobs Launched:
Stage-Stage-1: Map: 465  Reduce: 1   Cumulative CPU: 3843.48 sec   HDFS Read: 124809831211 HDFS Write: 110 SUCCESS
Total MapReduce CPU Time Spent: 0 days 1 hours 4 minutes 3 seconds 480 msec
OK
c0
2067479535
Time taken: 238.152 seconds, Fetched: 1 row(s)

Hive query took about 4 minutes to execute.

presto:scott> select count(*) from wordcount;
   _col0
------------
 2067479535
(1 row)

Query 20170621_085324_00006_4hzf9, FINISHED, 3 nodes
Splits: 1,979 total, 1,979 done (100.00%)
1:57 [2.07B rows, 116GB] [17.6M rows/s, 1016MB/s]

Presto query took 1 minute 57 seconds

Now test with a small table (only 16 rows) using order by clause
hive (scott)> select * from emp order by job desc;
...
...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 8.64 sec   HDFS Read: 10469 HDFS Write: 942 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 640 msec
OK
emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
7654    MARTIN  SALESMAN        7698    28-SEP-81       1250    1400    30
7499    ALLEN   SALESMAN        7698    20-FEB-81       1600    300     30
7844    TURNER  SALESMAN        7698    08-SEP-81       1500    0       30
7521    WARD    SALESMAN        7698    22-FEB-81       1250    500     30
7839    KING    PRESIDENT               17-NOV-81       5000            10
7698    BLAKE   MANAGER 7839    01-MAY-81       2850            30
7566    JONES   MANAGER 7839    02-APR-81       2975            20
7782    CLARK   MANAGER 7839    09-JUN-81       2450            10
7934    MILLER  CLERK   7782    23-JAN-82       1300            10
7900    JAMES   CLERK   7698    03-DEC-81       950             30
7876    ADAMS   CLERK   7788    23-MAY-87       1100            20
7369    SMITH   CLERK   7902    17-DEC-80       800             20
7902    FORD    ANALYST 7566    03-DEC-81       3000            20
7788    SCOTT   ANALYST 7566    19-APR-87       3000            20
2222    Tanvir  NULL    NULL    NULL    NULL    NULL    NULL
3       Tanvir1 NULL    NULL    NULL    NULL    NULL    NULL
Time taken: 22.688 seconds, Fetched: 16 row(s)

presto:scott> select * from emp order by job desc;
 empno |  ename  |    job    | mgr  | hiredate  | sal  | comm | deptno
-------+---------+-----------+------+-----------+------+------+--------
 7654  | MARTIN  | SALESMAN  | 7698 | 28-SEP-81 | 1250 | 1400 | 30
 7499  | ALLEN   | SALESMAN  | 7698 | 20-FEB-81 | 1600 | 300  | 30
 7521  | WARD    | SALESMAN  | 7698 | 22-FEB-81 | 1250 | 500  | 30
 7844  | TURNER  | SALESMAN  | 7698 | 08-SEP-81 | 1500 | 0    | 30
 7839  | KING    | PRESIDENT |      | 17-NOV-81 | 5000 |      | 10
 7566  | JONES   | MANAGER   | 7839 | 02-APR-81 | 2975 |      | 20
 7698  | BLAKE   | MANAGER   | 7839 | 01-MAY-81 | 2850 |      | 30
 7782  | CLARK   | MANAGER   | 7839 | 09-JUN-81 | 2450 |      | 10
 7934  | MILLER  | CLERK     | 7782 | 23-JAN-82 | 1300 |      | 10
 7369  | SMITH   | CLERK     | 7902 | 17-DEC-80 | 800  |      | 20
 7876  | ADAMS   | CLERK     | 7788 | 23-MAY-87 | 1100 |      | 20
 7900  | JAMES   | CLERK     | 7698 | 03-DEC-81 | 950  |      | 30
 7788  | SCOTT   | ANALYST   | 7566 | 19-APR-87 | 3000 |      | 20
 7902  | FORD    | ANALYST   | 7566 | 03-DEC-81 | 3000 |      | 20
 2222  | Tanvir  | NULL      | NULL | NULL      | NULL | NULL | NULL
 3     | Tanvir1 | NULL      | NULL | NULL      | NULL | NULL | NULL
(16 rows)

Query 20170621_090214_00007_4hzf9, FINISHED, 3 nodes
Splits: 20 total, 20 done (100.00%)
0:00 [16 rows, 687B] [124 rows/s, 5.2KB/s]

Same query on same data using same machines tool 22 seconds in hive and 0 second in Presto.

No comments: