Getting Started with TPC Benchmarking on Google Cloud

Step 1: Prepare your environment.

In addition to Hadoop and Hive, ensure that gcc is installed and available on your system path before you begin. If your system does not have it, install it using the following command.

Command: sudo yum install gcc*


Step 2: Decide which test suite(s) you want to use.

hive-testbench comes with data generators and sample queries based on both the TPC-DS and TPC-H benchmarks. You can use either or both of these benchmarks for experimentation. More information about them can be found at the following link.


http://hortonworks.com/blog/benchmarking-apache-hive-13-enterprise-hadoop/

 

Step 3: Download the code for hive-testbench from GitHub.

 

  1. If git is not installed on your machine, use the following command to install it.

    Command: sudo yum install git


  2. If git is already installed, skip the previous step and download the code for hive-testbench using the following command.

    Command: git clone https://github.com/hortonworks/hive-testbench

     


     

Step 4: Change the directory to hive-testbench.

 

Command: cd hive-testbench

Check the list of files in hive-testbench:

Command: ls -l

 


Step 5: Compile and package the appropriate data generator.

  1. For TPC-DS, ./tpcds-build.sh downloads, compiles and packages the TPC-DS data generator.

    Command: ./tpcds-build.sh



  2. For TPC-H, ./tpch-build.sh downloads, compiles and packages the TPC-H data generator.

    Command: ./tpch-build.sh



Step 6: Set up the MapReduce path.

  1. Change from hive-testbench back to your home directory.

    Command: cd


  2. Change the user to root.

    Command: sudo su


  3. Create the directory /hdp/apps/2.2.4.2-2/mapreduce in HDFS using the following command.

    Command: sudo -u hdfs hdfs dfs -mkdir -p /hdp/apps/2.2.4.2-2/mapreduce/


  4. Copy the file /usr/hdp/2.2.4.2-2/hadoop/mapreduce.tar.gz from the local filesystem to /hdp/apps/2.2.4.2-2/mapreduce/ in HDFS.

    Command: sudo -u hdfs hdfs dfs -put /usr/hdp/2.2.4.2-2/hadoop/mapreduce.tar.gz /hdp/apps/2.2.4.2-2/mapreduce/


  5. Copy the file /usr/hdp/2.2.4.2-2/hadoop-mapreduce/hadoop-streaming.jar from the local filesystem to /hdp/apps/2.2.4.2-2/mapreduce/ in HDFS.

    Command: sudo -u hdfs hdfs dfs -put /usr/hdp/2.2.4.2-2/hadoop-mapreduce/hadoop-streaming.jar /hdp/apps/2.2.4.2-2/mapreduce/

     


     

  6. Change the ownership and permissions for /hdp/apps/2.2.4.2-2/mapreduce.
    1. Change the ownership of /hdp.

    Command: sudo -u hdfs hdfs dfs -chown -R hdfs:hadoop /hdp


    2. Change the permissions for /hdp/apps/2.2.4.2-2/mapreduce.

    Command: sudo -u hdfs hdfs dfs -chmod -R 555 /hdp/apps/2.2.4.2-2/mapreduce


    3. Change the permissions for /hdp/apps/2.2.4.2-2/mapreduce/mapreduce.tar.gz.

      Command: sudo -u hdfs hdfs dfs -chmod -R 444 /hdp/apps/2.2.4.2-2/mapreduce/mapreduce.tar.gz


  7. Exit from the root user.

    Command: exit
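The HDFS commands in this step can be collected into a single script. The sketch below is a dry run: the `run` helper (my own addition, not part of hive-testbench) only prints each command so you can review the sequence; on a real HDP 2.2.4.2-2 cluster you would change it to execute the commands instead.

```shell
#!/bin/sh
# Dry-run sketch of the Step 6 HDFS setup for HDP 2.2.4.2-2.
# `run` prints each command instead of executing it; change the echo
# to "$@" to actually run the commands on a cluster.
HDP_VER=2.2.4.2-2
run() { echo "+ $*"; }

run sudo -u hdfs hdfs dfs -mkdir -p /hdp/apps/$HDP_VER/mapreduce/
run sudo -u hdfs hdfs dfs -put /usr/hdp/$HDP_VER/hadoop/mapreduce.tar.gz /hdp/apps/$HDP_VER/mapreduce/
run sudo -u hdfs hdfs dfs -put /usr/hdp/$HDP_VER/hadoop-mapreduce/hadoop-streaming.jar /hdp/apps/$HDP_VER/mapreduce/
run sudo -u hdfs hdfs dfs -chown -R hdfs:hadoop /hdp
run sudo -u hdfs hdfs dfs -chmod -R 555 /hdp/apps/$HDP_VER/mapreduce
run sudo -u hdfs hdfs dfs -chmod -R 444 /hdp/apps/$HDP_VER/mapreduce/mapreduce.tar.gz
```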


Step 7: Decide how much data you want to generate.

You need to decide on a “Scale Factor” which represents how much data you will generate. Scale Factor roughly translates to gigabytes, so a Scale Factor of 100 is about 100 gigabytes and one terabyte is Scale Factor 1000. Decide how much data you want and keep it in mind for the next step. If you have a cluster of 4-10 nodes or just want to experiment at a smaller scale, scale 1000 (1 TB) of data is a good starting point. If you have a large cluster, you may want to choose Scale 10000 (10 TB) or more. The notion of scale factor is similar between TPC-DS and TPC-H.

If you want to generate a large amount of data, you should use Hive 13 or later. Hive 13 introduced an optimization that allows far more scalable data partitioning. Hive 12 and lower will likely crash if you generate more than a few hundred GB of data and tuning around the problem is difficult. You can generate text or RCFile data in Hive 13 and use it in multiple versions of Hive.

For this exercise, my suggestion is to use a Scale Factor no larger than 10 (about 10 GB); otherwise generating and loading the data will take a very long time.

Step 8: Generate and load the data.

The scripts tpcds-setup.sh and tpch-setup.sh generate and load data for TPC-DS and TPC-H, respectively. General usage is tpcds-setup.sh scale_factor [directory] or tpch-setup.sh scale_factor [directory].
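As a quick illustration of that calling convention, the helper below (hypothetical, not part of hive-testbench) just assembles the command line for a given benchmark, scale factor, and optional data directory:

```shell
# Sketch: assemble the setup command line for a benchmark ("tpcds" or
# "tpch"), a scale factor, and an optional target directory. Only the
# two *-setup.sh scripts themselves come from hive-testbench.
build_setup_cmd() {
  benchmark=$1; scale=$2; dir=$3
  if [ -n "$dir" ]; then
    echo "./${benchmark}-setup.sh ${scale} ${dir}"
  else
    echo "./${benchmark}-setup.sh ${scale}"
  fi
}

build_setup_cmd tpcds 10              # ./tpcds-setup.sh 10
build_setup_cmd tpch 1000 /tmp/tpch   # ./tpch-setup.sh 1000 /tmp/tpch
```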

  1. Change the directory to hive-testbench.

    Command: cd hive-testbench


  2. Load 10 GB of data for TPC-DS.

    Command: ./tpcds-setup.sh 10

    This generates the table data and loads it into a database.


     

  3. Load 10 GB of data for TPC-H.

    Command: ./tpch-setup.sh 10

    This generates the table data and loads it into a database.


 

 

 

Step 9: Running the Queries.

More than 50 sample TPC-DS queries and all TPC-H queries are included to try. You can use hive, beeline, or the SQL tool of your choice. Note that the database is named based on the Data Scale chosen in step 7. At Data Scale 10, your database will be named tpcds_bin_partitioned_orc_10; at Data Scale 1000 it would be named tpcds_bin_partitioned_orc_1000. You can always run the following command from within Hive to get a list of available databases.

Command: show databases;
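Equivalently, you can derive the database name from the Scale Factor directly. The helper below is just an illustration of the naming scheme described above, not something shipped with hive-testbench:

```shell
# Sketch: map a Scale Factor to the TPC-DS database name used by
# hive-testbench (binary partitioned ORC format).
tpcds_db_name() { echo "tpcds_bin_partitioned_orc_$1"; }

tpcds_db_name 10     # tpcds_bin_partitioned_orc_10
tpcds_db_name 1000   # tpcds_bin_partitioned_orc_1000
```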

  1. Running a sample query in TPC-DS.

     

    1. Change the directory to sample-queries-tpcds.

      Command: cd
      Command: cd hive-testbench
      Command: cd sample-queries-tpcds

      You can use the following command to list the queries available in TPC-DS:

      Command: ls -l

      To view the source code of a query, use the following command:

      Command: cat query12.sql

    2. Connect to hive.

       

      Command: hive

       


       

    3. Check the list of available databases.

       

      Command: show databases;

       


       

    4. Use the database that contains the table data.

       

      Command: use tpcds_bin_partitioned_orc_10;

       


       

    5. Use the following command to execute the query.

      Command: source query15.sql;

      Description of query15.sql: Report the total catalog sales for customers in selected geographical regions or who made large purchases for a given year and quarter.

      The description of each query can be found at the following link:

      http://www.slideshare.net/hortonworks/apache-hive-013-performance-benchmarks

       



        

     


     

    In a similar way, you can execute all of the queries in TPC-DS.
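To run them all in one pass, a small loop works well. The sketch below only prints the hive invocations (pipe its output to sh from inside sample-queries-tpcds to actually execute them), and assumes the Scale Factor 10 database from the earlier steps.

```shell
# Sketch: print one hive invocation per TPC-DS sample query. Pipe the
# output to `sh` to run the queries. query12.sql and query15.sql are the
# examples shown above; use query*.sql in the real directory for the full set.
DB=tpcds_bin_partitioned_orc_10
for q in query12.sql query15.sql; do
  echo "hive --database $DB -f $q"
done
```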

     

  2. Running a sample query in TPC-H.

     

    1. Change the directory to sample-queries-tpch.

      Command: cd
      Command: cd hive-testbench
      Command: cd sample-queries-tpch

      You can use the following command to list the queries available in TPC-H:

      Command: ls -l

      To view the source code of a query, use the following command:

      Command: cat tpch_query1.sql


    2. Connect to Hive.

     

    Command: hive

     


     

    3. Check the list of available databases.

     

    Command: show databases;

     


     

    4. Use the database that contains the table data.

     

    Command: use tpch_flat_orc_10;

     


     

    5. Use the following command to execute the query.

     

    Command: source tpch_query1.sql;

     


     


 

In a similar way, you can execute all of the queries in TPC-H.
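The same looping idea applies to TPC-H; only the database name and query file names change. As before, this sketch just prints the invocations (pipe to sh from inside sample-queries-tpch to execute) and assumes Scale Factor 10.

```shell
# Sketch: print one hive invocation per TPC-H sample query (flat ORC
# schema at scale 10). Extend the list to cover all tpch_query*.sql files.
DB=tpch_flat_orc_10
for q in tpch_query1.sql tpch_query2.sql; do
  echo "hive --database $DB -f $q"
done
```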
