M. Hive Installation and Configuration
- Apache Hive is a distributed, fault-tolerant data warehouse system that enables large-scale analytics. It facilitates reading, writing, and managing petabytes of data residing in distributed storage using SQL
- Read more on Hive at https://cwiki.apache.org/confluence/display/Hive/GettingStarted and https://cwiki.apache.org/confluence/display/Hive/Home
M1. Install and Configure Hive
- Log in as hduser. Ensure that the SSH, DFS, and YARN services are started sequentially, then examine them using the jps command. A minimal sketch follows
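This sketch assumes an Ubuntu-style ssh service (and sudo rights) and the ~/hadoop3 layout from the earlier sections; the tips in Section M1 below show the same DFS/YARN start/stop sequence:
$ sudo service ssh start
$ cd ~/hadoop3
$ sbin/start-dfs.sh
$ sbin/start-yarn.sh
$ jps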
- Install Hive
$ cd ~
$ wget https://dlcdn.apache.org/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz
$ tar -xvzf apache-hive-2.3.9-bin.tar.gz
$ mv apache-hive-2.3.9-bin hive
- Edit the file ~/.bashrc with the following lines
export HADOOP_MAPRED_HOME=/home/hduser/hadoop3
export HADOOP_COMMON_HOME=/home/hduser/hadoop3
export HADOOP_HDFS_HOME=/home/hduser/hadoop3
export YARN_HOME=/home/hduser/hadoop3
export HIVE_HOME=/home/hduser/hive
export PATH=$HIVE_HOME/bin:$PATH
export HIVE_CONF_DIR=/home/hduser/hive/conf
export HADOOP_CLASSPATH=/home/hduser/hadoop3/share/hadoop/common/*:/home/hduser/hadoop3/share/hadoop/common/lib/*:/home/hduser/hadoop3/share/hadoop/client/*:/home/hduser/hadoop3/share/hadoop/hdfs/lib/*:/home/hduser/hadoop3/share/hadoop/hdfs/*:/home/hduser/hadoop3/share/hadoop/tools/lib/*:/home/hduser/hadoop3/share/hadoop/mapreduce/*:/home/hduser/hadoop3/share/hadoop/yarn/*:/home/hduser/hadoop3/share/hadoop/yarn/lib/*:/home/hduser/hadoop3/share/hadoop/yarn/timelineservice/*:/home/hduser/hadoop3/share/hadoop/yarn/timelineservice/lib/*:/home/hduser/hadoop3/share/hadoop/yarn/csi/lib/*:/home/hduser/hadoop3/share/hadoop/yarn/csi/*
export HIVE_CLASSPATH=/home/hduser/hive/lib/*
- Reload the environment
$ source ~/.bashrc
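Optionally, confirm the new environment took effect (a quick check, assuming the Hadoop environment from the earlier sections is intact; hive --version prints the Hive build information):
$ echo $HIVE_HOME
$ hive --version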
- Create the file ~/hive/conf/hive-site.xml from its template
$ cd ~/hive
$ cp conf/hive-default.xml.template conf/hive-site.xml
- Edit the file ~/hive/conf/hive-site.xml
- Replace all occurrences of ${system:java.io.tmpdir} with /tmp/hive; this is the new location where Hive stores all of its temporary files
- Replace all occurrences of ${system:user.name} with ${user.name}, which resolves to the user name you log in with
- Replace all occurrences of /user/hive with /user/hduser
You may want to use nano's search (Ctrl-W, then Enter) to find each targeted keyword and replace the occurrences one by one, or use the sed sketch below.
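A minimal sed alternative that performs all three replacements in one pass (GNU sed assumed; -i.bak keeps the original file as hive-site.xml.bak, so verify the result before proceeding):
$ cd ~/hive/conf
$ sed -i.bak -e 's|${system:java.io.tmpdir}|/tmp/hive|g' -e 's|${system:user.name}|${user.name}|g' -e 's|/user/hive|/user/hduser|g' hive-site.xml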
- Edit the file ~/hive/bin/hive-config.sh with the following lines
export HADOOP_HOME=/home/hduser/hadoop3
export SPARK_HOME=/home/hduser/spark
export HBASE_HOME=/home/hduser/hbase
export HADOOP_MAPRED_HOME=/home/hduser/hadoop3
export HADOOP_COMMON_HOME=/home/hduser/hadoop3
export HADOOP_HDFS_HOME=/home/hduser/hadoop3
export YARN_HOME=/home/hduser/hadoop3
export HIVE_HOME=/home/hduser/hive
export HIVE_CONF_DIR=/home/hduser/hive/conf
export HADOOP_CLASSPATH=/home/hduser/hadoop3/share/hadoop/common/*:/home/hduser/hadoop3/share/hadoop/common/lib/*:/home/hduser/hadoop3/share/hadoop/client/*:/home/hduser/hadoop3/share/hadoop/hdfs/lib/*:/home/hduser/hadoop3/share/hadoop/hdfs/*:/home/hduser/hadoop3/share/hadoop/tools/lib/*:/home/hduser/hadoop3/share/hadoop/mapreduce/*:/home/hduser/hadoop3/share/hadoop/yarn/*:/home/hduser/hadoop3/share/hadoop/yarn/lib/*:/home/hduser/hadoop3/share/hadoop/yarn/timelineservice/*:/home/hduser/hadoop3/share/hadoop/yarn/timelineservice/lib/*:/home/hduser/hadoop3/share/hadoop/yarn/csi/lib/*:/home/hduser/hadoop3/share/hadoop/yarn/csi/*
export HIVE_CLASSPATH=/home/hduser/hive/lib/*
- Create the file ~/hive/conf/hive-env.sh from its template
$ cd ~/hive
$ cp ~/hive/conf/hive-env.sh.template ~/hive/conf/hive-env.sh
- Edit the file ~/hive/conf/hive-env.sh with the following lines
export HADOOP_HOME=/home/hduser/hadoop3
export SPARK_HOME=/home/hduser/spark
export HBASE_HOME=/home/hduser/hbase
export HADOOP_MAPRED_HOME=/home/hduser/hadoop3
export HADOOP_COMMON_HOME=/home/hduser/hadoop3
export HADOOP_HDFS_HOME=/home/hduser/hadoop3
export YARN_HOME=/home/hduser/hadoop3
export HIVE_HOME=/home/hduser/hive
export HIVE_CONF_DIR=/home/hduser/hive/conf
export HADOOP_CLASSPATH=/home/hduser/hadoop3/share/hadoop/common/*:/home/hduser/hadoop3/share/hadoop/common/lib/*:/home/hduser/hadoop3/share/hadoop/client/*:/home/hduser/hadoop3/share/hadoop/hdfs/lib/*:/home/hduser/hadoop3/share/hadoop/hdfs/*:/home/hduser/hadoop3/share/hadoop/tools/lib/*:/home/hduser/hadoop3/share/hadoop/mapreduce/*:/home/hduser/hadoop3/share/hadoop/yarn/*:/home/hduser/hadoop3/share/hadoop/yarn/lib/*:/home/hduser/hadoop3/share/hadoop/yarn/timelineservice/*:/home/hduser/hadoop3/share/hadoop/yarn/timelineservice/lib/*:/home/hduser/hadoop3/share/hadoop/yarn/csi/lib/*:/home/hduser/hadoop3/share/hadoop/yarn/csi/*
export HIVE_CLASSPATH=/home/hduser/hive/lib/*
- Start both the DFS and YARN services before creating the required Hive directories in HDFS. Stop both services once the directories have been created (an optional verification sketch follows the tips below)
$ hdfs dfs -mkdir /tmp
$ hdfs dfs -mkdir /tmp/hduser
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/hduser
$ hdfs dfs -mkdir /user/hduser/warehouse
$ hdfs dfs -mkdir /user/hduser/lib
$ hdfs dfs -chmod g+w /tmp
$ hdfs dfs -chmod g+w /tmp/hduser
$ hdfs dfs -chmod g+w /user
$ hdfs dfs -chmod g+w /user/hduser
$ hdfs dfs -chmod g+w /user/hduser/warehouse
$ hdfs dfs -chmod 777 /user/hduser/lib
Tips to start and stop the DFS and YARN services
$ jps
$ cd ~/hadoop3
$ sbin/start-dfs.sh
$ sbin/start-yarn.sh
$ jps
$ sbin/stop-yarn.sh
$ sbin/stop-dfs.sh
$ jps
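Before stopping the services, you can optionally verify the new directories and their permissions (a quick check using standard HDFS shell commands; -ls -R lists recursively):
$ hdfs dfs -ls -R /user/hduser
$ hdfs dfs -ls /tmp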
- Change the access mode of Hadoop's data and name node directories, which were created in Section E
$ chmod -R g+w /home/hduser/hadoopData
$ chmod -R g+w /home/hduser/hadoopName
- Find the targeted jar file. Delete Hive's copy of log4j-slf4j-impl-2.6.2.jar if multiple copies are found (optional)
$ cd ~
$ find . -name 'log4j-slf4j-impl-*.jar'
$ rm ~/hive/lib/log4j-slf4j-impl-2.6.2.jar
We delete the file log4j-slf4j-impl-2.6.2.jar because a similar file is also present in the Hadoop directory, and the duplicate SLF4J binding occasionally causes errors
M2. Install and Configure Derby
- Install and configure Derby
$ cd ~
$ wget https://archive.apache.org/dist/db/derby/db-derby-10.14.2.0/db-derby-10.14.2.0-bin.tar.gz
$ tar -xzf db-derby-10.14.2.0-bin.tar.gz
$ mv db-derby-10.14.2.0-bin derby
$ mkdir derby/data
- Edit the file ~/.bashrc with the following lines
export DERBY_INSTALL=/home/hduser/derby
export DERBY_HOME=/home/hduser/derby
export PATH=$PATH:/home/hduser/derby/bin
export CLASSPATH=$CLASSPATH:/home/hduser/derby/lib/derby.jar:/home/hduser/derby/lib/derbytools.jar
- Reload the environment
$ source ~/.bashrc
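Optionally, confirm the Derby installation with the bundled sysinfo tool (a quick check; it prints the Derby version and classpath details):
$ ~/derby/bin/sysinfo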
- Configure Hive to use networked Derby by changing the following property values in the file ~/hive/conf/hive-site.xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/tmp/hive/${user.name}</value>
  <description>Local scratch space for Hive jobs</description>
</property>
<property>
  <name>hive.downloaded.resources.dir</name>
  <value>/tmp/hive/${hive.session.id}_resources</value>
  <description>Temporary local directory for added resources in the remote file system</description>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/tmp/hive/${user.name}</value>
  <description>Location of Hive run time structured log file</description>
</property>
<property>
  <name>hive.server2.logging.operation.log.location</name>
  <value>/tmp/hive/${user.name}/operation_logs</value>
  <description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
- To allow Hive to connect to Derby through JDBC, add the following properties inside the <configuration> element of the file ~/hadoop3/etc/hadoop/core-site.xml of your Hadoop installation. Then restart Hadoop, i.e., the DFS and YARN services. Note that Hive SHOULD NOT be running at this step
<property>
  <name>hadoop.proxyuser.hduser.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hduser.groups</name>
  <value>*</value>
</property>
- To allow PySpark to connect to the Hive cluster, copy (overwriting) the following files into the Spark installation
$ cp -f ~/hadoop3/etc/hadoop/core-site.xml ~/spark/conf/
$ cp -f ~/hadoop3/etc/hadoop/hdfs-site.xml ~/spark/conf/
$ cp -f ~/hive/conf/hive-site.xml ~/spark/conf/
$ cp -f ~/derby/lib/derbyclient.jar ~/spark/jars/
$ cp -f ~/derby/lib/derbytools.jar ~/spark/jars/
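Once the Derby network server is running and the Metastore schema is initialized (both done later in this section), you can return here to verify that Spark picks up the Hive configuration (a sketch, assuming Spark is installed at ~/spark; the listed databases should match what Hive reports):
$ ~/spark/bin/pyspark
>>> spark.sql("SHOW DATABASES").show()
>>> exit()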
Attention: ensure Hadoop, Spark, and Hive are successfully installed BEFORE proceeding with this step. You need to redo this step if ANY of the involved files change in the future.
- Use the existing session (named session 1) to start the DFS, YARN, Zookeeper, Kafka, and HBase services before running Derby as follows. Note that you need to press the Enter key after the second command executes. Derby creates its databases in the current directory by default
$ cd ~/derby/data
$ nohup ~/derby/bin/startNetworkServer -h 0.0.0.0 &
$ jps
25410 NetworkServerControl
8711 ...
9561 ...
26699 Jps
9115 ...
8428 ...
To stop Derby (replace 25410 with your process id):
$ kill -9 25410
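To sanity-check that the network server is reachable, you can optionally connect with Derby's ij tool (a sketch; the URL matches the ConnectionURL configured in hive-site.xml, and assumes the Derby client jars in ~/derby/lib are on the classpath):
$ ~/derby/bin/ij
ij> connect 'jdbc:derby://localhost:1527/metastore_db;create=true';
ij> exit;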
Tips to start the DFS, YARN, Zookeeper, Kafka, HBase, and HappyBase (optional) services. Note that a total of eleven (11) processes can be observed eventually, including HappyBase and Jps
$ jps
$ cd ~/hadoop3
$ sbin/start-dfs.sh
$ sbin/start-yarn.sh
$ jps
$ cd ~/kafka
$ bin/zookeeper-server-start.sh config/zookeeper.properties &
$ bin/kafka-server-start.sh config/server.properties &
$ jps
$ cd ~/hbase
$ bin/start-hbase.sh
$ jps
$ bin/hbase thrift start -p 9090 &
$ jps
- Use session 1 to run the following command to initialize Derby as the Metastore database for Hive. It might take a few minutes; please be patient while you wait for it to complete
$ cd ~
$ java -cp /home/hduser/hive/lib/*:/home/hduser/spark/jars/*:/home/hduser/hadoop3/share/hadoop/common/*:/home/hduser/hadoop3/share/hadoop/common/lib/*:/home/hduser/hadoop3/share/hadoop/client/*:/home/hduser/hadoop3/share/hadoop/hdfs/lib/*:/home/hduser/hadoop3/share/hadoop/hdfs/*:/home/hduser/hadoop3/share/hadoop/tools/lib/*:/home/hduser/hadoop3/share/hadoop/mapreduce/*:/home/hduser/hadoop3/share/hadoop/yarn/*:/home/hduser/hadoop3/share/hadoop/yarn/lib/*:/home/hduser/hadoop3/share/hadoop/yarn/timelineservice/*:/home/hduser/hadoop3/share/hadoop/yarn/timelineservice/lib/*:/home/hduser/hadoop3/share/hadoop/yarn/csi/lib/*:/home/hduser/hadoop3/share/hadoop/yarn/csi/* org.apache.hive.beeline.HiveSchemaTool -initSchema -dbType derby
In the setup of Derby 10.14.2.0 with Hive 2.3.9, you need to stop and restart Derby after this step
Note that in the setup of Hive 3.1.3, you may find corrupted characters in the file ~/hive/conf/hive-site.xml, inherited from its template (duplicated in Section M1), particularly under the description of 'hive.txn.xlock.iow'. Just delete them if there is an error, and re-run the above command
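Once initialization succeeds, you can optionally confirm the Metastore schema version with Hive's schematool (a quick check; if its bundled classpath hits the same SLF4J conflict noted earlier, reuse the java -cp invocation above with -info in place of -initSchema):
$ cd ~/hive
$ bin/schematool -dbType derby -info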
M3. Running HiveServer2 and Beeline
- HiveServer2 (HS2) is a service that enables clients to execute queries against Hive. HS2 supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC. Read more on HiveServer2 at https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Overview
- HiveServer2 supports Beeline, a command shell that works with HiveServer2. It is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). Read more on Beeline at https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
- Assume the other necessary services are already running in session 1.
- Use another session (named session 2) to edit the file ~/hadoop3/etc/hadoop/hadoop-env.sh, adding the following additional paths to HADOOP_CLASSPATH
export HADOOP_CLASSPATH=/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar:/home/hduser/hive/lib/*:/home/hduser/hadoop3/share/hadoop/common/*:/home/hduser/hadoop3/share/hadoop/common/lib/*:/home/hduser/hadoop3/share/hadoop/client/*
Comment out the above line after stopping HiveServer2 to prevent HBase errors in the next round.
- Switch back to shell session 1 and run HiveServer2 there. Leave the session running and DO NOT CLOSE it after executing the following command.
$ cd ~/hive
$ bin/hiveserver2
To stop HiveServer2, use Ctrl-C.
- Switch back to shell session 2, and ensure one additional process, RunJar, is observed via the jps command before running the Beeline shell.
$ jps
$ cd ~/hive
$ bin/beeline
Beeline version 2.3.9 by Apache Hive
beeline> !connect jdbc:hive2://
Connecting to jdbc:hive2://
Enter username for jdbc:hive2://: APP
Enter password for jdbc:hive2://: mine
Connected to: Apache Hive (version 2.3.9)
Driver: Hive JDBC (version 2.3.9)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://>
To stop Beeline, use either the !q command (after login) or Ctrl-C (before login). Note that you must always terminate any running RunJar process(es) BEFORE re-running HiveServer2.
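The !connect jdbc:hive2:// URL above uses embedded mode. Assuming the default hive.server2.thrift.port (10000), Beeline can also connect to the running HiveServer2 over the network, which is closer to how external JDBC clients connect:
beeline> !connect jdbc:hive2://localhost:10000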
M4. Quick start with Hive
- Create a new database
0: jdbc:hive2://> show databases;
0: jdbc:hive2://> create database choojunhive;
0: jdbc:hive2://> show databases;
- Create a new table
0: jdbc:hive2://> CREATE TABLE IF NOT EXISTS choojunhive.employee (id int, name string, age int, gender string) COMMENT 'Employee Table' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
0: jdbc:hive2://> DESCRIBE choojunhive.employee;
- Insert and retrieve records
0: jdbc:hive2://> INSERT INTO choojunhive.employee VALUES (1,'Ahmad',19,'M');
0: jdbc:hive2://> INSERT INTO choojunhive.employee VALUES (2,'Baba',20,'M');
0: jdbc:hive2://> INSERT INTO choojunhive.employee VALUES (3,'Chong',21,'M');
0: jdbc:hive2://> select * from choojunhive.employee;
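Because the table was declared with FIELDS TERMINATED BY ',', you can also bulk-load a comma-separated file instead of inserting row by row (a sketch; /tmp/employees.csv is a hypothetical local file created here for illustration):
$ printf '4,Devi,22,F\n5,Elena,23,F\n' > /tmp/employees.csv
0: jdbc:hive2://> LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE choojunhive.employee;
0: jdbc:hive2://> select * from choojunhive.employee;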
- Try out the Guided Tutorials as follows
- https://cwiki.apache.org/confluence/display/hive/tutorial
- https://sparkbyexamples.com/apache-hive-tutorial/
- https://www.simplilearn.com/tutorials/hadoop-tutorial/hive