E. Hadoop Installation and Configuration
- Throughout our practical, we will assume that the Hadoop user name is hduser.
- The first time that you launch your WSL Linux distro, you will be prompted to create a default user account. You may choose to
- name the default user account as tarumt, and
- create a separate user account named hduser.
- The NameNode is the machine that runs the GNU/Linux operating system and the NameNode software. It acts as the master server of the Hadoop Distributed File System (HDFS): it manages the file system namespace, controls clients' access to files, and oversees file operations such as renaming, opening, and closing files. A DataNode is a machine that runs the GNU/Linux operating system and the DataNode software. Every worker node in an HDFS cluster hosts a DataNode. DataNodes manage the storage attached to their node: they serve read and write requests from clients, and perform block creation, deletion, and replication when instructed by the NameNode.
- An HDFS cluster consists of master services (the NameNode and secondary NameNode) and worker services (the DataNodes). The NameNode and secondary NameNode manage the HDFS metadata, while the DataNodes host the underlying HDFS data. The NameNode tracks which DataNodes hold the contents of a given file. HDFS divides files into blocks and stores each block on a DataNode; multiple DataNodes are linked to the cluster, and the NameNode distributes replicas of these data blocks across them. The NameNode also tells a client or application which DataNodes to contact to locate the requested data.
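- Once the pseudo-distributed cluster from E7 is running and files have been uploaded (E8), you can optionally inspect how HDFS splits a file into replicated blocks with the fsck tool, for example:
$ hdfs fsck /user/hduser -files -blocks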
- Read more on Hadoop at URL https://en.wikipedia.org/wiki/Apache_Hadoop and https://hadoop.apache.org/
E1. Setup User Environment
- Create a new group named hadoop
$ sudo addgroup hadoop
- Create a new user account named hduser (if applicable)
Add the new user:
$ sudo adduser hduser
$ sudo adduser student
- Grant the user sudo privileges
$ sudo usermod -aG sudo hduser
$ sudo usermod -aG sudo student
- Add hduser to the hadoop group
$ sudo usermod -a -G hadoop hduser
$ sudo usermod -a -G hadoop student
- Switch to the user account hduser (if applicable)
$ su - hduser
$ su - student
- If you want to go back to your original user session
$ exit
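- Optionally, confirm that hduser belongs to the sudo and hadoop groups before proceeding:
$ id hduser
$ groups hduser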
E2. Setup Operating System Environment
- Reboot/terminate Ubuntu in WSL, and run the following commands from Ubuntu with user hduser
- Check if ssh has been installed
$ service ssh status
- If ssh has not been installed, install ssh
$ sudo apt update
$ sudo apt upgrade
$ sudo apt install ssh
- Install pdsh
$ sudo apt update
$ sudo apt upgrade
$ sudo apt install pdsh
- Install Python (if necessary)
$ python3 --version
$ sudo apt install software-properties-common
$ sudo apt update
$ sudo apt install python3
$ sudo apt install python3-dev
$ sudo apt install python3-wheel
- Check current python versions and symlinks
$ ls -l /usr/bin/python*
- Set environment variables by editing your ~/.bashrc file (for hduser):
$ nano ~/.bashrc
- In the ~/.bashrc file, add the following line to set Python 3 as the default python version:
alias python=python3
- In ~/.bashrc file, add the following lines at the end of the file based on the python3.x from your installation:
export CPATH=/usr/include/python3.10:$CPATH
export LD_LIBRARY_PATH=/usr/lib:$LD_LIBRARY_PATH
- Save the ~/.bashrc file.
- Source the file:
$ source ~/.bashrc
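- Optionally, confirm that the alias and environment variables took effect in the current shell (python should now report a 3.x version):
$ python --version
$ echo $CPATH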
- Install pip (if necessary)
$ sudo apt update
$ sudo apt install python3-pip
$ pip3 --version
$ sudo pip3 install --upgrade pip
- Check which Java version is installed in the distro
$ java -version
- Install targeted OpenJDK
$ sudo apt-get install openjdk-8-jdk
After this, check if the correct version of Java is installed.
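If more than one JDK is installed, you can check which java binary is currently active and, if needed, switch to the OpenJDK 8 build (the menu entries shown by update-alternatives vary by system):
$ readlink -f $(which java)
$ sudo update-alternatives --config java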
E3. Setup SSH and PDSH
Run the following commands from Ubuntu with user hduser
Secure Shell (SSH), also sometimes called Secure Socket Shell, is a protocol for securely accessing your site’s server over an unsecured network. In other words, it’s a way to safely log in to your server remotely using your preferred command-line interface.
- Check if the SSH service is running
$ service ssh status
- Start SSH
$ sudo service ssh start
Note that WSL occasionally does not start sshd (the SSH daemon) automatically. Start sshd manually if this happens.
- Open the SSH port (if necessary)
$ sudo ufw allow ssh
Ubuntu comes with a firewall configuration tool, known as UFW. If the firewall is enabled on your system, make sure to open the targeted SSH port, i.e. port 22.
- Reload SSH
$ sudo /etc/init.d/ssh reload
- Before modifying pdsh’s default rcmd to ssh, check its current default rcmd (typically rsh):
$ pdsh -q -w localhost
- Modify pdsh’s default rcmd to ssh by adding the following line to the ~/.bashrc file (for hduser):
export PDSH_RCMD_TYPE=ssh
- Run the following command to ensure the settings are applied.
$ source ~/.bashrc
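- Optionally, confirm that pdsh now uses ssh as its rcmd (the exact label in the pdsh -q output may differ slightly between versions):
$ echo $PDSH_RCMD_TYPE
$ pdsh -q -w localhost | grep -i rcmd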
E4. Disable IPv6
Run the following commands from Ubuntu with user hduser
- Edit the /etc/sysctl.conf file with the following command.
$ sudo nano /etc/sysctl.conf
- Add the following lines to the end of the sysctl.conf file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
- To reload the /etc/sysctl.conf settings, issue the following command.
$ sudo sysctl -p
- Check if IPv6 has been successfully disabled
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
0 means IPv6 is enabled; 1 means IPv6 is disabled.
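Alternatively, you can query all three keys in a single command; each should report 1 once the settings are applied:
$ sysctl net.ipv6.conf.all.disable_ipv6 net.ipv6.conf.default.disable_ipv6 net.ipv6.conf.lo.disable_ipv6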
- If IPv6 is still enabled after rebooting, you must carry out the following:
- Create (with root privileges) the file /etc/rc.local
$ sudo nano /etc/rc.local
- Insert the following into the file /etc/rc.local:
#!/bin/bash
/etc/init.d/procps restart
exit 0
- Make the file executable
$ sudo chmod 755 /etc/rc.local
E5. Install Hadoop
- Switch to the hduser account
- Download the appropriate Hadoop binary from the Hadoop releases page.
We will use Hadoop 3.3.6 to avoid problems with HBase in a later practical. Read more at URL https://hbase.apache.org/book.html#hadoop
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Or, you may copy the downloaded tar.gz file manually into the home directory of hduser (/home/hduser/).
- Untar the file
$ tar -xvzf hadoop-3.3.6.tar.gz
- Rename the folder as hadoop3
$ mv hadoop-3.3.6 hadoop3
$ sudo chown -R hduser:hadoop hadoop3
$ sudo chmod g+w -R hadoop3
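- Optionally, confirm that the binaries were unpacked correctly; since the PATH is only set up in E7, use the full path for now:
$ ~/hadoop3/bin/hadoop version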
E6. Configure passphraseless ssh for Hadoop
- Ensure that you are logged in as hduser
- Ensure that you can SSH to localhost in Ubuntu
- To ssh to localhost without a passphrase, run the following command to initialize your private and public keys:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
- Test the configuration
$ ssh localhost
Exit the ssh session by issuing the command exit.
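- Optionally, confirm that key-based login works without any prompt; with BatchMode, ssh fails instead of asking for a password:
$ ssh -o BatchMode=yes localhost 'echo passwordless ssh OK'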
E7. Configure Pseudo-distributed Mode for Hadoop
- Ensure that you are logged in as hduser
- Set up the environment variables in the ~/.bashrc file (for hduser) by adding the following lines to the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hduser/hadoop3
export PATH=$PATH:$HADOOP_HOME/bin
- Source the file:
$ source ~/.bashrc
- Change directory to the Hadoop folder
$ cd hadoop3
- Edit the etc/hadoop/hadoop-env.sh file by adding the following environment variables at the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
- Verify if the installation was successful
$ hadoop
Running the above command should display various options.
- Edit the etc/hadoop/core-site.xml file by adding the following configuration.
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
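Optionally, confirm that Hadoop picks up the new default file system setting; this only reads the configuration, so the cluster does not need to be running. It should print hdfs://localhost:9000.
$ bin/hdfs getconf -confKey fs.defaultFS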
- Edit the etc/hadoop/hdfs-site.xml file by adding the following configuration.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hduser/hadoopData/dfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hduser/hadoopName/dfs/data</value>
    <final>true</final>
  </property>
</configuration>
- Edit the etc/hadoop/mapred-site.xml file by adding the following configuration.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>
- Edit the etc/hadoop/yarn-site.xml file by adding the following configuration.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>127.0.0.1:8032</value>
  </property>
</configuration>
- Format the namenode
$ bin/hdfs namenode -format
Running the above command should display information on formatting the NameNode with the installed Java. You should observe that the NameNode is formatted and the process terminates without any error.
- Start the Distributed File System (DFS) service
$ sbin/start-dfs.sh
Running start-dfs.sh starts the NameNode and DataNode daemons. Check the running Java processes:
$ jps
If the NameNode and DataNode services have started successfully, you should see these four processes:
DataNode
Jps
NameNode
SecondaryNameNode
Open a web browser to view the NameNode web interface; by default it is available at http://localhost:9870/
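If a browser is not convenient from within WSL, you can optionally confirm that the NameNode web interface responds using curl:
$ curl -s http://localhost:9870/ | head -n 5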
- Start the Yet Another Resource Negotiator (YARN) service
$ sbin/start-yarn.sh
Running start-yarn.sh starts the YARN daemons. Check the running Java processes:
$ jps
If the YARN services have started successfully, you should see a total of six processes, including these two:
ResourceManager
NodeManager
Open a web browser to view the ResourceManager web interface; by default it is available at http://localhost:8088/
- Create HDFS directories required to execute MapReduce jobs
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/hduser
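- Optionally, list the new directories to confirm that they were created:
$ hdfs dfs -ls /user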
E8. Run a sample MapReduce Job in Hadoop
- Ensure that you are logged in as hduser
- Copy the input files into the distributed file system
$ hdfs dfs -mkdir input
$ hdfs dfs -put etc/hadoop/*.xml input
- Run a MapReduce job from an example
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar grep input output 'dfs[a-z.]+'
- View the output files on the distributed file system
$ hdfs dfs -cat output/*
Output:
1 dfsadmin
1 dfs.replication
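- Alternatively, you may copy the result from HDFS to the local file system and view it there (output_local below is just an example name for a local directory):
$ hdfs dfs -get output output_local
$ cat output_local/*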
- Stop all the daemons
$ sbin/stop-yarn.sh
$ sbin/stop-dfs.sh
- Log out from the hduser account and your tarumt account.
E9. Attention: At the beginning of all future practical
- Log in as hduser or switch to the hduser account
$ su - hduser
- Start the SSH service
$ sudo service ssh start
- Start the HDFS service
$ sbin/start-dfs.sh
$ jps
You should observe at least four (4) services, including both NameNode and DataNode, as stated in E7. Otherwise, you may need to reformat the HDFS NameNode. Wait for the format process to complete without errors before proceeding. Note that formatting the NameNode will result in the loss of all data files in HDFS. You also need to recreate the directory /user and its sub-directory /user/hduser after starting the HDFS service and before moving on to the next steps.
$ cd ~/hadoop3
$ bin/hdfs namenode -format
$ sbin/start-dfs.sh
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/hduser
- Start the YARN service
$ sbin/start-yarn.sh
$ jps
You should observe at least six (6) services in total; refer to E7.
E10. Attention: At the end of all future practical
- Stop the YARN service
$ sbin/stop-yarn.sh
- Stop the HDFS service
$ sbin/stop-dfs.sh
Issue the command top to check that all expected services have terminated, and press Ctrl-C to exit from top.
- Exit from your user account(s)
$ exit