E. Hadoop Installation and Configuration
- Throughout our practical, we will assume that the Hadoop user name is hduser.
- The first time that you launch your WSL Linux distro, you will be prompted to create a default user account. You may choose to
- name the default user account as tarumt, and
- create a separate user account named hduser.
- The NameNode is the machine that runs the GNU/Linux operating system and the NameNode software. It acts as the master server of the Hadoop Distributed File System (HDFS): it manages the file system namespace, controls clients' access to files, and oversees file operations such as renaming, opening, and closing files. A DataNode is a machine that runs the GNU/Linux operating system and the DataNode software. Every worker node in an HDFS cluster hosts a DataNode. DataNodes manage the storage attached to their node: they serve read and write requests from clients, and perform block creation, deletion, and replication when instructed by the NameNode.
- An HDFS cluster consists of master services (the NameNode and secondary NameNode) and worker services (the DataNodes). The NameNode and secondary NameNode manage the HDFS metadata, while the DataNodes host the underlying HDFS data. The NameNode tracks which DataNodes hold the contents of a given file. HDFS divides files into blocks and stores each block on a DataNode; multiple DataNodes are linked to the cluster, and the NameNode distributes replicas of these data blocks across them. The NameNode also tells a client or application which DataNodes to contact to locate the requested data.
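- Once the pseudo-distributed cluster from E7 is running and files have been uploaded (E8), you can optionally inspect how HDFS splits a file into replicated blocks with the fsck tool, for example:
$ hdfs fsck /user/hduser -files -blocks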
- Read more on Hadoop at URL https://en.wikipedia.org/wiki/Apache_Hadoop and https://hadoop.apache.org/
E1. Setup User Environment
- Create a new group named hadoop
$ sudo addgroup hadoop
- Create a new user account named hduser (if applicable)
Add the new user:
$ sudo adduser hduser
$ sudo adduser student
- Grant the user sudo privileges
$ sudo usermod -aG sudo hduser
$ sudo usermod -aG sudo student
- Add hduser to the hadoop group
$ sudo usermod -a -G hadoop hduser
$ sudo usermod -a -G hadoop student
- Switch to the user account hduser (if applicable)
$ su - hduser
$ su - student
- If you want to go back to your original user session
$ exit
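- Optionally, confirm that hduser belongs to the sudo and hadoop groups before proceeding:
$ id hduser
$ groups hduser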
E2. Setup Operating System Environment
- Reboot/terminate Ubuntu in WSL, and run the following commands from Ubuntu with user hduser
- Check if ssh has been installed
$ service ssh status
- If ssh has not been installed, install ssh
$ sudo apt update
$ sudo apt upgrade
$ sudo apt install ssh
- Install pdsh
$ sudo apt update
$ sudo apt upgrade
$ sudo apt install pdsh
- Install Python (if necessary)
$ python3 --version
$ sudo apt install software-properties-common
$ sudo apt update
$ sudo apt install python3
$ sudo apt install python3-dev
$ sudo apt install python3-wheel
- Check current python versions and symlinks
$ ls -l /usr/bin/python*
- Set environment variables by editing your ~/.bashrc file (for hduser):
$ nano ~/.bashrc
- In the ~/.bashrc file, add the following line to set Python 3 as the default python version:
alias python=python3
- In ~/.bashrc file, add the following lines at the end of the file based on the python3.x from your installation:
export CPATH=/usr/include/python3.10:$CPATH
export LD_LIBRARY_PATH=/usr/lib:$LD_LIBRARY_PATH
- Save the ~/.bashrc file.
- Source the file:
$ source ~/.bashrc
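- Optionally, confirm that the alias and environment variables took effect in the current shell (python should now report a 3.x version):
$ python --version
$ echo $CPATH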
- Install pip (if necessary)
$ sudo apt update
$ sudo apt install python3-pip
$ pip3 --version
$ sudo pip3 install --upgrade pip
- Check which Java version is installed in the distro
$ java -version
- Install targeted OpenJDK
$ sudo apt-get install openjdk-8-jdk
After this, check if the correct version of Java is installed.
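If more than one JDK is installed, you can check which java binary is currently active and, if needed, switch to the OpenJDK 8 build (the menu entries shown by update-alternatives vary by system):
$ readlink -f $(which java)
$ sudo update-alternatives --config java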
E3. Setup SSH and PDSH
Run the following commands from Ubuntu with user hduser
Secure Shell (SSH), also sometimes called Secure Socket Shell, is a protocol for securely accessing your site’s server over an unsecured network. In other words, it’s a way to safely log in to your server remotely using your preferred command-line interface.
- Check if the SSH service is running
$ service ssh status
- Start SSH
$ sudo service ssh start
Note that WSL occasionally does not start sshd (the SSH daemon) automatically. Start sshd manually if this happens.
- Open the SSH port (if necessary)
$ sudo ufw allow ssh
Ubuntu comes with a firewall configuration tool, known as UFW. If the firewall is enabled on your system, make sure to open the targeted SSH port, i.e. port 22.
- Reload SSH
$ sudo /etc/init.d/ssh reload
- Before modifying pdsh’s default rcmd to ssh, check its current default rcmd (typically rsh):
$ pdsh -q -w localhost
- Modify pdsh’s default rcmd to ssh by adding the following line to the ~/.bashrc file (for hduser):
export PDSH_RCMD_TYPE=ssh
- Run the following command to ensure the settings are applied.
$ source ~/.bashrc
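- Optionally, confirm that pdsh now uses ssh as its rcmd (the exact label in the pdsh -q output may differ slightly between versions):
$ echo $PDSH_RCMD_TYPE
$ pdsh -q -w localhost | grep -i rcmd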
E4. Disable IPv6
Run the following commands from Ubuntu with user hduser
- Edit the /etc/sysctl.conf file with the following command.
$ sudo nano /etc/sysctl.conf
- Add the following lines to the end of the sysctl.conf file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
- To reload the /etc/sysctl.conf settings, issue the following command.
$ sudo sysctl -p
- Check if IPv6 has been successfully disabled
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
0 means IPv6 is enabled; 1 means IPv6 is disabled.
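Alternatively, you can query all three keys in a single command; each should report 1 once the settings are applied:
$ sysctl net.ipv6.conf.all.disable_ipv6 net.ipv6.conf.default.disable_ipv6 net.ipv6.conf.lo.disable_ipv6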
- If IPv6 is still enabled after rebooting, you must carry out the following:
- Create (with root privileges) the file /etc/rc.local
$ sudo nano /etc/rc.local
- Insert the following into the file /etc/rc.local:
#!/bin/bash
/etc/init.d/procps restart
exit 0
- Make the file executable
$ sudo chmod 755 /etc/rc.local
E5. Install Hadoop
- Switch to the hduser account
- Download the appropriate Hadoop binary from the Hadoop releases page.
We will use Hadoop 3.3.6 to avoid problems with HBase in a later practical. Read more at URL https://hbase.apache.org/book.html#hadoop
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Or, you may copy the downloaded tar.gz file manually into the home directory of hduser (/home/hduser/).
- Untar the file
$ tar -xvzf hadoop-3.3.6.tar.gz
- Rename the folder as hadoop3
$ mv hadoop-3.3.6 hadoop3
$ sudo chown -R hduser:hadoop hadoop3
$ sudo chmod g+w -R hadoop3
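- Optionally, confirm that the binaries were unpacked correctly; since the PATH is only set up in E7, use the full path for now:
$ ~/hadoop3/bin/hadoop version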
E6. Configure passphraseless ssh for Hadoop
- Ensure that you are logged in as hduser
- Ensure that you can SSH to localhost in Ubuntu
- To ssh to localhost without a passphrase, run the following command to initialize your private and public keys:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
- Test the configuration
$ ssh localhost
Exit the ssh session by issuing the command exit.
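- Optionally, confirm that key-based login works without any prompt; with BatchMode, ssh fails instead of asking for a password:
$ ssh -o BatchMode=yes localhost 'echo passwordless ssh OK'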
E7. Configure Pseudo-distributed Mode for Hadoop
- Ensure that you are logged in as hduser
- Set up the environment variables in the ~/.bashrc file (for hduser) by adding the following lines to the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hduser/hadoop3
export PATH=$PATH:$HADOOP_HOME/bin
- Source the file:
$ source ~/.bashrc
- Change directory to the Hadoop folder
$ cd hadoop3
- Edit the etc/hadoop/hadoop-env.sh file by adding the following environment variables at the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
- Verify if the installation was successful
$ hadoop
Running the above command should display various options.
- Edit the etc/hadoop/core-site.xml file by adding the following configuration.
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
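Optionally, confirm that Hadoop picks up the new default file system setting; this only reads the configuration, so the cluster does not need to be running. It should print hdfs://localhost:9000.
$ bin/hdfs getconf -confKey fs.defaultFS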
- Edit the etc/hadoop/hdfs-site.xml file by adding the following configuration.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hduser/hadoopData/dfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hduser/hadoopName/dfs/data</value>
    <final>true</final>
  </property>
</configuration>
- Edit the etc/hadoop/mapred-site.xml file by adding the following configuration.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>
- Edit the etc/hadoop/yarn-site.xml file by adding the following configuration.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>127.0.0.1:8032</value>
  </property>
</configuration>
- Format the namenode
$ bin/hdfs namenode -format
Running the above command should display information on formatting the NameNode with the installed Java. You should observe that the NameNode is formatted and the process terminates without any error.
- Start the Distributed File System (DFS) service
$ sbin/start-dfs.sh
Running start-dfs.sh starts the NameNode and DataNode daemons. Check the running Java processes:
$ jps
If the NameNode and DataNode services have started successfully, you should see these four processes:
DataNode
Jps
NameNode
SecondaryNameNode
Open a web browser to view the NameNode web interface; by default it is available at http://localhost:9870/
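If a browser is not convenient from within WSL, you can optionally confirm that the NameNode web interface responds using curl:
$ curl -s http://localhost:9870/ | head -n 5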
- Start the Yet Another Resource Negotiator (YARN) service
$ sbin/start-yarn.sh
Running start-yarn.sh starts the YARN daemons. Check the running Java processes:
$ jps
If the YARN services have started successfully, you should see a total of six processes, including these two:
ResourceManager
NodeManager
Open a web browser to view the ResourceManager web interface; by default it is available at http://localhost:8088/
- Create HDFS directories required to execute MapReduce jobs
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/hduser
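- Optionally, list the new directories to confirm that they were created:
$ hdfs dfs -ls /user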
E8. Run a sample MapReduce Job in Hadoop
- Ensure that you are logged in as hduser
- Copy the input files into the distributed file system
$ hdfs dfs -mkdir input
$ hdfs dfs -put etc/hadoop/*.xml input
- Run a MapReduce job from an example
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar grep input output 'dfs[a-z.]+'
- View the output files on the distributed file system
$ hdfs dfs -cat output/*
Output:
1 dfsadmin
1 dfs.replication
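- Alternatively, you may copy the result from HDFS to the local file system and view it there (output_local below is just an example name for a local directory):
$ hdfs dfs -get output output_local
$ cat output_local/*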
- Stop all the daemons
$ sbin/stop-yarn.sh
$ sbin/stop-dfs.sh
- Log out from the hduser account and your tarumt account.
E9. Attention: At the beginning of all future practical
- Log in as hduser or switch to the hduser account
$ su - hduser
- Start the SSH service
$ sudo service ssh start
- Start the HDFS service
$ sbin/start-dfs.sh
$ jps
You should observe at least four (4) services, including both NameNode and DataNode, as stated in E7. Otherwise, you may need to reformat the HDFS NameNode. Wait for the format process to complete without errors before proceeding. Note that formatting the NameNode will result in the loss of all data files in HDFS. You also need to recreate the directory /user and its sub-directory /user/hduser after starting the HDFS service and before moving on to the next steps.
$ cd ~/hadoop3
$ bin/hdfs namenode -format
$ sbin/start-dfs.sh
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/hduser
- Start the YARN service
$ sbin/start-yarn.sh
$ jps
You should observe at least six (6) services in total; refer to E7.
E10. Attention: At the end of all future practical
- Stop the YARN service
$ sbin/stop-yarn.sh
- Stop the HDFS service
$ sbin/stop-dfs.sh
Issue the command top to check that all expected services have terminated, and press Ctrl-C to exit from top.
- Exit from your user account(s)
$ exit