$ sudo addgroup hadoop
$ sudo adduser hduser
$ sudo usermod -aG sudo hduser
$ sudo usermod -a -G hadoop hduser
$ su - hduser
$ exit
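You can verify the new account's group memberships before proceeding; groups should list both sudo and hadoop for hduser:
$ groups hduser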
Reboot (terminate and restart) Ubuntu in WSL, then run the following commands from Ubuntu as user hduser
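If you are unsure how to terminate the distribution, one way (assuming the distribution is named Ubuntu; check with wsl --list) is to run the following from Windows PowerShell or a command prompt:
> wsl --terminate Ubuntu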
$ service ssh status
$ sudo apt update
$ sudo apt upgrade
$ sudo apt install ssh
$ sudo apt update
$ sudo apt upgrade
$ sudo apt install pdsh
$ python3 --version
$ sudo apt install software-properties-common
$ sudo apt update
$ sudo apt install python3
$ sudo apt install python3-dev
$ sudo apt install python3-wheel
$ ls -l /usr/bin/python*
$ nano ~/.bashrc
alias python=python3
export CPATH=/usr/include/python3.10:$CPATH
export LD_LIBRARY_PATH=/usr/lib:$LD_LIBRARY_PATH
Save the ~/.bashrc file.
$ source ~/.bashrc
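A quick check that the alias is in effect; python should now report the Python 3 version:
$ python --version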
$ sudo apt update
$ sudo apt install python3-pip
$ pip3 --version
$ sudo pip3 install --upgrade pip
$ java -version
$ sudo apt-get install openjdk-8-jdk
After this, check if the correct version of Java is installed.
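If more than one Java version is installed, you can select Java 8 as the default interactively, for example with:
$ sudo update-alternatives --config java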
Run the following commands from Ubuntu as user hduser
Secure Shell (SSH), sometimes called Secure Socket Shell, is a protocol for securely accessing a remote machine over an unsecured network. In other words, it is a way to safely log in to a server remotely using your preferred command-line interface.
$ service ssh status
$ sudo service ssh start
Note that WSL occasionally does not start sshd (the SSH daemon) automatically; in that case, start sshd manually with the command above.
$ sudo ufw allow ssh
Ubuntu comes with a firewall configuration tool known as UFW. If the firewall is enabled on your system, make sure to open the targeted SSH port (port 22 by default).
$ sudo /etc/init.d/ssh reload
$ pdsh -q -w localhost
If the query reports rsh as the remote command type, add the following line to ~/.bashrc so that pdsh uses SSH instead:
export PDSH_RCMD_TYPE=ssh
$ source ~/.bashrc
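Re-run the pdsh query to confirm the change; the reported rcmd type should now be ssh:
$ pdsh -q -w localhost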
Run the following commands from Ubuntu as user hduser
$ sudo nano /etc/sysctl.conf
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
$ sudo sysctl -p
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
0 means IPv6 is enabled; 1 means IPv6 is disabled.
If IPv6 is still enabled after rebooting, you must carry out the following:
$ sudo nano /etc/rc.local
#!/bin/bash
# reload kernel settings from /etc/sysctl.conf and /etc/sysctl.d at boot
/etc/init.d/procps restart
exit 0
$ sudo chmod 755 /etc/rc.local
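To confirm the change without waiting for the next boot, you can run the script once manually and re-check the flag (1 means IPv6 is disabled):
$ sudo /etc/rc.local
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6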
Switch to the hduser account
We will use Hadoop 3.3.6 to avoid problems with HBase in a later practical. Read more at URL https://hbase.apache.org/book.html#hadoop
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Or, you may copy the downloaded tar.gz file manually to hduser's home directory, /home/hduser/
$ tar -xvzf hadoop-3.3.6.tar.gz
$ mv hadoop-3.3.6 hadoop3
$ sudo chown -R hduser:hadoop hadoop3
$ sudo chmod g+w -R hadoop3
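You can verify the ownership and permissions of the extracted directory:
$ ls -ld ~/hadoop3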
Ensure that you are logged in as hduser
Ensure that you can SSH to the localhost in Ubuntu
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ ssh localhost
Exit the SSH session by issuing the command exit.
Ensure that you are logged in as hduser, and append the following lines to ~/.bashrc:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hduser/hadoop3
export PATH=$PATH:$HADOOP_HOME/bin
$ source ~/.bashrc
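To verify the environment, check that HADOOP_HOME is set and that the hadoop binary is on the PATH; hadoop version should report release 3.3.6:
$ echo $HADOOP_HOME
$ hadoop version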
$ cd hadoop3
Edit etc/hadoop/hadoop-env.sh and set the following variables:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
$ hadoop
Running the above command should display the usage documentation for the hadoop script.
Edit etc/hadoop/core-site.xml as follows:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Edit etc/hadoop/hdfs-site.xml as follows:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hduser/hadoopData/dfs/data</value>
<final>true</final>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hduser/hadoopName/dfs/name</value>
<final>true</final>
</property>
</configuration>
Edit etc/hadoop/mapred-site.xml as follows:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
Edit etc/hadoop/yarn-site.xml as follows:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>127.0.0.1:8032</value>
</property>
</configuration>
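Once all four configuration files are saved, you can sanity-check that Hadoop picks them up; for example, getconf should echo back the fs.defaultFS value set above:
$ bin/hdfs getconf -confKey fs.defaultFS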
$ bin/hdfs namenode -format
Running the above command should display information on formatting the NameNode with the installed Java. You should observe that the NameNode is formatted and the process terminates without any error.
$ sbin/start-dfs.sh
Start the NameNode and DataNode daemons
$ jps
Check the status. If the NameNode and DataNode services started successfully, you should see these four processes:
DataNode, Jps, NameNode, SecondaryNameNode
Browse to the NameNode web interface; by default, it is available at the URL http://localhost:9870/
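If no browser is convenient, you can probe the NameNode web port from the Ubuntu shell instead (assuming curl is installed):
$ curl -s http://localhost:9870/ | head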
$ sbin/start-yarn.sh
Start the Yarn daemon
$ jps
Check the status. If the YARN services started successfully, you should see a total of six processes, including these two:
NodeManager, ResourceManager
Browse to the ResourceManager web interface; by default, it is available at the URL http://localhost:8088/
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/hduser
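You can list the new directories to confirm that they were created:
$ hdfs dfs -ls /
$ hdfs dfs -ls /user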
Ensure that you are logged in as hduser and working in the ~/hadoop3 directory, since the following commands use relative paths.
$ hdfs dfs -mkdir input
$ hdfs dfs -put etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar grep input output 'dfs[a-z.]+'
$ hdfs dfs -cat output/*
Output:
1 dfsadmin
1 dfs.replication
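Alternatively, you can copy the output files from HDFS to the local filesystem and examine them there:
$ hdfs dfs -get output output
$ cat output/*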
$ sbin/stop-yarn.sh
$ sbin/stop-dfs.sh
$ su - hduser
$ sudo service ssh start
$ sbin/start-dfs.sh
$ jps
You should observe at least four (4) services, including both NameNode and DataNode, as stated in step 12 of G7. Otherwise, you may need to reformat the HDFS NameNode; wait for the format process to complete without errors before proceeding. Note that formatting the NameNode will result in the loss of all data files in HDFS. You also need to recreate the directory /user and its sub-directory /user/hduser after starting the HDFS service and before moving on to the next steps:
$ cd ~/hadoop3
$ bin/hdfs namenode -format
$ sbin/start-dfs.sh
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/hduser
$ sbin/start-yarn.sh
$ jps
You should observe at least six (6) services in total; refer to steps 12 and 13 of G7.
$ sbin/stop-yarn.sh
$ sbin/stop-dfs.sh
Issue the command top to verify that all the expected services have terminated, and press Ctrl-C to exit from top.
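As an alternative to top, you can run jps again; after stopping YARN and HDFS, only the Jps process itself should remain:
$ jps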
$ exit