
F. HDFS File Operations and MapReduce

  1. This practical introduces the basic Hadoop Distributed File System (HDFS) operations.
  2. Log in as hduser. The HDFS shell can be invoked using
    $ hdfs dfs <args>  
    
  3. To see the available commands in the shell
    $ hdfs dfs -help
    
  4. WordCount.zip

  5. StreamingOn-time.zip

F1. Basic File Operations home

  1. Download a file with file ID 122PnuKaSaA_OyYOKnxQOdlMc5awdyf5v from Google Drive
    $ wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=122PnuKaSaA_OyYOKnxQOdlMc5awdyf5v' -O shakespeare.txt
    
  2. Copy the downloaded file shakespeare.txt to the distributed file system
    $ hdfs dfs -put shakespeare.txt shakespeare.txt
    

    You may apply the option -f to force overwrite the destination file in the distributed file system, e.g.

    $ hdfs dfs -put -f shakespeare.txt shakespeare.txt
    
  3. Make a directory named corpora in the HDFS file system
    $ hdfs dfs -mkdir corpora
    
  4. Read the file using the cat command, and then pipe the output to less to view the contents of the remote file
    $ hdfs dfs -cat shakespeare.txt | less
    

    Use the arrow keys to navigate the file. Type q to quit.

  5. Copy a file from the distributed file system to the local file system
    $ hdfs dfs -get shakespeare.txt ./shakespeare-dfs.txt
    
  6. To change the permission of the file shakespeare.txt to 664
    $ hdfs dfs -chmod 664 shakespeare.txt 
    

    664 is the octal representation of the flags to set for the permission triple (owner, group, other). The statement above changes the permissions to -rw-rw-r--.
    6 is 110, which means read and write, but not execute.
    7 is 111, which means full permissions.
    4 is 100, which means read-only.
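
    If you want to double-check what an octal mode maps to, the short Python snippet below (a local illustration only, not an HDFS command) expands each octal digit into its rwx triple:

    # Expand an octal permission mode (e.g. 664) into the permission string
    # that hdfs dfs -ls shows in its first column.
    def mode_to_string(mode: int) -> str:
        out = []
        for shift in (6, 3, 0):                      # owner, group, other
            bits = (mode >> shift) & 0o7
            out.append("".join(flag if bits & (4 >> i) else "-"
                               for i, flag in enumerate("rwx")))
        return "".join(out)

    print(mode_to_string(0o664))  # rw-rw-r--
    print(mode_to_string(0o777))  # rwxrwxrwx
    print(mode_to_string(0o400))  # r--------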

F2. Advanced File Operations

  1. To view the contents of your current directory
    $ hdfs dfs -ls
    
  2. To view the contents of a specific directory, e.g., /user
    $ hdfs dfs -ls /user
    
  3. Create a test directory, e.g., testHDFS. Note that it will be created within HDFS.
    $ hdfs dfs -mkdir testHDFS
    
  4. Now verify the newly created directory by listing your HDFS home directory. You should see the testHDFS directory listed if it was created in the home directory of hduser.
    $ hdfs dfs -ls /user/hduser
    
  5. Create a file in the local file system that you wish to copy to HDFS
    $ echo "HDFS test file" >> testFile
    
  6. Note that this creates a new file named testFile containing the text HDFS test file. To verify that the file exists, input:
    $ ls
    
  7. To view the contents of the file
    $ cat testFile
    
  8. To copy the file from the Linux local file system into HDFS
    $ hdfs dfs -copyFromLocal testFile
    

    To copy files from your local machine to HDFS, use the command -copyFromLocal. The command -cp is only used to copy files within HDFS. For more options and flexibility in copying files or directories to your desired destination within HDFS, consider using the alternative command shown below

    $ hdfs dfs -put <local file path> <HDFS destination path>
    
  9. Now you need to confirm that the file has been copied over correctly
    $ hdfs dfs -ls
    $ hdfs dfs -cat testFile
    
  10. Now you can move it into the testHDFS directory you have already created
    $ hdfs dfs -mv testFile testHDFS
    $ hdfs dfs -ls
    $ hdfs dfs -ls testHDFS/
    

    The first command moves testFile from the HDFS home directory into the testHDFS directory you created. The second command shows that it is no longer in the HDFS home directory, and the third command confirms that it has been moved into the testHDFS directory.

  11. To copy a file within HDFS
    $ hdfs dfs -cp testHDFS/testFile testHDFS/testFile2
    $ hdfs dfs -ls testHDFS/
    
  12. Checking disk usage is useful when you are using HDFS. To see how much space your files occupy, use the following command
    $ hdfs dfs -du
    
  13. To view how much space is available in HDFS across the Hadoop cluster
    $ hdfs dfs -df
    
  14. To delete a file or directory in HDFS
    $ hdfs dfs -rm testHDFS/testFile
    $ hdfs dfs -ls testHDFS/
    
  15. You will observe that the testHDFS directory, together with the testFile2 you created, is still left over. Remove the directory and its contents with the following commands
    $ hdfs dfs -rm -r testHDFS
    $ hdfs dfs -ls
    

    In addition to the commands above, HDFS provides a number of POSIX-like commands (https://en.wikipedia.org/wiki/List_of_POSIX_commands), including chgrp, chown, cp, du, mkdir, stat, and tail.

F3. MapReduce with Java - Word Count

  1. Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks. The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
  2. Input and Output types of a MapReduce job:

    (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
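
    To make the <key, value> flow concrete, the following standalone Python sketch mimics the word-count dataflow in a single process: map emits <word, 1> pairs, the pairs are sorted and grouped (standing in for the framework's shuffle), and reduce sums the counts per word. It is only an illustration of the dataflow, not Hadoop's Java API or the WordCount.java used later.

    # A minimal, local simulation of the MapReduce word-count dataflow.
    # Hadoop runs map and reduce tasks in parallel across the cluster;
    # here everything happens in one Python process purely for illustration.
    from itertools import groupby
    from operator import itemgetter

    def map_phase(lines):
        # map: each input line -> a stream of <word, 1> pairs
        for line in lines:
            for word in line.split():
                yield (word.lower(), 1)

    def reduce_phase(sorted_pairs):
        # reduce: all pairs sharing a key -> one <word, total count> pair
        for word, group in groupby(sorted_pairs, key=itemgetter(0)):
            yield (word, sum(count for _, count in group))

    lines = ["to be or not to be", "to see or not to see"]
    shuffled = sorted(map_phase(lines))   # stands in for the sort/shuffle step
    for word, count in reduce_phase(shuffled):
        print(f"{word}\t{count}")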

  3. Read more on MapReduce at https://hadoop.apache.org/docs/r3.3.6/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html and https://en.wikipedia.org/wiki/MapReduce

  4. Log in as hduser, and make a copy of the C:\de\WordCount folder in hduser's local home directory
    $ sudo cp -r /mnt/c/de/WordCount /home/hduser
    $ sudo chown hduser:hduser -R /home/hduser/WordCount
    
  5. Change directory to the WordCount directory
    $ cd WordCount
    
  6. Review and understand the code in WordCount.java. Compile WordCount.java and create a jar file. Check the contents of the directory after each of the following statements
    $ cat ~/WordCount/WordCount.java
    $ hadoop com.sun.tools.javac.Main WordCount.java
    $ jar cf wc.jar WordCount*.class
    $ ls ~/WordCount
    
  7. Make sure the file shakespeare.txt exists in the distributed file system before submitting the job to the Hadoop cluster
    $ hdfs dfs -ls /user/hduser
    $ hadoop jar wc.jar WordCount shakespeare.txt wordcounts
    

    You may copy the file from the Linux local file system to the distributed file system using the following command

    $ hdfs dfs -put ~/WordCount/shakespeare.txt /user/hduser
    
  8. Examine the result of the job by running cat on the part file in the distributed file system and piping the output to less
    $ hdfs dfs -cat wordcounts/part-r-00000 | less
    

    As before, type q to quit less; note that Ctrl-Z only suspends the pipeline and leaves a stopped job behind.
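
    If you prefer to inspect the result outside HDFS, you can first fetch the part file with hdfs dfs -get wordcounts/part-r-00000 (as in F1 step 5) and then, for example, list the ten most frequent words with the short Python sketch below. The tab-separated word/count layout matches the standard WordCount output; the local file path is an assumption, so adjust it to wherever you saved the file.

    # Sketch: print the 10 most frequent words from a locally copied
    # WordCount output file (one "word<TAB>count" pair per line).
    from collections import Counter

    counts = Counter()
    with open("part-r-00000") as fh:      # assumed local path; adjust as needed
        for line in fh:
            word, _, count = line.rstrip("\n").partition("\t")
            if count:
                counts[word] = int(count)

    for word, count in counts.most_common(10):
        print(f"{count:8d}  {word}")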

  9. List all currently running jobs (run this while the job has not yet completed)
    $ mapred job -list
    

F4. MapReduce with Python - On-time Performance of Flights

  1. Log in as hduser, and make a copy of the C:\de\StreamingOn-time folder in hduser's local home directory
    $ sudo cp -r /mnt/c/de/StreamingOn-time /home/hduser
    $ sudo chown hduser:hduser -R /home/hduser/StreamingOn-time
    
  2. Change directory to the StreamingOn-time directory. Test the scripts without the Hadoop overhead by simulating the MapReduce pipeline using Linux pipes and the sort command.
    $ cd StreamingOn-time
    $ cat flights.csv | ./mapper.py | sort | ./reducer.py
    

    If you observe the error /usr/bin/env: 'python': No such file or directory, issue the following commands and then re-execute the previous command

    $ sudo apt update
    $ sudo apt install python-is-python3
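
    For reference, a Hadoop Streaming mapper and reducer are ordinary scripts that read from standard input and write tab-separated key/value lines to standard output. The sketch below is not the code shipped in StreamingOn-time.zip (whose exact logic and CSV column layout are not shown here); it only illustrates the general shape of a streaming reducer that averages a numeric value per key, assuming a matching mapper emits lines of the form carrier<TAB>delay.

    #!/usr/bin/env python
    # Illustrative streaming reducer (NOT the reducer.py from StreamingOn-time.zip):
    # input lines arrive from the shuffle sorted by key as "carrier<TAB>delay",
    # and the script prints the average delay per carrier.
    import sys

    current_key = None
    total = 0.0
    count = 0

    def emit(key, total, count):
        # print the average for one key, if any values were seen
        if key is not None and count > 0:
            print(f"{key}\t{total / count:.2f}")

    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        try:
            delay = float(value)
        except ValueError:
            continue                            # skip headers or malformed records
        if key != current_key:
            emit(current_key, total, count)     # key changed: flush the previous carrier
            current_key, total, count = key, 0.0, 0
        total += delay
        count += 1

    emit(current_key, total, count)             # flush the final carrier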
    
  3. Create the required directory in the distributed file system. You may review the created directory before copying the required file flights.csv to the destination directory
    $ hdfs dfs -mkdir /user/hduser/StreamingOn-time
    $ hdfs dfs -ls /user/hduser
    $ hdfs dfs -ls /user/hduser/StreamingOn-time
    $ hdfs dfs -put /home/hduser/StreamingOn-time/flights.csv /user/hduser/StreamingOn-time
    $ hdfs dfs -ls /user/hduser/StreamingOn-time
    
  4. Execute the Hadoop Streaming job on the Hadoop cluster, and examine the newly created directory as well as the output
    $ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input StreamingOn-time/flights.csv -output StreamingOn-time/average_delay -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
    $ hdfs dfs -ls /user/hduser/StreamingOn-time
    $ hdfs dfs -ls /user/hduser/StreamingOn-time/average_delay
    
  5. Copy the StreamingOn-time/average_delay directory from HDFS to your local Linux directory, and list the contents of the local directory to check that the folder was copied successfully
    $ hdfs dfs -copyToLocal /user/hduser/StreamingOn-time/average_delay
    $ ls /home/hduser/StreamingOn-time/average_delay
    
  6. Check the contents of the average_delay folder by viewing the beginning of the part-00000 file
    $ head /home/hduser/StreamingOn-time/average_delay/part-00000