spark

H. Spark and Machine Learning

Spark has two machine learning libraries with different APIs, but including implementations of similar algorithms. These machine learning algorithms and utilities have been designed to scale across a cluster by way of inclusion of parallel algorithms in which operations can be applied in parallel across nodes.
- spark.mllib: spark.mllib is the older built-in Spark machine learning library. It consists of learning algorithms using Spark’s RDD operations. To leverage this, you need to create data (as RDDs) and operate on data in a distributed, parallelizable manner to make them available to all nodes in the cluster.
- spark.ml: The newer library which is inspired by scikit-learn. It provides a uniform set of high-level API built on top of DataFrames for constructing and tuning machine learning pipelines.
As of Spark 2.0, the RDD-based APIs (https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds) in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API (https://spark.apache.org/docs/latest/sql-programming-guide.html) in the spark.ml package. Therefore, Apache Spark recommends using spark.ml instead of spark.mllib. Read more at URL https://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api
mllib.zip

H1. spark.mllib - Recommender Systems

This example uses Alternating Least Squares (ALS) algorithm to obtain potential matches for an online dating service

Make a copy of the C:\de\mllib folder in the local hduser’s home directory

$ sudo cp -r /mnt/c/de/mllib /home/hduser
$ sudo chown hduser:hduser -R /home/hduser/mllib

This example uses k-means clustering to determine which areas in the US have been most hit by earthquakes
Ensure that the data file earthquakes.csv exists in HDFS’ data directory. You may review the contents of the mllib/data directory before proceeding to the next step
```
$ hdfs dfs -ls /user/hduser/mllib/data
```
Review and understand the code in earthquakes_clustering.py before submitting the earthquakes_clustering.py job to Spark and piping the output to a file
```
$ hdfs dfs -cat /user/hduser/mllib/earthquakes_clustering.py | more
$ $SPARK_HOME/bin/spark-submit mllib/earthquakes_clustering.py hdfs://localhost:9000/user/hduser/mllib/data/earthquakes.csv 6 > clusters.txt
```
Note that this exercise takes the data file name, located in the distributed file system, as a parameter
Examine the output in the file cluster.txt

This example uses the Naive Bayes classifier to perform multiclass classification
Ensure that the data file sample_libsvm_data.txt exists in HDFS’ data directory. You may review the contents of the mllib/data directory before proceeding to the next step
```
$ hdfs dfs -ls /user/hduser/mllib/data
```
Review and understand the code in naive_bayes_example.py before submitting the naive_bayes_example.py job to Spark
```
$ hdfs dfs -cat /user/hduser/mllib/naive_bayes_example.py | more
$ $SPARK_HOME/bin/spark-submit mllib/naive_bayes_example.py
```
Note that you need change the correct file path inside the naive_bayes_example.py, e.g. from data/*.txt to hdfs://localhost:9000/user/hduser/mllib/data/*.txt
Examine the output in the terminal

This example uses Alternating Least Squares (ALS) algorithm for movie recommendations
Ensure that the data file sample_movielens_ratings.txt exists in HDFS’ data directory. You may review the contents of the mllib/data directory before proceeding to the next step
```
$ hdfs dfs -ls /user/hduser/mllib/data
```
Review and understand the code in als_example.py before submitting the naive_bayes_example.py job to Spark
```
$ hdfs dfs -cat /user/hduser/mllib/als_example.py | more
$ $SPARK_HOME/bin/spark-submit mllib/als_example.py
```
Note that you need change the correct file path inside the als_example.py, e.g. from data/*.txt to hdfs://localhost:9000/user/hduser/mllib/data/*.txt
Examine the output in the terminal