Sunday, 10 July 2016

'Spark' by 'Spark'

Installing the "Awesome" Spark:

It has been a fun-filled journey of repeated failures: installing VirtualBox >> importing Ubuntu into it >> updating Java >> downloading Spark >> running a program. It was painful, tedious work for my 4 GB of RAM, of which I allocated 2.5 GB to the virtual box.
In spite of knowing all these steps, there was always some struggle in executing them. After repeated failures, just as I was about to give up on the tedious installation process, someone who is always curious and excited about technology came up with a single file that saved me from ending up hating the wonderful technology named "Spark".
    Upon receiving the gift "PraxiSpark.ova", a sudden excitement sprang up, along with the reassuring hope that "Spark" could indeed be learnt.
And it would be a crime if I didn't explain the simple instructions, which were further simplified by our professor:
"Open 'PraxiSpark.ova' using the VirtualBox that is already installed; everything is built in." !!!! Yes, it is that simple.
I was really puzzled about how the stunt was pulled off, but hiding my excitement, I pulled up my socks for a wonderful journey into Big Data.

Magenta-Colored Beauty, the "Command Window":
  After opening the command window, type pyspark. After a few minutes of silence, filled with a confused stare at the screen while trying to figure out what the scrolling lines mean, the word SPARK appears in slanted, broken-line ASCII art. That is the first success, and it assures you that the system is ready for the battle of executing complex programs.
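
Once the banner appears, a quick sanity check confirms that everything works. This is a minimal sketch: the pyspark shell already provides the SparkContext as sc, and the version string will be whatever Spark the VM ships with.

    # Inside the pyspark shell, the SparkContext already exists as `sc`.
    print(sc.version)                            # prints the bundled Spark version
    print(sc.parallelize([1, 2, 3]).count())     # runs a tiny job; prints 3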
A small introduction to our star "Spark":
            Spark is a general purpose cluster computing framework that provides efficient in-memory computations for large data sets by distributing computation across multiple computers.
 If you're familiar with Hadoop, then you know that any distributed computing framework needs to solve two problems: how to distribute data and how to distribute computation. Hadoop uses HDFS to solve the distributed data problem and MapReduce as the programming paradigm that provides effective distributed computation. Similarly, Spark has a functional programming API in multiple languages that provides more operators than map and reduce, and does this via a distributed data framework called resilient distributed datasets, or RDDs.
         RDDs are essentially a programming abstraction that represents a read-only collection of objects that are partitioned across machines. RDDs can be rebuilt from a lineage (and are therefore fault tolerant), are accessed via parallel operations, can be read from and written to distributed storages like HDFS or S3, and most importantly, can be cached in the memory of worker nodes for immediate reuse. Because RDDs can be cached in memory, Spark is extremely effective at iterative applications, where the data is being reused throughout the course of an algorithm. Most machine learning and optimization algorithms are iterative, making Spark an extremely effective tool for data science. Additionally, because Spark is so fast, it can be accessed in an interactive fashion via a command line prompt similar to the Python REPL.
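
To make the caching behaviour concrete, here is a minimal sketch in the pyspark shell (where sc is the SparkContext); the numbers and the four partitions are purely illustrative:

    # Create an RDD partitioned across the cluster (4 partitions here).
    numbers = sc.parallelize(range(1, 1001), 4)

    # Transformations are lazy: this only records the lineage.
    squares = numbers.map(lambda n: n * n)

    # Ask Spark to keep the computed partitions in memory for reuse.
    squares.cache()

    # The first action computes (and caches) the data ...
    total = squares.reduce(lambda a, b: a + b)

    # ... subsequent actions reuse the cached partitions instead of recomputing.
    howMany = squares.count()

    print(total)      # 333833500
    print(howMany)    # 1000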
      The Spark library itself contains a lot of the application elements that have found their way into most Big Data applications including support for SQL-like querying of big data, machine learning and graph algorithms, and even support for live streaming data.
The core components are:
  • Spark Core: Contains the basic functionality of Spark; in particular the APIs that define RDDs and the operations and actions that can be undertaken upon them. The rest of Spark's libraries are built on top of the RDD and Spark Core.
  • Spark SQL: Provides APIs for interacting with Spark via the Apache Hive variant of SQL called Hive Query Language (HiveQL). Every database table is represented as an RDD and Spark SQL queries are transformed into Spark operations. For those that are familiar with Hive and HiveQL, Spark can act as a drop-in replacement.
  • Spark Streaming: Enables the processing and manipulation of live streams of data in real time. Many streaming data libraries (such as Apache Storm) exist for handling real-time data. Spark Streaming enables programs to leverage this data similar to how you would interact with a normal RDD as data is flowing in.
  • MLlib: A library of common machine learning algorithms implemented as Spark operations on RDDs. This library contains scalable learning algorithms like classifications, regressions, etc. that require iterative operations across large data sets. The Mahout library, formerly the Big Data machine learning library of choice, will move to Spark for its implementations in the future.
  • GraphX: A collection of algorithms and tools for manipulating graphs and performing parallel graph operations and computations. GraphX extends the RDD API to include operations for manipulating graphs, creating subgraphs, or accessing all vertices in a path.
        Because these components meet many Big Data requirements as well as the algorithmic and computational requirements of many data science tasks, Spark has been growing rapidly in popularity. Not only that, but Spark provides APIs in Scala, Java, and Python; meeting the needs for many different groups and allowing more data scientists to easily adopt Spark as their Big Data solution.
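
As a small taste of one of these components, here is a minimal Spark SQL sketch in PySpark, assuming Spark 1.3 or later (where SQLContext and the DataFrame API are available); the table name and the rows are made up for illustration:

    from pyspark.sql import SQLContext, Row

    sqlContext = SQLContext(sc)

    # Build a tiny DataFrame from an RDD of Rows.
    people = sc.parallelize([Row(name="Ada", age=36), Row(name="Linus", age=46)])
    df = sqlContext.createDataFrame(people)

    # Register it as a temporary table so it can be queried with SQL.
    df.registerTempTable("people")
    adults = sqlContext.sql("SELECT name FROM people WHERE age > 40")
    print(adults.collect())   # e.g. [Row(name=u'Linus')]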

Word Count in PySpark:

After the fun-filled installation process, it's time to get some serious tasks done. The task given is to perform a word count on a text file. With sample code as a reference, a word count was performed on the text file "blind.txt", taken from textfiles.com, a repository of text files covering a wide range of topics.

Step by step approach for the word count program in pyspark:

1. The initial call to the textFile method of the variable sc (the SparkContext) creates the first resilient distributed dataset (RDD) by reading lines from each file in the specified directory on HDFS; subsequent calls transform each input RDD into a new output RDD. We'll consider a simple example where we start by creating an RDD with just two lines using sc.parallelize, rather than reading the data from files with sc.textFile, and trace what each step in our word count program does (a sketch of the full program follows these steps).

2. flatMap(<function>): flatMap applies a function that takes each input value and returns a list; each element of that list becomes a new, separate value in the output RDD. In our example, the lines are split into words, and each word becomes a separate value in the output RDD.
map(<function>): map returns a new RDD containing the values created by applying the supplied function to each value in the original RDD; here it is used to pair each word with an initial count of 1.
3. map a lambda function over the data to swap the first and second values in each tuple, so that the word count appears in the first position and the word in the second position (which makes sorting by count straightforward).
4. To inspect the lineage so far, we can use the toDebugString method to see how our PipelinedRDD is being transformed. We can then apply the reduceByKey transformation to get our word counts and write those word counts to disk.
5. Finally, the word count output, which is in the form of tuples, is saved as a text file along with the counts of the words.
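
Putting the steps together, here is a minimal sketch of the word count in the pyspark shell (sc already exists). The two sample lines and the output directory wordcount_output are illustrative; a real run would use sc.textFile("blind.txt") instead of sc.parallelize, and the swap from step 3 is applied after reduceByKey so the counts can be sorted:

    # Step 1: create the first RDD; sc.textFile("blind.txt") would read the real file.
    lines = sc.parallelize(["to be or not to be", "that is the question"])

    # Step 2: flatMap splits each line into words; map pairs each word with a count of 1.
    words = lines.flatMap(lambda line: line.split())
    pairs = words.map(lambda word: (word, 1))

    # Step 4: inspect the lineage, then sum the counts per word with reduceByKey.
    print(pairs.toDebugString())
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Step 3: swap each tuple so the count comes first and the word second.
    swapped = counts.map(lambda wc: (wc[1], wc[0]))

    # Step 5: save the (count, word) tuples as text files in an output directory.
    swapped.sortByKey(False).saveAsTextFile("wordcount_output")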

In this post we saw how to apply transformations to Spark RDDs using the Python API, PySpark. We also dug a little into the details of those transformations with a useful word count example. Next we will look at more involved transformations available in the Spark API.

K-means Clustering in Spark-shell using Scala:

            Before jumping into the world of machine learning, a wonderful library has to be acknowledged: "MLlib". It is one of the most wonderful creations of Apache Spark and paves our way into the world of machine learning. Even complex algorithms are solved with ease, in a fraction of the time, which gives Spark a huge advantage over other platforms for machine learning workloads.

 Here comes the "MLlib":

MLlib is a scalable machine learning library built on top of Spark. As of version 1.0, the library is a work in progress. The main components of the library are:
  • Classification algorithms, including logistic regression, Naïve Bayes and support vector machines
  • Clustering limited to K-means in version 1.0
  • L1 & L1 Regularization
  • Optimization techniques such as gradient descent, logistic gradient and stochastic gradient descent and L-BFGS
  • Linear algebra such as Singular Value Decomposition
  • Data generator for K-means, logistic regression and support vector machines.

    Benefits of MLlib:
  • Part of Spark
  • Integrated workflow
  • Scala, Java & Python APIs
  • Broad coverage of applications & algorithms
  • Rapid improvements in speed & robustness
  • Ongoing development & a large community
  • Easy to use, well documented

K-means Clustering Using Scala:

Let's consider the K-means clustering components bundled with Apache Spark MLlib. The K-means configuration parameters are:
  • K Number of clusters
  • maxNumIters Maximum number of iterations for minimizing the reconstruction error
  • numRuns Number of runs or episodes used for training the clusters
  • caching Specify whether the resulting RDD has to be cached in memory
  • xt The array of data points (type Array[Double])
  • sc Implicit Spark context

Spark-Shell >> Scala:
1. After opening the command window, the path is set to the Scala/Spark folder located on the local disk.
2. At the E:\ prompt, cd into the bin folder and run spark-shell.

3. After importing the Spark clustering library to perform the clustering algorithm, the data is loaded: a text file containing the default K-means sample data from the examples provided in the Spark folder.
4. The data is then parsed into numeric vectors.

5. Then we initialize the number of clusters, k, and the maximum number of iterations allowed in the clustering process. Here we take k = 2 as the optimal number of clusters.
6. The model is then trained using the MLlib package that we imported.
7. The trained model is evaluated by measuring the within-cluster sum of squared errors of the clusters. A rough sketch of these steps is shown below.
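
For reference, here is a minimal sketch of the same workflow written against the Python MLlib API, kept in Python for consistency with the earlier word count example (in the spark-shell, the equivalent Scala calls are KMeans.train and computeCost). The file path, k = 2 and the 20-iteration limit mirror the steps above and are otherwise illustrative:

    from math import sqrt
    from numpy import array
    from pyspark.mllib.clustering import KMeans

    # Steps 3-4: load the sample data shipped with Spark and parse it into vectors.
    data = sc.textFile("data/mllib/kmeans_data.txt")
    parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

    # Step 5: number of clusters and iteration limit.
    k = 2
    maxIterations = 20

    # Step 6: train the clustering model.
    clusters = KMeans.train(parsedData, k, maxIterations=maxIterations)

    # Step 7: evaluate the model by the Within Set Sum of Squared Errors (WSSSE).
    def error(point):
        center = clusters.clusterCenters[clusters.predict(point)]
        return sqrt(sum((point - center) ** 2))

    WSSSE = parsedData.map(error).reduce(lambda a, b: a + b)
    print("Within Set Sum of Squared Errors = " + str(WSSSE))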