Meta & resources
Machine Learning: concepts & procedures
Machine Learning: fundamental algorithms
Machine Learning: model assessment
Artificial neural networks
Natural language processing
The computer science appendix

Apache Spark

What is

Apache Spark is a general purpose open source distributed system for cluster computing on large datasets. It is natively written in Scala but has API's for other languages. It was originally developed at the University of California at Berkeley in 2009 and then donated to the Apache Foundation.

The figure above illustrates in a very rough way what the Apache Spark environment consists of. Spark ML and MLlib are the libraries for Machine Learning: the first, newer, works with DataFrames; the second with RDDs (Resilient Distributed Datasets, see below). API's exist in Scala (Spark native language), Java, R and Python.

The concepts illustrated briefly here in this note are reworked from the the refs, which is full of great material, including tutorials.

Distributed Data Model: RDDs and DataFrames

The core of the computation in Spark lies in the use of RDDs (Resilient Distributed Datasets). An RDD is a partitioned collection of elements that can be operated in parallel; it is recomputed on node failures and is created by the so-called SparkContext from input sources (local file system, the HDFS - see the Hadoop page - , ...)

RDDs have these features:

• are immutable once constructed;

• enable operators to run in parallel;

• are able to track info to recompute potential lost data;

• the more the partitions, the more parallelism

• are lazily evaluated

A (distributed) DataFrame can be built from an RDD and is conceptually similar to the corresponding R and Pandas ones.

The fact that RDDs are lazily evaluated means that no operation is actually executed until an action is called, and this allows to skip intermediate big results.

The Spark programming model

The SparkContext tells Spark how and where to access the cluster and is used to create the RDDs. Its master parameter sets the type and size of the cluster:

• local: run locally with a single worker

• local[4]: run locally with 4 workers (ideally to be set to the number of cores)

• spark://host:port: run on an external specified cluster

The figure above illustrates the relation between drivers and workers in Spark, FS is the file system.

Transformations and Actions

Transformations are meant to transform the RDD, creating a new dataset, are lazily applied, and the execution takes place only when an action is called. Actions cause all transformations to be executed.

Examples of transformation are map, filter, distinct; examples of actions are reduce, take, collect*.

The Spark programming model follows these steps:

• Create RDD from data by parallelisation

• Transform into a new RDD

• Cache some RDDs for reuse (\textit{persist})

• Use action to execute parallel computation and produce results

The problem with large global variables is that it is inefficient to send large data to each worker. To solve this, Spark has shared variables: broadcasts (send large, read-only values to the workers who save them) and accumulators (aggregate the values from the workers back to the driver and are write-only).

Generally speaking, Spark is faster: the use of memory guarantees it is roughly$~O(10^2)$faster. The use of disk in Hadoop is slow for complex jobs and interactive queries. Spark leverages the idea of keeping more data into memory.