Introduction to Spark

Spark Essentials: RDDs, Transformations and Actions

This introductory notebook describes how to get started running Spark (Scala) code in Notebooks.
Working with Spark's Resilient Distributed Datasets (RDDs)
- creating RDDs
- performing basic transformations on RDDs
- performing basic actions on RDDs

RECOLLECT from 001_WhySpark notebook and AJ's videos (did you watch the ones marked Watch Now? if NOT we should watch NOW!!! This was last night's "Viewing Home-Work") that Spark does fault-tolerant, distributed, in-memory computing

Spark Cluster Overview:

Driver Program, Cluster Manager and Worker Nodes

The driver does the following:

connects to a cluster manager to allocate resources across applications
acquire executors on cluster nodes
- executor processs run compute tasks and cache data in memory or disk on a worker node
sends application (user program built on Spark) to the executors
sends tasks for the executors to run
- task is a unit of work that will be sent to one executor

See http://spark.apache.org/docs/latest/cluster-overview.html for an overview of the spark cluster.

%md
## The Abstraction of Resilient Distributed Dataset (RDD)

#### RDD is a fault-tolerant collection of elements that can be operated on in parallel

#### Two types of Operations are possible on an RDD

* Transformations
* Actions

**(watch now 2:26)**:

[![RDD in Spark by Anthony Joseph in BerkeleyX/CS100.1x](http://img.youtube.com/vi/3nreQ1N7Jvk/0.jpg)](https://www.youtube.com/watch?v=3nreQ1N7Jvk?rel=0&autoplay=1&modestbranding=1&start=1&end=146)


***

## Transformations
**(watch now 1:18)**:

[![Spark Transformations by Anthony Joseph in BerkeleyX/CS100.1x](http://img.youtube.com/vi/360UHWy052k/0.jpg)](https://www.youtube.com/watch?v=360UHWy052k?rel=0&autoplay=1&modestbranding=1)

***


## Actions
**(watch now 0:48)**:

[![Spark Actions by Anthony Joseph in BerkeleyX/CS100.1x](http://img.youtube.com/vi/F2G4Wbc5ZWQ/0.jpg)](https://www.youtube.com/watch?v=F2G4Wbc5ZWQ?rel=0&autoplay=1&modestbranding=1&start=1&end=48)

***

## Key Points

* Resilient distributed datasets (RDDs) are the primary abstraction in Spark.
* RDDs are immutable once created:
    * can transform it.
    * can perform actions on it.
    * but cannot change an RDD once you construct it.
* Spark tracks each RDD's lineage information or recipe to enable its efficient recomputation if a machine fails.
* RDDs enable operations on collections of elements in parallel.
* We can construct RDDs by:
    * parallelizing Scala collections such as lists or arrays
    * by transforming an existing RDD,
    * from files in distributed file systems such as (HDFS, S3, etc.).
* We can specify the number of partitions for an RDD
* The more partitions in an RDD, the more opportunities for parallelism
* There are **two types of operations** you can perform on an RDD:
    * **transformations** (are lazily evaluated) 
      * map
      * flatMap
      * filter
      * distinct
      * ...
    * **actions** (actual evaluation happens)
      * count
      * reduce
      * take
      * collect
      * takeOrdered
      * ...
* Spark transformations enable us to create new RDDs from an existing RDD.
* RDD transformations are lazy evaluations (results are not computed right away)
* Spark remembers the set of transformations that are applied to a base data set (this is the lineage graph of RDD) 
* The allows Spark to automatically recover RDDs from failures and slow workers.
* The lineage graph is a recipe for creating a result and it can be optimized before execution.
* A transformed RDD is executed only when an action runs on it.
* You can also persist, or cache, RDDs in memory or on disk (this speeds up iterative ML algorithms that transforms the initial RDD iteratively).
* Here is a great reference URL for working with Spark.
    * [The latest Spark programming guide](http://spark.apache.org/docs/latest/programming-guide.html).

The Abstraction of Resilient Distributed Dataset (RDD)

RDD is a fault-tolerant collection of elements that can be operated on in parallel

Two types of Operations are possible on an RDD

Transformations
Actions

(watch now 2:26):

Transformations

(watch now 1:18):

Actions

(watch now 0:48):

Key Points

Resilient distributed datasets (RDDs) are the primary abstraction in Spark.
RDDs are immutable once created:
- can transform it.
- can perform actions on it.
- but cannot change an RDD once you construct it.
Spark tracks each RDD's lineage information or recipe to enable its efficient recomputation if a machine fails.
RDDs enable operations on collections of elements in parallel.
We can construct RDDs by:
- parallelizing Scala collections such as lists or arrays
- by transforming an existing RDD,
- from files in distributed file systems such as (HDFS, S3, etc.).
We can specify the number of partitions for an RDD
The more partitions in an RDD, the more opportunities for parallelism
There are two types of operations you can perform on an RDD:
- transformations (are lazily evaluated)
  - map
  - flatMap
  - filter
  - distinct
  - ...
- actions (actual evaluation happens)
  - count
  - reduce
  - take
  - collect
  - takeOrdered
  - ...
Spark transformations enable us to create new RDDs from an existing RDD.
RDD transformations are lazy evaluations (results are not computed right away)
Spark remembers the set of transformations that are applied to a base data set (this is the lineage graph of RDD)
The allows Spark to automatically recover RDDs from failures and slow workers.
The lineage graph is a recipe for creating a result and it can be optimized before execution.
A transformed RDD is executed only when an action runs on it.
You can also persist, or cache, RDDs in memory or on disk (this speeds up iterative ML algorithms that transforms the initial RDD iteratively).
Here is a great reference URL for working with Spark.
- The latest Spark programming guide.

We will do the following next:

Create an RDD using sc.parallelize
Perform the collect action on the RDD and find the number of partitions it is made of using getNumPartitions action
Perform the take action on the RDD
Transform the RDD by map to make another RDD
Transform the RDD by filter to make another RDD
Perform the reduce action on the RDD
Transform the RDD by flatMap to make another RDD
Create a Pair RDD
Perform some transformations on a Pair RDD
Where in the cluster is your computation running?
Shipping Closures, Broadcast Variables and Accumulator Variables
Spark Essentials: Summary
HOMEWORK

println(spark)

org.apache.spark.sql.SparkSession@39898dfa

println(sc)
println(sqlContext)

org.apache.spark.SparkContext@1b4c5f0a org.apache.spark.sql.hive.HiveContext@c8d26a

val x = sc.parallelize(Array(1, 2, 3), 2)    // <Ctrl+Enter> to evaluate this cell (using 2 partitions)

x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[73] at parallelize at command-112937334110662:1

//x.  // place the cursor after 'x.' and hit Tab to see the methods available for the RDD x we created

x.collect()    // <Ctrl+Enter> to collect (action) elements of rdd; should be (1, 2, 3)

res3: Array[Int] = Array(1, 2, 3)

// <Ctrl+Enter> to evaluate this cell and find the number of partitions in RDD x
x.getNumPartitions

res4: Int = 2

x.glom().collect() // glom() flattens elements on the same partition

res5: Array[Array[Int]] = Array(Array(1), Array(2, 3))

val x = sc.parallelize(Seq(1, 2, 3))    // <Shift+Enter> to evaluate this cell (using default number of partitions)

x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[87] at parallelize at command-112937334110675:1

x.getNumPartitions // <Shift+Enter> to evaluate this cell

res13: Int = 8

x.glom().collect() // <Ctrl+Enter> to evaluate this cell

res14: Array[Array[Int]] = Array(Array(), Array(), Array(1), Array(), Array(), Array(2), Array(), Array(3))

x.take(2) // Ctrl+Enter to take two elements from the RDD x

res15: Array[Int] = Array(1, 2)

//x.take(  ) // uncomment by removing '//' before x in the cell and fill in the parenthesis to take just one element from RDD x and Cntrl+Enter

// Shift+Enter to make RDD x and RDD y that is mapped from x
val x = sc.parallelize(Array("b", "a", "c")) // make RDD x: [b, a, c]
val y = x.map(z => (z,1))                    // map x into RDD y: [(b, 1), (a, 1), (c, 1)]

x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[89] at parallelize at command-112937334110684:2 y: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[90] at map at command-112937334110684:3

// Cntrl+Enter to collect and print the two RDDs
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))

b, a, c (b,1), (a,1), (c,1)

//Shift+Enter to make RDD x and filter it by (n => n%2 == 1) to make RDD y
val x = sc.parallelize(Array(1,2,3))
// the closure (n => n%2 == 1) in the filter will 
// return True if element n in RDD x has remainder 1 when divided by 2 (i.e., if n is odd)
val y = x.filter(n => n%2 == 1)

x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[91] at parallelize at command-112937334110688:2 y: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[92] at filter at command-112937334110688:5

// Cntrl+Enter to collect and print the two RDDs
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
//y.collect()

1, 2, 3 1, 3

//Shift+Enter to make RDD x of inteegrs 1,2,3,4 and reduce it to sum
val x = sc.parallelize(Array(1,2,3,4))
val y = x.reduce((a,b) => a+b)

x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[99] at parallelize at command-112937334110692:2 y: Int = 10

//Cntrl+Enter to collect and print RDD x and the Int y, sum of x
println(x.collect.mkString(", "))
println(y)

1, 2, 3, 4 10

//Shift+Enter to make RDD x and flatMap it into RDD by closure (n => Array(n, n*100, 42))
val x = sc.parallelize(Array(1,2,3))
val y = x.flatMap(n => Array(n, n*100, 42))

x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[100] at parallelize at command-112937334110696:2 y: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[101] at flatMap at command-112937334110696:3

//Cntrl+Enter to collect and print RDDs x and y
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))

1, 2, 3 1, 100, 42, 2, 200, 42, 3, 300, 42

// Cntrl+Enter to make RDD words and display it by collect
val words = sc.parallelize(Array("a", "b", "a", "a", "b", "b", "a", "a", "a", "b", "b"))
words.collect()

words: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[114] at parallelize at command-112937334110699:2 res30: Array[String] = Array(a, b, a, a, b, b, a, a, a, b, b)

// Cntrl+Enter to make and collect Pair RDD wordCountPairRDD
val wordCountPairRDD = words.map(s => (s, 1))
wordCountPairRDD.collect()

wordCountPairRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[115] at map at command-112937334110701:2 res31: Array[(String, Int)] = Array((a,1), (b,1), (a,1), (a,1), (b,1), (b,1), (a,1), (a,1), (a,1), (b,1), (b,1))

9. Perform some transformations on a Pair RDD

Let's next work with RDD of (key,value) pairs called a Pair RDD or Key-Value RDD.

Now some of the Key-Value transformations that we could perform include the following.

reduceByKey transformation
- which takes an RDD and returns a new RDD of key-value pairs, such that:
  - the values for each key are aggregated using the given reduced function
  - and the reduce function has to be of the type that takes two values and returns one value.
sortByKey transformation
- this returns a new RDD of key-value pairs that's sorted by keys in ascending order
groupByKey transformation
- this returns a new RDD consisting of key and iterable-valued pairs.

Let's see some concrete examples next.

// Cntrl+Enter to reduceByKey and collect wordcounts RDD
//val wordcounts = wordCountPairRDD.reduceByKey( _ + _ )
val wordcounts = wordCountPairRDD.reduceByKey( (value1, value2) => value1 + value2 )
wordcounts.collect()

wordcounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[116] at reduceByKey at command-112937334110704:3 res32: Array[(String, Int)] = Array((a,6), (b,5))

//Cntrl+Enter to make words RDD and do the word count in two lines
val words = sc.parallelize(Array("a", "b", "a", "a", "b", "b", "a", "a", "a", "b", "b"))
val wordcounts = words
                    .map(s => (s, 1))
                    .reduceByKey(_ + _)
                    .collect()

words: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[117] at parallelize at command-112937334110706:2 wordcounts: Array[(String, Int)] = Array((a,6), (b,5))

// Shift+Enter and comprehend code
val words = sc.parallelize(Array("a", "b", "a", "a", "b", "b", "a", "a", "a", "b", "b"))
val wordCountPairRDD = words.map(s => (s, 1))
val wordCountPairRDDSortedByKey = wordCountPairRDD.sortByKey()

words: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[120] at parallelize at command-112937334110708:2 wordCountPairRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[121] at map at command-112937334110708:3 wordCountPairRDDSortedByKey: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[124] at sortByKey at command-112937334110708:4

wordCountPairRDD.collect() // Shift+Enter and comprehend code

res33: Array[(String, Int)] = Array((a,1), (b,1), (a,1), (a,1), (b,1), (b,1), (a,1), (a,1), (a,1), (b,1), (b,1))

wordCountPairRDDSortedByKey.collect() // Cntrl+Enter and comprehend code

res34: Array[(String, Int)] = Array((a,1), (a,1), (a,1), (a,1), (a,1), (a,1), (b,1), (b,1), (b,1), (b,1), (b,1))

The next key value transformation we will see is groupByKey

When we apply the groupByKey transformation to wordCountPairRDD we end up with a new RDD that contains two elements. The first element is the tuple b and an iterable CompactBuffer(1,1,1,1,1) obtained by grouping the value 1 for each of the five key value pairs (b,1). Similarly the second element is the key a and an iterable CompactBuffer(1,1,1,1,1,1) obtained by grouping the value 1 for each of the six key value pairs (a,1).

CAUTION: groupByKey can cause a large amount of data movement across the network. It also can create very large iterables at a worker. Imagine you have an RDD where you have 1 billion pairs that have the key a. All of the values will have to fit in a single worker if you use group by key. So instead of a group by key, consider using reduced by key.

val wordCountPairRDDGroupByKey = wordCountPairRDD.groupByKey() // <Shift+Enter> CAUTION: this transformation can be very wide!

wordCountPairRDDGroupByKey: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[125] at groupByKey at command-112937334110713:1

wordCountPairRDDGroupByKey.collect()  // Cntrl+Enter

res35: Array[(String, Iterable[Int])] = Array((a,CompactBuffer(1, 1, 1, 1, 1, 1)), (b,CompactBuffer(1, 1, 1, 1, 1)))

val list = 1 to 10
var sum = 0
list.map(x => sum = sum + x)
print(sum)

55list: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) sum: Int = 55

val rdd = sc.parallelize(1 to 10)
var sum = 0

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[126] at parallelize at command-112937334110717:1 sum: Int = 0

val rdd1 = rdd.map(x => sum = sum + x)

rdd1: org.apache.spark.rdd.RDD[Unit] = MapPartitionsRDD[127] at map at command-112937334110718:1

rdd1.collect()

res37: Array[Unit] = Array((), (), (), (), (), (), (), (), (), ())

val rdd1 = rdd.map(x => 
                   {var sum = 0;
                         sum = sum + x
                         sum}
                  )

rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[130] at map at command-112937334110720:1

rdd1.collect()

res40: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

%md
### 11. Shipping Closures, Broadcast Variables and Accumulator Variables

#### Closures, Broadcast and Accumulator Variables
**(watch now 2:06)**:

[![Closures, Broadcast and Accumulators by Anthony Joseph in BerkeleyX/CS100.1x](http://img.youtube.com/vi/I9Zcr4R35Ao/0.jpg)](https://www.youtube.com/watch?v=I9Zcr4R35Ao?rel=0&autoplay=1&modestbranding=1)


We will use these variables in the sequel.

#### SUMMARY
Spark automatically creates closures 

  * for functions that run on RDDs at workers,
  * and for any global variables that are used by those workers
  * one closure per worker is sent with every task
  * and there's no communication between workers
  * closures are one way from the driver to the worker
  * any changes that you make to the global variables at the workers 
    * are not sent to the driver or
    * are not sent to other workers.
  
    
 The problem we have is that these closures
 
   * are automatically created are sent or re-sent with every job
   * with a large global variable it gets inefficient to send/resend lots of data to each worker
   * we cannot communicate that back to the driver
  
  
 To do this, Spark provides shared variables in two different types.
 
  * **broadcast variables**
    * lets us to efficiently send large read-only values to all of the workers
    * these are saved at the workers for use in one or more Spark operations.    
  * **accumulator variables**
    * These allow us to aggregate values from workers back to the driver.
    * only the driver can access the value of the accumulator 
    * for the tasks, the accumulators are basically write-only
    
 ***
 
 ### 12. Spark Essentials: Summary
 **(watch now: 0:29)**
 
[![Spark Essentials Summary by Anthony Joseph in BerkeleyX/CS100.1x](http://img.youtube.com/vi/F50Vty9Ia8Y/0.jpg)](https://www.youtube.com/watch?v=F50Vty9Ia8Y?rel=0&autoplay=1&modestbranding=1)

*NOTE:* In databricks cluster, we (the course coordinator/administrators) set the number of workers for you.

11. Shipping Closures, Broadcast Variables and Accumulator Variables

Closures, Broadcast and Accumulator Variables

(watch now 2:06):

We will use these variables in the sequel.

SUMMARY

Spark automatically creates closures

for functions that run on RDDs at workers,
and for any global variables that are used by those workers
one closure per worker is sent with every task
and there's no communication between workers
closures are one way from the driver to the worker
any changes that you make to the global variables at the workers
- are not sent to the driver or
- are not sent to other workers.

The problem we have is that these closures

are automatically created are sent or re-sent with every job
with a large global variable it gets inefficient to send/resend lots of data to each worker
we cannot communicate that back to the driver

To do this, Spark provides shared variables in two different types.

broadcast variables
- lets us to efficiently send large read-only values to all of the workers
- these are saved at the workers for use in one or more Spark operations.
accumulator variables
- These allow us to aggregate values from workers back to the driver.
- only the driver can access the value of the accumulator
- for the tasks, the accumulators are basically write-only

12. Spark Essentials: Summary

(watch now: 0:29)

NOTE: In databricks cluster, we (the course coordinator/administrators) set the number of workers for you.

import scala.math._
val x = min(1, 10)

import scala.math._ x: Int = 1

import java.util.HashMap
val map = new HashMap[String, Int]()
map.put("a", 1)
map.put("b", 2)
map.put("c", 3)
map.put("d", 4)
map.put("e", 5)

import java.util.HashMap map: java.util.HashMap[String,Int] = {a=1, b=2, c=3, d=4, e=5} res41: Int = 0

SDS-2.x, Scalable Data Engineering Science

Introduction to Spark

Spark Essentials: RDDs, Transformations and Actions

Spark Cluster Overview:

Driver Program, Cluster Manager and Worker Nodes

The Abstraction of Resilient Distributed Dataset (RDD)

RDD is a fault-tolerant collection of elements that can be operated on in parallel

Two types of Operations are possible on an RDD

Transformations

Actions

Key Points

Let us get our hands dirty in Spark implementing these ideas!

DO NOW

Let us look at the legend and overview of the visual RDD Api by doing the following first:

Running Spark

We will do the following next:

Entry Point

1. Create an RDD using sc.parallelize

2. Perform the collect action on the RDD and find the number of partitions it is made of using getNumPartitions action

Let us look at the collect action in detail and return here to try out the example codes.

Let us look at the getNumPartitions action in detail and return here to try out the example codes.

You Try!

3. Perform the take action on the RDD

You Try!

4. Transform the RDD by map to make another RDD

Let us look at the map transformation in detail and return here to try out the example codes.

5. Transform the RDD by filter to make another RDD

Let us look at the filter transformation in detail and return here to try out the example codes.

6. Perform the reduce action on the RDD

Let us look at the reduce action in detail and return here to try out the example codes.

7. Transform an RDD by flatMap to make another RDD

Let us look at the flatMap transformation in detail and return here to try out the example codes.

8. Create a Pair RDD

9. Perform some transformations on a Pair RDD

You Try!

10. Where in the cluster is your computation running?

11. Shipping Closures, Broadcast Variables and Accumulator Variables

Closures, Broadcast and Accumulator Variables

SUMMARY

12. Spark Essentials: Summary

13. HOMEWORK

Importing Standard Scala and Java libraries

1. Create an RDD using `sc.parallelize`

2. Perform the `collect` action on the RDD and find the number of partitions it is made of using `getNumPartitions` action

3. Perform the `take` action on the RDD

4. Transform the RDD by `map` to make another RDD

5. Transform the RDD by `filter` to make another RDD

6. Perform the `reduce` action on the RDD

7. Transform an RDD by `flatMap` to make another RDD