ScaDaMaLe Course site and book

Scala Crash Course Continued

Recall!

Scala Resources

You will not be learning Scala systematically and thoroughly in this course. You will learn to use Scala by doing various Spark jobs.

If you are interested in learning Scala properly, then there are various resources, including:

The main sources for the following content are (you are encouraged to read them for more background):

//%run "/scalable-data-science/xtraResources/support/sdsFunctions"
//This allows easy embedding of publicly available information into any other notebook
//when viewing in git-book just ignore this block - you may have to manually chase the URL in frameIt("URL").
//Example usage:
// displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Topics_in_LDA",250))
def frameIt( u:String, h:Int ) : String = {
      """<iframe 
 src=""""+ u+""""
 width="95%" height="""" + h + """"
 sandbox>
  <p>
    <a href="http://spark.apache.org/docs/latest/index.html">
      Fallback link for browsers that, unlikely, don't support frames
    </a>
  </p>
</iframe>"""
   }
frameIt: (u: String, h: Int)String

Let's continue getting our hands dirty in Scala

We will go through the remaining programming concepts and tasks by building on https://docs.scala-lang.org/tour/basics.html.

  • Scala Types
  • Expressions and Printing
  • Naming and Assignments
  • Functions and Methods in Scala
  • Classes and Case Classes
  • Methods and Tab-completion
  • Objects and Traits
  • Collections in Scala and Type Hierarchy
  • Functional Programming and MapReduce
  • Lazy Evaluations and Recursions

Remark: You need to take a computer science course (from Coursera, for example) to properly learn Scala. Here, we will learn to use Scala by example to accomplish the data science tasks at hand. You can learn more Scala as needed from the various sources pointed out above in Scala Resources.

Scala Type Hierarchy

In Scala, all values have a type, including numerical values and functions. The diagram below illustrates a subset of the type hierarchy.

For now, notice some common types we will be using, including Int, String, Double, Unit, Boolean, List, etc. For more details see https://docs.scala-lang.org/tour/unified-types.html.

Let us take a closer look at Scala Type Hierarchy now.

displayHTML(frameIt("https://docs.scala-lang.org/tour/unified-types.html",550))

Scala Collections

Familiarize yourself with the main Scala collections classes here:

displayHTML(frameIt("https://docs.scala-lang.org/overviews/scala-book/collections-101.html",550))

List

Lists are one of the most basic data structures.

There are several other Scala collections and we will introduce them as needed. The other most common ones are Vector, Array, Seq, and ArrayBuffer.

For details on List see https://docs.scala-lang.org/overviews/scala-book/list-class.html.

// <Ctrl+Enter> to declare (an immutable) val lst as List of Int's 1,2,3
val lst = List(1, 2, 3)
lst: List[Int] = List(1, 2, 3)
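
As a minimal hedged sketch of a few basic List operations on the lst declared above: since List is immutable, each operation returns a new value and lst itself never changes.

lst.head      // first element: Int 1
lst.tail      // all but the first element: List(2, 3)
0 :: lst      // prepend with :: to get a new List(0, 1, 2, 3); lst itself is unchanged
lst(2)        // zero-based positional access: Int 3 (linear time for List)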

Vectors

The Vector class is an indexed, immutable sequence. The “indexed” part of the description means that you can access Vector elements very rapidly by their index value, such as accessing listOfPeople(999999).

In general, except for the difference that Vector is indexed and List is not, the two classes work the same, so we’ll run through these examples quickly.

For details see https://docs.scala-lang.org/overviews/scala-book/vector-class.html.

val vec = Vector(1,2,3)
vec: scala.collection.immutable.Vector[Int] = Vector(1, 2, 3)
val arr = Array(1,2,3) // <Shift-Enter> to declare an Array
arr: Array[Int] = Array(1, 2, 3)
val seq = Seq(1,2,3)    // <Shift-Enter> to declare a Seq
seq: Seq[Int] = List(1, 2, 3)
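
As a minimal sketch with the vec declared above: indexed access is fast, and an "update" simply returns a new Vector, leaving the original unchanged.

vec(2)                         // fast zero-based indexed access: Int 3
val vec2 = vec.updated(0, 10)  // a new Vector(10, 2, 3); vec itself is unchanged
vec                            // still Vector(1, 2, 3)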

Tuples

A tuple is a neat class that gives you a simple way to store heterogeneous (different) items in the same container. We will use tuples for key-value pairs in Spark.

See https://docs.scala-lang.org/overviews/scala-book/tuples.html

val myTuple = ('a',1) // a 2-tuple
myTuple: (Char, Int) = (a,1)
myTuple._1 // accessing the first element of the tuple. NOTE: tuple element indices start at 1, not 0
res2: Char = a
myTuple._2 // accessing the second element of the tuple
res3: Int = 1
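
To preview how tuples will serve as key-value pairs in Spark, here is a minimal sketch (the names are ours, purely for illustration) that destructures a list of 2-tuples by pattern matching:

val keyValuePairs = List(("a", 1), ("b", 2))     // a List of (key, value) 2-tuples
keyValuePairs.map { case (k, v) => (k, v + 10) } // destructure each pair: List((a,11), (b,12))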

Functional Programming and MapReduce

"Functional programming is a style of programming that emphasizes writing applications using only pure functions and immutable values. As Alvin Alexander wrote in Functional Programming, Simplified, rather than using that description, it can be helpful to say that functional programmers have an extremely strong desire to see their code as math — to see the combination of their functions as a series of algebraic equations. In that regard, you could say that functional programmers like to think of themselves as mathematicians. That’s the driving desire that leads them to use only pure functions and immutable values, because that’s what you use in algebra and other forms of math."

See https://docs.scala-lang.org/overviews/scala-book/functional-programming.html for short lessons in functional programming.

We will apply functions to process elements of a Scala collection to quickly demonstrate functional programming.

Five ways of adding 1

The first four use anonymous functions and the last one uses a named method.

  1. explicit version:
(x: Int) => x + 1
  2. type-inferred, more intuitive version:
x => x + 1
  3. placeholder syntax (each argument must be used exactly once):
_ + 1
  4. type-inferred version with a code block for a larger function body:
x => { 
      // body is a block of code
      val integerToAdd = 1
      x + integerToAdd
}
  5. as a named method using def:
def addOne(x: Int): Int = x + 1
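
As a minimal check (with our own illustrative names), the anonymous forms can be assigned to values and applied just like the method addOne, all yielding the same result:

val addOneExplicit = (x: Int) => x + 1    // form 1 assigned to a value
val addOnePlaceholder: Int => Int = _ + 1 // form 3 needs a type ascription here
addOneExplicit(5)    // Int 6
addOnePlaceholder(5) // Int 6
addOne(5)            // Int 6, using the method defined in form 5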

Now, let's do some functional programming over a Scala collection (a List) using some of its methods: map, filter and reduce. In the end we will write our first mapReduce program!

For more details see:

displayHTML(frameIt("https://superruzafa.github.io/visual-scala-reference/map/",500))
// <Shift+Enter> to map each value x of lst with x+10 to return a new List(11, 12, 13)
lst.map(x => x + 10)  
res6: List[Int] = List(11, 12, 13)
// <Shift+Enter> for the same as above using place-holder syntax
lst.map( _ + 10)  
res7: List[Int] = List(11, 12, 13)
displayHTML(frameIt("https://superruzafa.github.io/visual-scala-reference/filter/",600))
// <Shift+Enter> to return a new List(1, 3) after filtering x's from lst if (x % 2 == 1) is true
lst.filter(x => (x % 2 == 1) )
res9: List[Int] = List(1, 3)
// <Shift+Enter> for the same as above using place-holder syntax
lst.filter( _ % 2 == 1 )
res10: List[Int] = List(1, 3)
displayHTML(frameIt("https://superruzafa.github.io/visual-scala-reference/reduce/",600))
// <Shift+Enter> to use reduce to add elements of lst two at a time to return Int 6
lst.reduce( (x, y) => x + y )
res12: Int = 6
// <Ctrl+Enter> for the same as above but using place-holder syntax
lst.reduce( _ + _ )
res13: Int = 6

Let's combine the map and reduce programs above to find the sum after 10 has been added to every element of the original List lst, as follows:

lst.map(x => x+10)
   .reduce((x,y) => x+y) // <Ctrl-Enter> to get Int 36 = sum(1+10,2+10,3+10)
res14: Int = 36
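
As one more minimal sketch (our own chain, combining all three methods above): keep the odd elements of lst, add 10 to each, and sum the results.

lst.filter( _ % 2 == 1 ) // List(1, 3)
   .map( _ + 10 )        // List(11, 13)
   .reduce( _ + _ )      // Int 24 = (1+10) + (3+10)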

Exercise in Functional Programming

You should spend an hour or so going through the Functional Programming Section of the Scala Book:

displayHTML(frameIt("https://docs.scala-lang.org/overviews/scala-book/functional-programming.html",700))

There are lots of methods in Scala Collections. And much more in this scalable language. See for example http://docs.scala-lang.org/cheatsheets/index.html.

Lazy Evaluation

Another powerful programming concept we will need is lazy evaluation -- a form of delayed evaluation. The value of a lazily evaluated expression is only computed when it is actually needed.

This is to be contrasted with eager evaluation that we have seen so far -- an expression is immediately evaluated.

val eagerImmutableInt = 1 // eagerly evaluated as 1
eagerImmutableInt: Int = 1
var eagerMutableInt = 2 // eagerly evaluated as 2
eagerMutableInt: Int = 2

Let's demonstrate lazy evaluation using a getTime method and the keyword lazy.

import java.util.Calendar
import java.util.Calendar
lazy val lazyImmutableTime = Calendar.getInstance.getTime // lazily defined and not evaluated immediately
lazyImmutableTime: java.util.Date = <lazy>
val eagerImmutableTime = Calendar.getInstance.getTime // eagerly evaluated immediately
eagerImmutableTime: java.util.Date = Wed Jan 13 23:54:53 UTC 2021
println(lazyImmutableTime) // evaluated only now, when actually needed by println
Wed Jan 13 23:54:53 UTC 2021
println(eagerImmutableTime) // prints what was already evaluated eagerly
Wed Jan 13 23:54:53 UTC 2021
def lazyDefinedInt = 5 // you can also use a method to lazily define a value
lazyDefinedInt: Int
lazyDefinedInt // only evaluated now
res18: Int = 5

See https://www.scala-exercises.org/scalatutorial/lazyevaluation for more details including the following example with StringBuilder.

val builder = new StringBuilder
builder: StringBuilder =
val x = { builder += 'x'; 1 } // eagerly evaluates x as 1 after appending 'x' to builder. NOTE: ';' is used to separate multiple expressions on the same line
x: Int = 1
builder.result()
res19: String = x
x
res20: Int = 1
builder.result() // calling x again should not append x again to builder
res21: String = x
lazy val y = { builder += 'y'; 2 } // lazily evaluate y later when it is called
y: Int = <lazy>
builder.result() // builder should remain unchanged
res22: String = x
def z = { builder += 'z'; 3 } // lazily evaluate z later when the method is called
z: Int
builder.result() // builder should remain unchanged
res23: String = x

What should builder.result() be after the following arithmetic expression involving x, y, and z is evaluated?

z + y + x + z + y + x
res24: Int = 12

Lazy Evaluation Exercise - You Try Now!

Understand why the output above is what it is!

  • Why does z appear in the final builder string differently from x and y after we evaluate the expression below?
z + y + x + z + y + x
builder.result() 
res25: String = xzyz

Why Lazy?

Imagine a more complex expression involving the evaluation of millions of values. Lazy evaluation will allow us to actually compute with big data when it may become impossible to hold all the values in memory. This is exactly what Apache Spark does as we will see.
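
As a minimal sketch of this idea (assuming Scala 2.12, where the lazily evaluated sequence is Stream; in Scala 2.13+ use LazyList instead), we can describe a conceptually unbounded sequence and only ever evaluate the handful of elements we need:

val evens = Stream.from(1).map(_ * 2) // conceptually infinite; the tail is only evaluated on demand
evens.take(5).toList                  // only now are 5 elements materialized: List(2, 4, 6, 8, 10)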

Recursions

Recursion is a powerful technique in which a function calls itself (possibly via other functions) until some terminal condition is reached.

Here we want to distinguish between two ways of implementing a recursion using a simple example of factorial.

Recall that for any natural number \(n\), its factorial is denoted and defined as follows:

\[ n! := n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1 \]

which has the following recursive expression:

\[ n! = n \times (n-1)! \, , \qquad 0! = 1 \]

Let us implement it using two approaches: a naive approach that can run out of memory and another tail-recursive approach that uses constant memory. Read https://www.scala-exercises.org/scalatutorial/tailrecursion for details.

def factorialNaive(n: Int): Int =
  if (n == 0) 1 else n * factorialNaive(n - 1)
factorialNaive: (n: Int)Int
factorialNaive(4)
res26: Int = 24

When factorialNaive(4) was evaluated above the following steps were actually done:

factorialNaive(4)
if (4 == 0) 1 else 4 * factorialNaive(4 - 1)
4 * factorialNaive(3)
4 * (3 * factorialNaive(2))
4 * (3 * (2 * factorialNaive(1)))
4 * (3 * (2 * (1 * factorialNaive(0))))
4 * (3 * (2 * (1 * 1)))
24

Notice how we add one more element to our expression at each recursive call. Our expression becomes bigger and bigger until we finally reduce it to the final value. The final expression is a directed acyclic graph (DAG) of pairwise multiplications: a right-branching binary tree whose leaves are the input integers and whose internal nodes are the binary * operator. It can get very large when the input n is large.

Tail recursion is a sophisticated way of implementing certain recursions so that memory requirements can be kept constant, as opposed to naive recursions.

Tail Recursion

That difference in the rewriting rules actually translates directly to a difference in the actual execution on a computer. In fact, it turns out that if you have a recursive function that calls itself as its last action, then you can reuse the stack frame of that function. This is called tail recursion.

And by applying that trick, a tail recursive function can execute in constant stack space, so it's really just another formulation of an iterative process. We could say a tail recursive function is the functional form of a loop, and it executes just as efficiently as a loop.

Implementation of tail recursion in the exercise below uses a Scala annotation, which is a way to associate meta-information with definitions. In our case, the annotation @tailrec ensures that a method is indeed tail-recursive. See the last link to understand how memory requirements can be kept constant in tail recursions.

We mainly want you to know that tail recursions are an important functional programming concept.
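
To make this concrete without giving away the exercise below, here is a minimal sketch (our own illustrative example, not factorial) of a tail-recursive sum of 1 to n, where the recursive call to go is the very last action and the running result is carried along in an accumulator:

import scala.annotation.tailrec

def sumTo(n: Int): Int = {
  @tailrec
  def go(x: Int, acc: Int): Int =
    if (x == 0) acc          // terminal condition: return the accumulated result
    else go(x - 1, acc + x)  // tail call: nothing remains to be done after go returns

  go(n, 0)
}
sumTo(100) // Int 5050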

Tail Recursion Exercise - You Try Now

Replace ??? with the correct values to make this a tail recursion for factorial.

import scala.annotation.tailrec

// replace ??? with the right values to make this a tail recursion for factorial
def factorialTail(n: Int): Int = {
  @tailrec
  def iter(x: Int, result: Int): Int =
    if ( x == ??? ) result
    else iter(x - 1, result * x)

  iter( n, ??? )
}
factorialTail(3) //shouldBe 6
factorialTail(4) //shouldBe 24

Functional Programming is a vast subject and we are merely covering the fewest core ideas needed to get started with Apache Spark as soon as possible.

We will return to more concepts as we need them in the sequel.