ScaDaMaLe Course site and book

Scala Crash Course

Here we take a minimalist approach to learning just enough Scala, the language that Apache Spark is written in, to be able to use Spark effectively.

We will pick up more Scala concepts as they arise later on. For a deeper dive on your own time, follow the pointers in this crash course.

There are two basic ways in which we can learn Scala:

1. Learn Scala in a notebook environment

For convenience we use Databricks Scala notebooks like this one.

You can learn Scala locally on your own computer using Scala REPL (and Spark using Spark-Shell).

2. Learn Scala on your own computer

The easiest way to get Scala locally is through sbt, the Scala Build Tool. You can also use an IDE that integrates sbt.

See https://docs.scala-lang.org/getting-started/index.html to set up Scala on your own computer.

Software Engineering NOTE: If you completed TASK 2 for Cloud-free Computing Environment in the notebook prefixed 002_00 using dockerCompose (optional exercise), then you will have Scala 2.11 with sbt and Spark 2.4 inside Docker services that you can start and stop locally. Using Docker volume binds, you can also connect the Docker container and its services (including local Zeppelin or Jupyter notebook servers as well as the Hadoop file system) to IDEs on your machine.

Scala Resources

You will not be learning Scala systematically and thoroughly in this course. You will learn to use Scala by doing various Spark jobs.

If you are interested in learning Scala properly, then there are various resources, including:

The main sources for the following content are (you are encouraged to read them for more background):

What is Scala?

"Scala smoothly integrates object-oriented and functional programming. It is designed to express common programming patterns in a concise, elegant, and type-safe way." by Martin Odersky.

  • High-level language for the Java Virtual Machine (JVM)
  • Object oriented + functional programming
  • Statically typed
  • Comparable in speed to Java
  • Type inference saves us from having to write explicit types most of the time
  • Interoperates with Java
    • Can use any Java class (inherit from, etc.)
    • Can be called from Java code

See a quick tour here:

Why Scala?

  • Spark was originally written in Scala, which allows concise function syntax and interactive use
  • Spark APIs for other languages include:
    • Java API for standalone use
    • Python API added to reach a wider user community of programmers
    • R API added more recently to reach a wider community of data analysts
    • Unfortunately, the Python and R APIs generally lag behind Spark's native Scala API (e.g., GraphX is currently only available in Scala, and typed Datasets are not available in Python or R as of 2020-09-18).
  • See Darren Wilkinson's 11 reasons for Scala as a platform for statistical computing and data science. It is embedded in-place below for your convenience.

Learn Scala in Notebook Environment


Run a Scala Cell

  • Run the following Scala cell.
  • Note: No special indicator (such as %md) is needed to create a Scala cell in a Scala notebook.
  • You know it is a Scala notebook because of the (Scala) appended to the name of this notebook.
  • Make sure the cell contents update before moving on.
  • Press Shift+Enter when in the cell to run it and proceed to the next cell.
    • The cell's contents should update.
    • Alternatively, press Ctrl+Enter when in a cell to run it without proceeding to the next cell.
  • Characters following // are comments in Scala.
1+1
res0: Int = 2
println(System.currentTimeMillis) // press Ctrl+Enter to evaluate println that prints its argument as a line
1610582084465

Let's get our hands dirty in Scala

We will go through the following programming concepts and tasks by building on https://docs.scala-lang.org/tour/basics.html.

  • Scala Types
  • Expressions and Printing
  • Naming and Assignments
  • Functions and Methods in Scala
  • Classes and Case Classes
  • Methods and Tab-completion
  • Objects and Traits
  • Collections in Scala and Type Hierarchy
  • Functional Programming and MapReduce
  • Lazy Evaluations and Recursions

Remark: You need to take a computer science course (from Coursera, for example) to properly learn Scala. Here, we will learn to use Scala by example to accomplish our data science tasks at hand. You can learn more Scala as needed from the various sources pointed out above in Scala Resources.

Scala Types

In Scala, all values have a type, including numerical values and functions. The diagram below illustrates a subset of the type hierarchy.

For now, notice some common types we will be using, including Int, String, Double, Unit, Boolean, List, etc. For more details see https://docs.scala-lang.org/tour/unified-types.html. We will return to this at the end of the notebook after a brief tour of Scala.

Expressions

Expressions are computable statements such as the 1+1 we have seen before.

1+1
res3: Int = 2

We can print the result of an evaluated expression as a line using println:

println(1+1) // printing 2
2
println("hej hej!") // printing a string
hej hej!

Naming and Assignments

value and variable as val and var

You can name the results of expressions using keywords val and var.

Let us assign the integer value 5 to x as follows:

val x : Int = 5 // <Ctrl+Enter> to declare a value x to be integer 5. 
x: Int = 5

x is a named result and it is a value since we used the keyword val when naming it.

Scala is statically typed, but it uses built-in type inference machinery to automatically figure out that x is an integer or Int type as follows. Let's declare a value x to be Int 5 next without explicitly using Int.

val x = 5    // <Ctrl+Enter> to declare a value x as Int 5 (type automatically inferred)
x: Int = 5

Let's declare x as a Double or double-precision floating-point type using decimal such as 5.0 (a digit has to follow the decimal point!)

val x = 5.0   // <Ctrl+Enter> to declare a value x as Double 5
x: Double = 5.0

Alternatively, we can assign x as a Double explicitly. Note that the decimal point is not needed in this case due to explicit typing as Double.

val x :  Double = 5    // <Ctrl+Enter> to declare a value x as Double 5 (type given explicitly, so the literal 5 is widened to 5.0)
x: Double = 5.0
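
The distinction between Int and Double matters in arithmetic. The following is a minimal sketch (the names count, intQuotient and doubleQuotient are hypothetical and not part of the original notebook) of how the inferred type follows from the literals used:

```scala
// Hypothetical names, for illustration only.
val count = 7                    // inferred as Int
val intQuotient = count / 2      // Int / Int is integer division, so this is 3
val doubleQuotient = count / 2.0 // the Int is widened to Double, so this is 3.5
println(intQuotient)             // prints 3
println(doubleQuotient)          // prints 3.5
```

If a Double result is wanted, it is enough for one of the operands to be a Double.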

Next note that labels need to be declared on first use. We have declared x to be a val which is short for value. This makes x immutable (cannot be changed).

Thus, x cannot simply be reassigned. Uncommenting the following code results in the error: ... error: reassignment to val.

//x = 10    //  uncomment and <Ctrl+Enter> to try to reassign val x to 10

Scala allows declaration of mutable variables as well using var, as follows:

var y = 2    // <Shift+Enter> to declare a variable y to be integer 2 and go to next cell
y: Int = 2
y = 3    // <Shift+Enter> to change the value of y to 3
y: Int = 3
y = y+1 // adds 1 to y
y: Int = 4
y += 2 // adds 2 to y
println(y) // the var y is 6 now
6

Blocks

You can combine expressions by surrounding them with { and }. This is called a block. The value of the last expression in the block is the value of the whole block.

println({
  val x = 1+1
  x+2 // expression in last line is returned for the block
})// prints 4
4
println({ val x=22; x+2})
24
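
Since a block is itself an expression, its value can be named like any other. A small sketch, assuming the hypothetical names blockResult, inner and doubled:

```scala
// Hypothetical names, for illustration only.
val blockResult: Int = {
  val inner = 10          // inner is visible only inside this block
  val doubled = inner * 2
  doubled + 1             // the last expression is the value of the block
}
println(blockResult)      // prints 21
```

The names inner and doubled do not exist outside the braces, so a block is also a convenient way to keep intermediate values out of the surrounding scope.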

Functions

Functions are expressions that have parameters. A function takes arguments as input and returns the value of an expression as output.

A function can be nameless or anonymous and simply return an output from a given input. For example, the following anonymous function returns the square of the input integer.

(x: Int) => x*x
res11: Int => Int = line186c28489fff404184da2d59bd09a90463.$read$$Lambda$5065/1820207503@597d20b

On the left of => is a list of parameters with name and type. On the right is an expression involving the parameters.

You can also name functions:

val multiplyByItself = (x: Int) => x*x
multiplyByItself: Int => Int = line186c28489fff404184da2d59bd09a90465.$read$$Lambda$5067/2036039718@12f273c8
println(multiplyByItself(10))
100

A function can have no parameters:

val howManyAmI = () => 1
howManyAmI: () => Int = line186c28489fff404184da2d59bd09a90469.$read$$Lambda$5070/1826556511@56f9e3f2
println(howManyAmI()) // 1
1

A function can have more than one parameter:

val multiplyTheseTwoIntegers = (a: Int, b: Int) => a*b
multiplyTheseTwoIntegers: (Int, Int) => Int = line186c28489fff404184da2d59bd09a90473.$read$$Lambda$5071/161461748@62178cca
println(multiplyTheseTwoIntegers(2,4)) // 8
8
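
Because functions are values, they can be passed to other methods or combined with each other. The following is a minimal sketch under that assumption. The names squareOf, addOne, numbers, squares and squareThenAddOne are hypothetical, while map and andThen are standard library methods on List and on one-argument functions respectively.

```scala
// Hypothetical names, for illustration only.
val squareOf = (x: Int) => x * x
val addOne = (x: Int) => x + 1

val numbers = List(1, 2, 3, 4)
val squares = numbers.map(squareOf)              // apply squareOf to every element

val squareThenAddOne = squareOf.andThen(addOne)  // first square, then add one

println(squares)                                 // prints List(1, 4, 9, 16)
println(squareThenAddOne(3))                     // prints 10
```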

Methods

Methods are very similar to functions, but a few key differences exist.

Methods use the def keyword followed by a name, parameter list(s), a return type, and a body.

def square(x: Int): Int = x*x    // <Shift+Enter> to define a function named square
square: (x: Int)Int

Note that the return type Int is specified after the parameter list and a :.

square(5)    // <Shift+Enter> to call this function on argument 5
res15: Int = 25
val y = 3    // <Shift+Enter> make val y as Int 3
y: Int = 3
square(y) // <Shift+Enter> to call the function on val y of the right argument type Int
res16: Int = 9
val x = 5.0     // let x be Double 5.0
x: Double = 5.0
//square(x) // <Shift+Enter> to call the function on val x of type Double will give type mismatch error
def square(x: Int): Int = { // <Shift+Enter> to declare function in a block
  val answer = x*x
  answer // the last line of the function block is returned
}
square: (x: Int)Int
square(5000)    // <Shift+Enter> to call the function
res18: Int = 25000000
// <Shift+Enter> to define function with input and output type as String
def announceAndEmit(text: String): String = 
{
  println(text)
  text // the last line of the function block is returned
}
announceAndEmit: (text: String)String

Scala has a return keyword but it is rarely used as the expression in the last line of the multi-line block is the method's return value.

// <Ctrl+Enter> to call function which prints as line and returns as String
announceAndEmit("roger  roger")
roger  roger
res19: String = roger  roger

A method can take multiple parameter lists:

def multiplyAndTranslate(x: Int, y: Int)(translateBy: Int): Int = (x * y) + translateBy
multiplyAndTranslate: (x: Int, y: Int)(translateBy: Int)Int
println(multiplyAndTranslate(2, 3)(4))  // (2*3)+4 = 10
10
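
One use of multiple parameter lists is to supply the lists one at a time. The sketch below uses the hypothetical names scaleAndShift and sixPlus. It relies on the type annotation Int => Int, which lets the compiler turn the partially applied method into a function of the remaining parameter.

```scala
// Hypothetical names, for illustration only.
def scaleAndShift(x: Int, y: Int)(shiftBy: Int): Int = (x * y) + shiftBy

// Supply only the first parameter list; the result is a function awaiting shiftBy.
val sixPlus: Int => Int = scaleAndShift(2, 3)

println(sixPlus(4))   // (2*3)+4, prints 10
println(sixPlus(10))  // (2*3)+10, prints 16
```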

A method can have no parameter lists at all:

def time: Long = System.currentTimeMillis
time: Long
println("Current time in milliseconds is " + time)
Current time in milliseconds is 1610582096790
println("Current time in milliseconds is " + time)
Current time in milliseconds is 1610582097046
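
Two of the differences between methods and function values can be sketched as follows. All names here (cube, cubeAsFunction, cubes, calls, countedViaDef, first, second) are hypothetical and only serve the illustration.

```scala
// Hypothetical names, for illustration only.
def cube(x: Int): Int = x * x * x        // a method, defined with def

// A method is not itself a value, but it is converted to a function value
// when a function type is expected (this conversion is called eta-expansion).
val cubeAsFunction: Int => Int = cube
val cubes = List(1, 2, 3).map(cube)      // List(1, 8, 27)

// The body of a parameterless method runs on every access,
// whereas a val is evaluated once, when it is defined.
var calls = 0
def countedViaDef: Int = { calls += 1; calls }
val first = countedViaDef                // 1
val second = countedViaDef               // 2
```

The second half mirrors the time example: each access to a parameterless method re-runs its body, which is why the two printed timestamps differ.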

Classes

The class keyword followed by the name and constructor parameters is used to define a class.

class Box(h: Int, w: Int, d: Int) {
  def printVolume(): Unit = println(h*w*d)
}
defined class Box
  • The return type of the method printVolume is Unit.
  • When the return type is Unit it indicates that there is nothing meaningful to return, similar to void in Java and C, but with a difference.
  • Because every Scala expression must have some value, there is actually a singleton value of type Unit, written () and carrying no information.
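
The single value of type Unit can be observed directly. A minimal sketch, assuming a hypothetical method reportVolume that is not part of the original notebook:

```scala
// Hypothetical method, used only to illustrate the Unit type.
def reportVolume(h: Int, w: Int, d: Int): Unit = println(h * w * d)

val result: Unit = reportVolume(1, 3, 2)  // prints 6 as a side effect
println(result)                           // prints (), the only value of type Unit
```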

We can make an instance of the class with the new keyword.

val my1Cube = new Box(1,1,1)
my1Cube: Box = line186c28489fff404184da2d59bd09a904107.$read$Box@6c4cbb75

And call the method on the instance.

my1Cube.printVolume() // 1
1

Our named instance my1Cube of the Box class cannot be reassigned to a different Box because it was declared with val.

A name declared with var can be reassigned to a different instance of the class.

var myVaryingCuboid = new Box(1,3,2)
myVaryingCuboid: Box = line186c28489fff404184da2d59bd09a904107.$read$Box@77404a48
myVaryingCuboid.printVolume()
6
myVaryingCuboid = new Box(1,1,1)
myVaryingCuboid: Box = line186c28489fff404184da2d59bd09a904107.$read$Box@748cdfd1
myVaryingCuboid.printVolume()
1

See https://docs.scala-lang.org/tour/classes.html for more details as needed.

Case Classes

Scala has a special type of class called a case class that can be defined with the case class keyword.

Unlike classes, whose instances are compared by reference, instances of case classes are immutable by default and compared by value. This makes them useful for defining rows of typed values in Spark.

case class Point(x: Int, y: Int, z: Int)
defined class Point

Case classes can be instantiated without the new keyword.

val point = Point(1, 2, 3)
val anotherPoint = Point(1, 2, 3)
val yetAnotherPoint = Point(2, 2, 2)
point: Point = Point(1,2,3)
anotherPoint: Point = Point(1,2,3)
yetAnotherPoint: Point = Point(2,2,2)

Instances of case classes are compared by value and not by reference.

if (point == anotherPoint) {
  println(point + " and " + anotherPoint + " are the same.")
} else {
  println(point + " and " + anotherPoint + " are different.")
} // Point(1,2,3) and Point(1,2,3) are the same.

if (point == yetAnotherPoint) {
  println(point + " and " + yetAnotherPoint + " are the same.")
} else {
  println(point + " and " + yetAnotherPoint + " are different.")
} // Point(1,2,3) and Point(2,2,2) are different.
Point(1,2,3) and Point(1,2,3) are the same.
Point(1,2,3) and Point(2,2,2) are different.
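
Since instances of case classes are immutable, a modified version is obtained by copying rather than by assignment. The sketch below uses a hypothetical case class Location (to avoid clashing with Point above) and the copy method that the compiler generates for every case class.

```scala
// Hypothetical case class, for illustration only.
case class Location(x: Int, y: Int, z: Int)

val origin = Location(0, 0, 0)
val shifted = origin.copy(x = 5)      // a new instance; origin is left unchanged
// origin.x = 5                       // would not compile: the fields of a case class are vals

println(origin)                       // prints Location(0,0,0)
println(shifted)                      // prints Location(5,0,0)
println(shifted == Location(5, 0, 0)) // prints true, since comparison is by value
```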

By contrast, instances of classes are compared by reference.

myVaryingCuboid.printVolume() // should be 1 x 1 x 1
1
my1Cube.printVolume()  // should be 1 x 1 x 1
1
if (myVaryingCuboid == my1Cube) {
  println("myVaryingCuboid and my1Cube are the same.")
} else {
  println("myVaryingCuboid and my1Cube are different.")
} // they are compared by reference and are not the same.
myVaryingCuboid and my1Cube are different.
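
Comparison by reference means that two names are equal only when they refer to the very same instance. A minimal sketch with a hypothetical plain class Crate:

```scala
// Hypothetical class, for illustration only.
class Crate(val h: Int, val w: Int, val d: Int)

val crateA = new Crate(1, 1, 1)
val crateB = new Crate(1, 1, 1)   // same contents, but a different instance
val sameCrate = crateA            // a second name for the same instance

println(crateA == crateB)         // prints false
println(crateA == sameCrate)      // prints true
```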

Methods and Tab-completion

Many methods of a class can be accessed by typing a dot (.) after the name of an instance.

val s  = "hi"    // <Ctrl+Enter> to declare val s to String "hi"
s: String = hi

You can place the cursor after . following a declared object and find out the methods available for it.

[Image: tab completion after s.]

Try doing this next.

//s.  // place cursor after the '.' and press Tab to see all available methods for s 

For example,

  • scroll down to contains and double-click on it.
  • This should lead to s.contains in your cell.
  • Now add an argument String to see if s contains the argument, for example, try:
    • s.contains("f")
    • s.contains("") and
    • s.contains("i")
//s    // <Shift+Enter> recall the value of String s
s.contains("f")     // <Shift+Enter> returns Boolean false since s does not contain the string "f"
res32: Boolean = false
s.contains("")    // <Shift+Enter> returns Boolean true since s contains the empty string ""
res33: Boolean = true
s.contains("i")    // <Ctrl+Enter> returns Boolean true since s contains the string "i"
res34: Boolean = true

Objects

Objects are single instances of their own definitions and are declared with the object keyword. You can think of them as singletons of their own classes.

object IdGenerator {
  private var currentId = 0
  def make(): Int = {
    currentId += 1
    currentId
  }
}
defined object IdGenerator

You can access an object through its name:

val newId: Int = IdGenerator.make()
val newerId: Int = IdGenerator.make()
newId: Int = 1
newerId: Int = 2
println(newId) // 1
println(newerId) // 2
1
2

Traits

Traits are abstract data types containing certain fields and methods. They can be defined using the trait keyword.

In Scala inheritance, a class can only extend one other class, but it can extend multiple traits.

trait Greeter {
  def greet(name: String): Unit
}
defined trait Greeter

Traits can have default implementations also.

trait Greeter {
  def greet(name: String): Unit =
    println("Hello, " + name + "!")
}
defined trait Greeter

You can extend traits with the extends keyword and override an implementation with the override keyword:

class DefaultGreeter extends Greeter

class SwedishGreeter extends Greeter {
  override def greet(name: String): Unit = {
    println("Hej hej, " + name + "!")
  }
}

class CustomizableGreeter(prefix: String, postfix: String) extends Greeter {
  override def greet(name: String): Unit = {
    println(prefix + name + postfix)
  }
}
defined class DefaultGreeter
defined class SwedishGreeter
defined class CustomizableGreeter

Instantiate the classes.

val greeter = new DefaultGreeter()
val swedishGreeter = new SwedishGreeter()
val customGreeter = new CustomizableGreeter("How are you, ", "?")
greeter: DefaultGreeter = line186c28489fff404184da2d59bd09a904155.$read$DefaultGreeter@5d7c7786
swedishGreeter: SwedishGreeter = line186c28489fff404184da2d59bd09a904155.$read$SwedishGreeter@a1c1128
customGreeter: CustomizableGreeter = line186c28489fff404184da2d59bd09a904155.$read$CustomizableGreeter@7c2dc867

Call the greet method in each case.

greeter.greet("Scala developer") // Hello, Scala developer!
swedishGreeter.greet("Scala developer") // Hej hej, Scala developer!
customGreeter.greet("Scala developer") // How are you, Scala developer?
Hello, Scala developer!
Hej hej, Scala developer!
How are you, Scala developer?

A class can also be made to extend multiple traits.
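
A minimal sketch of a class extending two traits, where further traits are added with the with keyword. The traits Polite and Loud and the class Announcer are hypothetical names introduced only for this illustration.

```scala
// Hypothetical traits and class, for illustration only.
trait Polite {
  def greetPolitely(who: String): String = "Good day, " + who + "."
}

trait Loud {
  def shout(text: String): String = text.toUpperCase + "!"
}

// The first trait follows extends, additional traits follow with.
class Announcer extends Polite with Loud

val announcer = new Announcer
println(announcer.greetPolitely("Scala developer")) // prints Good day, Scala developer.
println(announcer.shout("hej"))                     // prints HEJ!
```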

For more details see: https://docs.scala-lang.org/tour/traits.html.

Main Method

The main method is the entry point of a Scala program.

The Java Virtual Machine requires a main method, named main, that takes an array of strings as its only argument.

Using an object, you can define the main method as follows:

object Main {
  def main(args: Array[String]): Unit =
    println("Hello, Scala developer!")
}
defined object Main
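
In a notebook or the REPL, a main method is not run automatically, but it can be called like any other method. The sketch below assumes a hypothetical object Echo whose main method delegates to a helper named render, so that the handling of the argument array is easy to inspect.

```scala
// Hypothetical object, for illustration only.
object Echo {
  // Build the line to print from the command-line arguments.
  def render(args: Array[String]): String =
    if (args.isEmpty) "No arguments given." else args.mkString(" ")

  def main(args: Array[String]): Unit = println(render(args))
}

Echo.main(Array("hej", "hej"))      // prints hej hej
Echo.main(Array.empty[String])      // prints No arguments given.
```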

What I try not to do while learning a new language

  1. I don't immediately ask questions like: how can I do this particular variation of some small thing I just learned, so that I can use patterns I am used to from another language I am hooked on right now?
  2. First go through the detailed Scala Tour on your own and then through the 50-odd lessons in the Scala Book.
  3. Then return to 1. and ask detailed cross-language comparison questions, diving deep as needed into the source and the Scala docs (Google or DuckDuckGo search!).