Scala Crash Course
Here we take a minimalist approach to learning just enough Scala, the language that Apache Spark is written in, to be able to use Spark effectively.
In the sequel we can learn more Scala concepts as they arise. This learning can be done by chasing the pointers in this crash course for a detailed deeper dive on your own time.
There are two basic ways in which we can learn Scala:
1. Learn Scala in a notebook environment
For convenience we use Databricks Scala notebooks like this one here.
You can also learn Scala locally on your own computer using the Scala REPL (and Spark using spark-shell).
2. Learn Scala on your own computer
The easiest way to get Scala locally is through sbt, the Scala Build Tool. You can also use an IDE that integrates sbt.
See: https://docs.scala-lang.org/getting-started/index.html to set up Scala on your own computer.
Software Engineering NOTE: If you completed TASK 2 for Cloud-free Computing Environment in the notebook prefixed 002_00
using dockerCompose (optional exercise) then you will have Scala 2.11 with sbt and Spark 2.4 inside the docker services you can start and stop locally. Using docker volume binds you can also connect the docker container and its services (including local zeppelin or jupyter notebook servers as well as hadoop file system) to IDEs on your machine, etc.
Scala Resources
You will not be learning Scala systematically and thoroughly in this course. You will learn to use Scala by doing various Spark jobs.
If you are interested in learning Scala properly, then there are various resources, including:
- scala-lang.org is the core Scala resource. Bookmark the following three links:
- tour-of-scala - Bite-sized introductions to core language features.
- we will go through the tour in a hurry now as some Scala familiarity is needed immediately.
- scala-book - An online book introducing the main language features
- you are expected to use this resource to figure out Scala as needed.
- scala-cheatsheet - A handy cheatsheet covering the basics of Scala syntax.
- visual-scala-reference - This guide collects some of the most common functions of the Scala programming language and explains them conceptually and graphically in a simple way.
- Online Resources, including:
- Books
The main sources for the following content are (you are encouraged to read them for more background):
- Martin Odersky's Scala by example
- Scala crash course by Holden Karau
- Darren's brief introduction to scala and breeze for statistical computing
What is Scala?
"Scala smoothly integrates object-oriented and functional programming. It is designed to express common programming patterns in a concise, elegant, and type-safe way." by Martin Odersky.
- High-level language for the Java Virtual Machine (JVM)
- Object oriented + functional programming
- Statically typed
- Comparable in speed to Java
- Type inference saves us from having to write explicit types most of the time
- Interoperates with Java
- Can use any Java class (inherit from, etc.)
- Can be called from Java code
See a quick tour here:
Why Scala?
- Spark was originally written in Scala, which allows concise function syntax and interactive use
- Spark APIs for other languages include:
- Java API for standalone use
- Python API added to reach a wider user community of programmers
- R API added more recently to reach a wider community of data analysts
- Unfortunately, Python and R APIs are generally behind Spark's native Scala (e.g., GraphX is only available in Scala currently and Datasets are only available in Scala as of 2020-09-18).
- See Darren Wilkinson's 11 reasons for scala as a platform for statistical computing and data science. It is embedded in-place below for your convenience.
Learn Scala in Notebook Environment
Run a Scala Cell
- Run the following Scala cell.
- Note: No special indicator (such as %md) is needed to create a Scala cell in a Scala notebook.
- You know it is a Scala notebook because of the (Scala) appended to the name of this notebook.
- Make sure the cell contents update before moving on.
- Press Shift+Enter when in the cell to run it and proceed to the next cell.
- The cell's contents should update.
- Alternatively, press Ctrl+Enter when in a cell to run it, but not proceed to the next cell.
- Characters following // are comments in Scala.
1+1
res0: Int = 2
println(System.currentTimeMillis) // press Ctrl+Enter to evaluate println that prints its argument as a line
1610582084465
frameIt: (u: String, h: Int)String
Let's get our hands dirty in Scala
We will go through the following programming concepts and tasks by building on https://docs.scala-lang.org/tour/basics.html.
- Scala Types
- Expressions and Printing
- Naming and Assignments
- Functions and Methods in Scala
- Classes and Case Classes
- Methods and Tab-completion
- Objects and Traits
- Collections in Scala and Type Hierarchy
- Functional Programming and MapReduce
- Lazy Evaluations and Recursions
Remark: You need to take a computer science course (from Coursera, for example) to properly learn Scala. Here, we will learn to use Scala by example to accomplish our data science tasks at hand. You can learn more Scala as needed from the various sources pointed out above in Scala Resources.
Scala Types
In Scala, all values have a type, including numerical values and functions. The diagram below illustrates a subset of the type hierarchy.
For now, notice some common types we will be using, including Int, String, Double, Unit, Boolean, List, etc. For more details see https://docs.scala-lang.org/tour/unified-types.html. We will return to this at the end of the notebook after seeing a brief tour of Scala now.
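The claim that every value has a type can be sketched with a short example. The names mixedValues, describeType and typeNames are illustrative assumptions and not part of the notebook; the sketch only relies on Any being at the top of the type hierarchy.

```scala
// Any is the top of the type hierarchy, so values of different types
// can be stored together in one List[Any].
val mixedValues: List[Any] = List(1, "one", 1.0, true, ())

// Hypothetical helper: name the type of a value using pattern matching.
def describeType(value: Any): String = value match {
  case _: Int     => "Int"
  case _: String  => "String"
  case _: Double  => "Double"
  case _: Boolean => "Boolean"
  case ()         => "Unit"
  case _          => "something else"
}

// typeNames is List("Int", "String", "Double", "Boolean", "Unit")
val typeNames: List[String] = mixedValues.map(describeType)
```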
Expressions
Expressions are computable statements such as the 1+1 we have seen before.
1+1
res3: Int = 2
We can print the result of an evaluated expression as a line using println:
println(1+1) // printing 2
2
println("hej hej!") // printing a string
hej hej!
Naming and Assignments
value and variable as val and var
You can name the results of expressions using the keywords val and var.
Let us assign the integer value 5 to x as follows:
val x : Int = 5 // <Ctrl+Enter> to declare a value x to be integer 5.
x: Int = 5
x is a named result and it is a value since we used the keyword val when naming it.
Scala is statically typed, but it uses built-in type inference machinery to automatically figure out that x is an integer or Int type as follows. Let's declare a value x to be Int 5 next without explicitly using Int.
val x = 5 // <Ctrl+Enter> to declare a value x as Int 5 (type automatically inferred)
x: Int = 5
Let's declare x as a Double or double-precision floating-point type using a decimal such as 5.0 (a digit has to follow the decimal point!)
val x = 5.0 // <Ctrl+Enter> to declare a value x as Double 5
x: Double = 5.0
Alternatively, we can declare x as a Double explicitly. Note that the decimal point is not needed in this case due to explicit typing as Double.
val x : Double = 5 // <Ctrl+Enter> to declare a value x as Double 5 (type explicitly declared)
x: Double = 5.0
Next note that names need to be declared on first use. We have declared x to be a val, which is short for value. This makes x immutable (it cannot be changed).
Thus, x cannot simply be reassigned, as the following code illustrates with the resulting error: ... error: reassignment to val.
//x = 10 // uncomment and <Ctrl+Enter> to try to reassign val x to 10
Scala allows declaration of mutable variables as well using var, as follows:
var y = 2 // <Shift+Enter> to declare a variable y to be integer 2 and go to next cell
y: Int = 2
y = 3 // <Shift+Enter> to change the value of y to 3
y: Int = 3
y = y+1 // adds 1 to y
y: Int = 4
y += 2 // adds 2 to y
println(y) // the var y is 6 now
6
Blocks
You can combine expressions by surrounding them with { and }. We call this a block.
println({
  val x = 1+1
  x+2 // expression in last line is returned for the block
}) // prints 4
4
println({ val x=22; x+2})
24
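As a small sketch of how a block behaves as an expression with its own scope (the names outer, inner and blockResult are illustrative assumptions, not from the notebook):

```scala
val outer = 10

// A block is itself an expression: its value is the value of its last expression.
val blockResult = {
  val inner = outer * 2 // inner is visible only inside this block
  inner + 1             // this last expression is the value of the whole block
}
// blockResult is 21; referring to inner out here would not compile
```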
Functions
Functions are expressions that have parameters. A function takes arguments as input and returns the value of an expression as output.
A function can be nameless or anonymous and simply return an output from a given input. For example, the following anonymous function returns the square of the input integer.
(x: Int) => x*x
res11: Int => Int = line186c28489fff404184da2d59bd09a90463.$read$$Lambda$5065/1820207503@597d20b
On the left of => is a list of parameters with name and type. On the right is an expression involving the parameters.
You can also name functions:
val multiplyByItself = (x: Int) => x*x
multiplyByItself: Int => Int = line186c28489fff404184da2d59bd09a90465.$read$$Lambda$5067/2036039718@12f273c8
println(multiplyByItself(10))
100
A function can have no parameters:
val howManyAmI = () => 1
howManyAmI: () => Int = line186c28489fff404184da2d59bd09a90469.$read$$Lambda$5070/1826556511@56f9e3f2
println(howManyAmI()) // 1
1
A function can have more than one parameter:
val multiplyTheseTwoIntegers = (a: Int, b: Int) => a*b
multiplyTheseTwoIntegers: (Int, Int) => Int = line186c28489fff404184da2d59bd09a90473.$read$$Lambda$5071/161461748@62178cca
println(multiplyTheseTwoIntegers(2,4)) // 8
8
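Because functions are ordinary values, they can be passed to other functions, which is how they are mostly used with Spark and Scala collections. The following is a minimal sketch; the names squareOf, addOne, squares and addOneThenSquare are illustrative assumptions, while map and andThen are standard library methods.

```scala
val squareOf = (x: Int) => x * x
val addOne   = (x: Int) => x + 1

// Pass a named function to map, which applies it to every element.
val squares = List(1, 2, 3, 4).map(squareOf)          // List(1, 4, 9, 16)

// An anonymous function can be passed in the same position.
val squaresInline = List(1, 2, 3, 4).map(x => x * x)  // List(1, 4, 9, 16)

// Functions can be composed: first addOne, then squareOf.
val addOneThenSquare = addOne.andThen(squareOf)       // addOneThenSquare(2) is 9
```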
Methods
Methods are very similar to functions, but a few key differences exist.
Methods use the def keyword followed by a name, parameter list(s), a return type, and a body.
def square(x: Int): Int = x*x // <Shift+Enter> to define a method named square
square: (x: Int)Int
Note that the return type Int is specified after the parameter list and a colon (:).
square(5) // <Shift+Enter> to call this method on argument 5
res15: Int = 25
val y = 3 // <Shift+Enter> make val y as Int 3
y: Int = 3
square(y) // <Shift+Enter> to call the method on val y of the right argument type Int
res16: Int = 9
val x = 5.0 // let x be Double 5.0
x: Double = 5.0
//square(x) // <Shift+Enter> to call the function on val x of type Double will give type mismatch error
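One way around such a type mismatch is an explicit conversion. The sketch below uses illustrative names (squareInt, five, squared, truncated, widened) that are not part of the notebook; toInt is the standard conversion method on Double.

```scala
def squareInt(n: Int): Int = n * n

val five = 5.0                        // inferred as Double
val squared = squareInt(five.toInt)   // explicit Double -> Int conversion; 25
val truncated = 5.9.toInt             // toInt truncates toward zero; 5

// The opposite direction needs no conversion: an Int is widened to Double.
val widened: Double = squareInt(3)    // 9.0
```

The conversion is explicit because going from Double to Int can lose information, as the truncated example shows.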
def square(x: Int): Int = { // <Shift+Enter> to declare the method body in a block
  val answer = x*x
  answer // the last line of the block is returned
}
square: (x: Int)Int
square(5000) // <Shift+Enter> to call the function
res18: Int = 25000000
// <Shift+Enter> to define function with input and output type as String
def announceAndEmit(text: String): String = {
  println(text)
  text // the last line of the function block is returned
}
announceAndEmit: (text: String)String
Scala has a return keyword but it is rarely used as the expression in the last line of the multi-line block is the method's return value.
// <Ctrl+Enter> to call function which prints as line and returns as String
announceAndEmit("roger roger")
roger roger
res19: String = roger roger
A method can take multiple parameter lists:
def multiplyAndTranslate(x: Int, y: Int)(translateBy: Int): Int = (x * y) + translateBy
multiplyAndTranslate: (x: Int, y: Int)(translateBy: Int)Int
println(multiplyAndTranslate(2, 3)(4)) // (2*3)+4 = 10
10
A method can have no parameter lists at all:
def time: Long = System.currentTimeMillis
time: Long
println("Current time in milliseconds is " + time)
Current time in milliseconds is 1610582096790
println("Current time in milliseconds is " + time)
Current time in milliseconds is 1610582097046
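Two of the differences between methods and values can be sketched as follows. All names here (evaluations, freshCount, frozenCount, scaleAndShift, shiftSix) are illustrative assumptions. First, a parameterless def is re-evaluated on every use, which is why time printed two different values above, whereas a val is evaluated once. Second, a method with multiple parameter lists can be given only its first list to obtain a function of the remaining one.

```scala
var evaluations = 0

// The body of a def runs every time the name is used.
def freshCount: Int = { evaluations += 1; evaluations }

// A val is evaluated once, when it is defined.
val frozenCount: Int = freshCount   // 1, and it stays 1

def scaleAndShift(x: Int, y: Int)(shift: Int): Int = (x * y) + shift

// Supplying only the first parameter list gives a function of the second.
val shiftSix: Int => Int = scaleAndShift(2, 3)   // shiftSix(4) is 10
```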
Classes
The class keyword followed by the name and constructor parameters is used to define a class.
class Box(h: Int, w: Int, d: Int) {
  def printVolume(): Unit = println(h*w*d)
}
defined class Box
- The return type of the method printVolume is Unit.
- When the return type is Unit it indicates that there is nothing meaningful to return, similar to void in Java and C, but with a difference.
- Because every Scala expression must have some value, there is actually a singleton value of type Unit, written () and carrying no information.
We can make an instance of the class with the new keyword.
val my1Cube = new Box(1,1,1)
my1Cube: Box = line186c28489fff404184da2d59bd09a904107.$read$Box@6c4cbb75
And call the method on the instance.
my1Cube.printVolume() // 1
1
The name my1Cube is declared with val, so it cannot be reassigned to another instance of the Box class.
A name declared with var can be reassigned to a different instance of the class.
var myVaryingCuboid = new Box(1,3,2)
myVaryingCuboid: Box = line186c28489fff404184da2d59bd09a904107.$read$Box@77404a48
myVaryingCuboid.printVolume()
6
myVaryingCuboid = new Box(1,1,1)
myVaryingCuboid: Box = line186c28489fff404184da2d59bd09a904107.$read$Box@748cdfd1
myVaryingCuboid.printVolume()
1
See https://docs.scala-lang.org/tour/classes.html for more details as needed.
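The constructor parameters of Box are not visible from outside the class, and printVolume only prints. As a sketch of a variation (Cuboid, volume, unitCube and slab are hypothetical names, not from the notebook), marking the parameters with val exposes them as read-only fields, and a method can return the volume instead of printing it.

```scala
// val in the parameter list makes h, w and d publicly readable fields.
class Cuboid(val h: Int, val w: Int, val d: Int) {
  def volume: Int = h * w * d   // returns the volume rather than printing it
}

val unitCube = new Cuboid(1, 1, 1)
val slab = new Cuboid(1, 3, 2)
// slab.volume is 6 and slab.w is 3
```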
Case Classes
Scala has a special type of class called a case class that can be defined with the case class keywords.
Unlike classes, whose instances are compared by reference, instances of case classes are immutable by default and compared by value. This makes them useful for defining rows of typed values in Spark.
case class Point(x: Int, y: Int, z: Int)
defined class Point
Case classes can be instantiated without the new keyword.
val point = Point(1, 2, 3)
val anotherPoint = Point(1, 2, 3)
val yetAnotherPoint = Point(2, 2, 2)
point: Point = Point(1,2,3)
anotherPoint: Point = Point(1,2,3)
yetAnotherPoint: Point = Point(2,2,2)
Instances of case classes are compared by value and not by reference.
if (point == anotherPoint) {
  println(point + " and " + anotherPoint + " are the same.")
} else {
  println(point + " and " + anotherPoint + " are different.")
} // Point(1,2,3) and Point(1,2,3) are the same.
if (point == yetAnotherPoint) {
  println(point + " and " + yetAnotherPoint + " are the same.")
} else {
  println(point + " and " + yetAnotherPoint + " are different.")
} // Point(1,2,3) and Point(2,2,2) are different.
Point(1,2,3) and Point(1,2,3) are the same.
Point(1,2,3) and Point(2,2,2) are different.
By contrast, instances of classes are compared by reference.
myVaryingCuboid.printVolume() // should be 1 x 1 x 1
1
my1Cube.printVolume() // should be 1 x 1 x 1
1
if (myVaryingCuboid == my1Cube) {
  println("myVaryingCuboid and my1Cube are the same.")
} else {
  println("myVaryingCuboid and my1Cube are different.")
} // they are compared by reference and are not the same.
myVaryingCuboid and my1Cube are different.
More about case classes here: https://docs.scala-lang.org/tour/case-classes.html.
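Two other conveniences of case classes are worth a quick sketch: copy, which builds a modified instance without changing the original, and pattern matching on the fields. The names Point3, origin, lifted and describe are illustrative assumptions, not part of the notebook.

```scala
case class Point3(x: Int, y: Int, z: Int)

val origin = Point3(0, 0, 0)
val lifted = origin.copy(z = 5)   // Point3(0,0,5); origin itself is unchanged

// Pattern matching can take a case class apart by its fields.
def describe(p: Point3): String = p match {
  case Point3(0, 0, 0) => "origin"
  case Point3(x, y, 0) => "in the xy-plane at (" + x + ", " + y + ")"
  case Point3(_, _, z) => "at height " + z
}
// describe(origin) is "origin" and describe(lifted) is "at height 5"
```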
Methods and Tab-completion
Many methods of a class can be accessed by placing a dot (.) after an instance.
val s = "hi" // <Ctrl+Enter> to declare val s to String "hi"
s: String = hi
You can place the cursor after the . following a declared object and find out the methods available for it as shown in the image below.
Try doing this next.
//s. // place cursor after the '.' and press Tab to see all available methods for s
For example,
- scroll down to contains and double-click on it.
- This should lead to s.contains in your cell.
- Now add an argument String to see if s contains the argument, for example, try:
- s.contains("f")
- s.contains("")
- and s.contains("i")
//s // <Shift-Enter> recall the value of String s
s.contains("f") // <Shift-Enter> returns Boolean false since s does not contain the string "f"
res32: Boolean = false
s.contains("") // <Shift-Enter> returns Boolean true since s contains the empty string ""
res33: Boolean = true
s.contains("i") // <Ctrl+Enter> returns Boolean true since s contains the string "i"
res34: Boolean = true
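A few other String methods that tab-completion will typically show can be tried in the same way. This is only a sketch; the value names below are illustrative assumptions, while the methods themselves are standard.

```scala
val greeting = "hi"

val shout = greeting.toUpperCase            // "HI"
val size = greeting.length                  // 2
val startsWithH = greeting.startsWith("h")  // true
val repeated = greeting * 3                 // "hihihi"
```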
Objects
Objects are single instances of their own definitions, declared using the object keyword. You can think of them as singletons of their own classes.
object IdGenerator {
  private var currentId = 0
  def make(): Int = {
    currentId += 1
    currentId
  }
}
defined object IdGenerator
You can access an object through its name:
val newId: Int = IdGenerator.make()
val newerId: Int = IdGenerator.make()
newId: Int = 1
newerId: Int = 2
println(newId) // 1
println(newerId) // 2
1
2
For details see https://docs.scala-lang.org/tour/singleton-objects.html
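To see what being a singleton means in practice, the sketch below gives the same object a second name and shows that the state is shared. Tally, alias, first and second are hypothetical names used only for illustration.

```scala
object Tally {
  private var count = 0
  def increment(): Int = { count += 1; count }
  def current: Int = count
}

val first = Tally.increment()    // 1
val alias = Tally                // a second name for the same single instance
val second = alias.increment()   // 2, because alias and Tally share state
```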
Traits
Traits are abstract data types containing certain fields and methods. They can be defined using the trait keyword.
In Scala inheritance, a class can only extend one other class, but it can extend multiple traits.
trait Greeter {
  def greet(name: String): Unit
}
defined trait Greeter
Traits can have default implementations also.
trait Greeter {
  def greet(name: String): Unit =
    println("Hello, " + name + "!")
}
defined trait Greeter
You can extend traits with the extends keyword and override an implementation with the override keyword:
class DefaultGreeter extends Greeter
class SwedishGreeter extends Greeter {
  override def greet(name: String): Unit = {
    println("Hej hej, " + name + "!")
  }
}
class CustomizableGreeter(prefix: String, postfix: String) extends Greeter {
  override def greet(name: String): Unit = {
    println(prefix + name + postfix)
  }
}
defined class DefaultGreeter
defined class SwedishGreeter
defined class CustomizableGreeter
Instantiate the classes.
val greeter = new DefaultGreeter()
val swedishGreeter = new SwedishGreeter()
val customGreeter = new CustomizableGreeter("How are you, ", "?")
greeter: DefaultGreeter = line186c28489fff404184da2d59bd09a904155.$read$DefaultGreeter@5d7c7786
swedishGreeter: SwedishGreeter = line186c28489fff404184da2d59bd09a904155.$read$SwedishGreeter@a1c1128
customGreeter: CustomizableGreeter = line186c28489fff404184da2d59bd09a904155.$read$CustomizableGreeter@7c2dc867
Call the greet method in each case.
greeter.greet("Scala developer") // Hello, Scala developer!
swedishGreeter.greet("Scala developer") // Hej hej, Scala developer!
customGreeter.greet("Scala developer") // How are you, Scala developer?
Hello, Scala developer!
Hej hej, Scala developer!
How are you, Scala developer?
A class can also be made to extend multiple traits.
For more details see: https://docs.scala-lang.org/tour/traits.html.
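A minimal sketch of a class extending several traits follows. The traits Named, Polite and Loud and the class Announcer are hypothetical names chosen for illustration; the first trait is mixed in with extends and each further one with the with keyword.

```scala
trait Named {
  def name: String   // abstract: must be supplied by the class
}

trait Polite {
  def greetPolitely(who: String): String = "Good day, " + who + "."
}

trait Loud {
  def shout(text: String): String = text.toUpperCase + "!"
}

// The constructor parameter val name implements the abstract member of Named.
class Announcer(val name: String) extends Named with Polite with Loud

val announcer = new Announcer("Ada")
// announcer.greetPolitely("Scala developer") is "Good day, Scala developer."
// announcer.shout("hej") is "HEJ!"
```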
Main Method
The main method is the entry point of a Scala program.
The Java Virtual Machine requires a main method, named main, that takes an array of strings as its only argument.
Using an object, you can define the main method as follows:
object Main {
  def main(args: Array[String]): Unit =
    println("Hello, Scala developer!")
}
defined object Main
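The args array holds the command-line arguments passed to the program. As a sketch of how they might be used (EchoArgs and its render helper are hypothetical names, not from the notebook), the formatting is kept in a separate method so that it can be tried in a notebook cell without launching a program:

```scala
object EchoArgs {
  // Hypothetical helper: turn the argument array into one line of text.
  def render(args: Array[String]): String =
    if (args.isEmpty) "No arguments given."
    else args.mkString("Arguments: ", ", ", "")

  // Entry point: prints whatever arguments the program was started with.
  def main(args: Array[String]): Unit =
    println(render(args))
}
```

In a notebook cell, EchoArgs.main(Array("a", "b")) can be called directly like any other method.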
What do I try not to do while learning a new language?
1. I don't immediately try to ask questions like: how can I do this particular variation of some small thing I just learned so I can use patterns I am used to from another language I am hooked on right now?
- First go through the detailed Scala Tour on your own and then through the 50-odd lessons in the Scala Book.
- Then return to 1. and ask detailed cross-language comparison questions by diving deep as needed with the source and Scala docs (Google or DuckDuckGo search!).