003_scalaCrashCourse(Scala)

Please go here for a relaxed and detailed-enough tour (later):

Notebooks

Write Spark code for processing your data in notebooks.

NOTE: You should have already cloned this notebook and attached it to a cluster that you started in the Community Edition of databricks by now.

Notebooks can be written in Python, Scala, R, or SQL.

  • This is a Scala notebook - which is indicated next to the title above by (Scala).

Creating a new Notebook

Change Name

  • Click the tiangle on the right side of a folder to open the folder menu.
  • Select Create > Notebook.
  • Enter the name of the notebook, the language (Python, Scala, R or SQL) for the notebook, and a cluster to run it on.

Cloning a Notebook

  • You can clone a notebook to create a copy of it, for example if you want to edit or run an Example notebook like this one.
  • Click File > Clone in the notebook context bar above.
  • Enter a new name and location for your notebook. If Access Control is enabled, you can only clone to folders that you have Manage permissions on.

Introduction to Scala through Scala Notebook

  • This introduction notebook describes how to get started running Scala code in Notebooks.

Clone Or Import This Notebook

  • From the File menu at the top left of this notebook, choose Clone or click Import Notebook on the top right. This will allow you to interactively execute code cells as you proceed through the notebook.

Menu Bar Clone Notebook

  • Enter a name and a desired location for your cloned notebook (i.e. Perhaps clone to your own user directory or the "Shared" directory.)
  • Navigate to the location you selected (e.g. click Menu > Workspace > Your cloned location)

Attach the Notebook to a cluster

  • A Cluster is a group of machines which can run commands in cells.
  • Check the upper left corner of your notebook to see if it is Attached or Detached.
  • If Detached, click on the right arrow and select a cluster to attach your notebook to.

Attach Notebook


Quick Note Cells are units that make up notebooks

A Cell

Cells each have a type - including scala, python, sql, R, markdown, filesystem, and shell.

  • While cells default to the type of the Notebook, other cell types are supported as well.
  • This cell is in markdown and is used for documentation. Markdown is a simple text formatting syntax.


Create and Edit a New Markdown Cell in this Notebook

  • When you mouse between cells, a + sign will pop up in the center that you can click on to create a new cell.

    New Cell

  • Type %md Hello, world! into your new cell (%md indicates the cell is markdown).
  • Click out of the cell to see the cell contents update.

    Run cell


Hello, world!

Running a cell in your notebook.

  • Press Shift+Enter when in the cell to run it and proceed to the next cell.

    • The cells contents should update. Run cell
  • NOTE: Cells are not automatically run each time you open it.
    • Instead, Previous results from running a cell are saved and displayed.
  • Alternately, press Ctrl+Enter when in a cell to run it, but not proceed to the next cell.

You Try Now! Just double-click the cell below, modify the text following %md and press Ctrl+Enter to evaluate it and see it's mark-down'd output.

> %md Hello, world!

Hello, world!


Quick Note Markdown Cell Tips

  • To change a non-markdown cell to markdown, add %md to very start of the cell.
  • After updating the contents of a markdown cell, click out of the cell to update the formatted contents of a markdown cell.
  • To edit an existing markdown cell, doubleclick the cell.

Learn more about markdown:

Note that there are flavours or minor variants and enhancements of markdown, including those specific to databricks, github, pandoc, etc.

It will be future-proof to remain in the syntactic zone of pure markdown (at the intersection of various flavours) as much as possible and go with pandoc-compatible style if choices are necessary.



Run a Scala Cell

  • Run the following scala cell.
  • Note: There is no need for any special indicator (such as %md) necessary to create a Scala cell in a Scala notebook.
  • You know it is a scala notebook because of the (Scala) appended to the name of this notebook.
  • Make sure the cell contents updates before moving on.
  • Press Shift+Enter when in the cell to run it and proceed to the next cell.
    • The cells contents should update.
    • Alternately, press Ctrl+Enter when in a cell to run it, but not proceed to the next cell.
  • characters following // are comments in scala.

1+1
res0: Int = 2
println(System.currentTimeMillis) // press Ctrl+Enter to evaluate println that prints its argument as a line
1565169638781

Scala Resources

You will not be learning scala systematically and thoroughly in this course. You will learn to use Scala by doing various Spark jobs.

If you are seriously interested in learning scala properly, then there are various resources, including:

The main sources for the following content are (you are encouraged to read them for more background):

Introduction to Scala

What is Scala?

"Scala smoothly integrates object-oriented and functional programming. It is designed to express common programming patterns in a concise, elegant, and type-safe way." by Matrin Odersky.

  • High-level language for the Java Virtual Machine (JVM)
  • Object oriented + functional programming
  • Statically typed
  • Comparable in speed to Java
  • Type inference saves us from having to write explicit types most of the time Interoperates with Java
  • Can use any Java class (inherit from, etc.)
  • Can be called from Java code

Why Scala?

  • Spark was originally written in Scala, which allows concise function syntax and interactive use
  • Spark APIs for other languages include:
    • Java API for standalone use
    • Python API added to reach a wider user community of programmes
    • R API added more recently to reach a wider community of data analyststs
    • Unfortunately, Python and R APIs are generally behind Spark's native Scala (for eg. GraphX is only available in Scala currently).
  • See Darren Wilkinson's 11 reasons for scala as a platform for statistical computing and data science. It is embedded in-place below for your convenience.
Show codeShow result
Show code

Let's get our hands dirty in Scala

We will go through the following programming concepts and tasks:

  • Assignments
  • Methods and Tab-completion
  • Functions in Scala
  • Collections in Scala
  • Scala Closures for Functional Programming and MapReduce

Remark: You need to take a computer science course (from CourseEra, for example) to properly learn Scala. Here, we will learn to use Scala by example to accomplish our data science tasks at hand.

Assignments

value and variable as val and var

Let us assign the integer value 5 to x as follows:

val x : Int = 5 // <Ctrl+Enter> to declare a value x to be integer 5
x: Int = 5

Scala is statically typed, but it uses built-in type inference machinery to automatically figure out that x is an integer or Int type as follows. Let's declare a value x to be Int 5 next without explictly using Int.

val x = 5    // <Ctrl+Enter> to declare a value x as Int 5 (type automatically inferred)
x: Int = 5

Let's declare x as a Double or double-precision floating-point type using decimal such as 5.0 (a digit has to follow the decimal point!)

val x = 5.0   // <Ctrl+Enter> to declare a value x as Double 5
x: Double = 5.0

Alternatively, we can assign x as a Double explicitly. Note that the decimal point is not needed in this case due to explicit typing as Double.

val x :  Double = 5    // <Ctrl+Enter> to declare a value x as Double 5 (type automatically inferred)
x: Double = 5.0

Next note that labels need to be declared on first use. We have declared x to be a val which is short for value. This makes x immutable (cannot be changed).

Thus, x cannot be just re-assigned, as the following code illustrates in the resulting error: ... error: reassignment to val.

//x = 10    //  uncomment and <Ctrl+Enter> to try to reassign val x to 10

Scala allows declaration of mutable variables as well using var, as follows:

var y = 2    // <Shift+Enter> to declare a variable y to be integer 2 and go to next cell
y: Int = 2
y = 3    // <Shift+Enter> to change the value of y to 3
y: Int = 3

Methods and Tab-completion

val s  = "hi"    // <Ctrl+Enter> to declare val s to String "hi"
s: String = hi

You can place the cursor after . following a declared object and find out the methods available for it as shown in the image below.

tabCompletionAfterSDot PNG image

You Try doing this next.

//s.  // place cursor after the '.' and press Tab to see all available methods for s 

For example,

  • scroll down to contains and double-click on it.
  • This should lead to s.contains in your cell.
  • Now add an argument String to see if s contains the argument, for example, try:
    • s.contains("f")
    • s.contains("") and
    • s.contains("i")
s    // <Shift-Enter> recall the value of String s
res3: String = hi
s.contains("f")     // <Shift-Enter> returns Boolean false since s does not contain the string "f"
res5: Boolean = false
s.contains("")    // <Shift-Enter> returns Boolean true since s contains the empty string ""
res6: Boolean = true
s.contains("i")    // <Ctrl+Enter> returns Boolean true since s contains the string "i"
res7: Boolean = true

Functions

def square(x: Int): Int = x*x    // <Shitf+Enter> to define a function named square
square: (x: Int)Int
square(5)    // <Shitf+Enter> to call this function on argument 5
res8: Int = 25
y    // <Shitf+Enter> to recall that val y is Int 3
res9: Int = 3