ScaDaMaLe Course site and book

Please go here for a relaxed and detailed-enough tour (later):

Multi-lingual Notebooks

Write Spark code for processing your data in notebooks.

Note that there are several open-sourced notebook servers, including Jupyter and Apache Zeppelin.

Here, we mainly focus on databricks notebooks, due to their efficiently managed engineering layers over public clouds such as AWS and Azure.

NOTE: By now, you should have already cloned this notebook and attached it to a cluster that you started in the Community Edition of databricks.

Databricks Notebook

Next we delve into the mechanics of working with databricks notebooks. But many of the details also apply to other notebook environments with minor differences.

Notebooks can be written in Python, Scala, R, or SQL.

  • This is a Scala notebook - which is indicated next to the title above by (Scala).
  • One can choose the default language of the notebook when it is created.

Creating a new Notebook

  • Click the triangle on the right side of a folder to open the folder menu.
  • Select Create > Notebook.
  • Enter the name of the notebook, the language (Python, Scala, R or SQL) for the notebook, and a cluster to run it on.

Cloning a Notebook

  • You can clone a notebook to create a copy of it, for example if you want to edit or run an Example notebook like this one.
  • Click File > Clone in the notebook context bar above.
  • Enter a new name and location for your notebook. If Access Control is enabled, you can only clone to folders that you have Manage permissions on.

Clone Or Import This Notebook

  • From the File menu at the top left of this notebook, choose Clone or click Import Notebook on the top right. This will allow you to interactively execute code cells as you proceed through the notebook.

  • Enter a name and a desired location for your cloned notebook (e.g., clone to your own user directory or the "Shared" directory).
  • Navigate to the location you selected (e.g., click Menu > Workspace > Your cloned location).

Attach the Notebook to a cluster

  • A Cluster is a group of machines which can run commands in cells.
  • Check the upper left corner of your notebook to see if it is Attached or Detached.
  • If Detached, click on the right arrow and select a cluster to attach your notebook to.


Deep-dive into databricks notebooks

Let's take a deeper dive into a databricks notebook next.


Quick Note: Cells are the units that make up notebooks

Cells each have a type - including scala, python, sql, R, markdown, filesystem, and shell.

  • While cells default to the type of the Notebook, other cell types are supported as well.
  • This cell is in markdown and is used for documentation. Markdown is a simple text formatting syntax.


Create and Edit a New Markdown Cell in this Notebook

  • When you mouse between cells, a + sign will pop up in the center that you can click on to create a new cell.

  • Type %md Hello, world! into your new cell (%md indicates the cell is markdown).

  • Click out of the cell to see the cell contents update.



Hello, world!

Running a cell in your notebook.

You Try Now! Just double-click the cell below, modify the text following %md, and press Ctrl+Enter to evaluate it and see its rendered markdown output.

> %md Hello, world!

Hello, world!


Quick Note: Markdown Cell Tips

  • To change a non-markdown cell to markdown, add %md to the very start of the cell.
  • After updating the contents of a markdown cell, click out of the cell to render its formatted contents.
  • To edit an existing markdown cell, double-click the cell.

Learn more about markdown:

Note that there are flavours or minor variants and enhancements of markdown, including those specific to databricks, github, pandoc, etc.

It will be future-proof to remain within the syntactic zone of pure markdown (the intersection of the various flavours) as much as possible, and to go with pandoc-compatible style when choices are necessary.


Run a Scala Cell

  • Run the following Scala cell.
  • Note: no special indicator (such as %md) is necessary to create a Scala cell in a Scala notebook.
  • You know it is a Scala notebook because of the (Scala) appended to the name of this notebook.
  • Make sure the cell contents update before moving on.
  • Press Shift+Enter when in the cell to run it and proceed to the next cell.
    • The cell's contents should update.
    • Alternately, press Ctrl+Enter when in a cell to run it, but not proceed to the next cell.
  • Characters following // are comments in Scala.
1+1
res0: Int = 2
println(System.currentTimeMillis) // press Ctrl+Enter to evaluate println that prints its argument as a line
1610582284328
1+1
res2: Int = 2

Spark is written in Scala, but ...

For this reason, Scala will be the primary language of this course.

However, let us use the best language for the job, as each cell in the same notebook can be written in its own language. Such multi-lingual notebooks are the norm in any realistic data science process today!

The beginning of each cell carries a language type if it is not the default language of the notebook. Such cell-specific language types include the following, with the prefix %:

  • %scala for Scala,

  • %py for Python,

  • %r for R,

  • %sql for SQL,

  • %fs for databricks' filesystem,

  • %sh for the Bash shell, and

  • %md for markdown.

  • While cells default to the language type of the Notebook (scala, python, r or sql), other cell types are supported as well in a cell-specific manner.

  • For example, Python notebooks can contain python, sql, markdown, and even scala cells. This lets you write notebooks that use multiple languages.

  • This cell is in markdown, as it begins with %md, and is used for documentation purposes.

Thus, all language-typed cells can be created in any notebook, regardless of the default language of the notebook itself.

Cross-language cells can be used to mix commands from other languages.

Examples:

print("For example, this is a scala notebook, but we can use %py to run python commands inline.")
For example, this is a scala notebook, but we can use %py to run python commands inline.
print("We can also access other languages such as R.")
// you can be explicit about the language even if the notebook's default language is the same
println("We can access Scala like this.")
We can access Scala like this.
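Along the same lines, a SQL cell can be mixed into this Scala notebook with %sql. A minimal sketch (the query below is an illustrative expression, not one of the notebook's original examples):

```
%sql
-- a SQL cell in a Scala notebook; the expression is evaluated by Spark SQL
SELECT 1 + 1 AS two
```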

Command line cells can be used to work with local files on the Spark driver node.

  • Start a cell with %sh to run a command-line command

# This is a command line cell. Commands you write here will be executed as if they were run on the command line.
# For example, in this cell we list the files in the driver's working directory and then print the current user.
ls
conf
derby.log
eventlogs
ganglia
logs
whoami
root

Filesystem cells allow access to the Databricks File System (DBFS).

  • Start a cell with %fs to run DBFS commands
  • Type %fs help for a list of commands
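For instance, a filesystem cell can list a DBFS directory. A minimal sketch, assuming the standard /databricks-datasets mount is available in your workspace:

```
%fs
ls /databricks-datasets
```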

Notebooks can be run from other notebooks using %run

  • Syntax: %run /full/path/to/notebook
  • This is commonly used to import functions you defined in other notebooks.
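For example, if you had defined utility functions in another notebook, you could load them into this one as follows (the path below is hypothetical; replace it with the full workspace path to your own notebook):

```
%run /Users/your.name@example.com/myUtilityFunctions
```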

Further Pointers

Here are some useful links to bookmark, as you will need them for reference.

These links provide a relaxed and detailed-enough tour (that you are strongly encouraged to take later):