%md
# Notebooks

Write Spark code for processing your data in notebooks.

**NOTE**: By now you should have cloned this notebook and attached it to a cluster that you started in the Community Edition of Databricks.
%md
### Notebooks can be written in **Python**, **Scala**, **R**, or **SQL**.

* This is a Scala notebook, which is indicated next to the title above by ``(Scala)``.
%md
### **Creating a new Notebook**

* Click the triangle on the right side of a folder to open the folder menu.
* Select **Create > Notebook**.
* Enter the name of the notebook, the language (Python, Scala, R or SQL) for the notebook, and a cluster to run it on.
%md
### Cloning a Notebook

* You can clone a notebook to create a copy of it, for example if you want to edit or run an Example notebook like this one.
* Click **File > Clone** in the notebook context bar above.
* Enter a new name and location for your notebook. If Access Control is enabled, you can only clone to folders that you have Manage permissions on.
%md
### Clone Or Import This Notebook

* From the **File** menu at the top left of this notebook, choose **Clone** or click **Import Notebook** on the top right. This will allow you to interactively execute code cells as you proceed through the notebook.
* Enter a name and a desired location for your cloned notebook (for example, your own user directory or the "Shared" directory).
* Navigate to the location you selected (e.g. click Menu > Workspace > `Your cloned location`).
%md
### **Attach** the Notebook to a **cluster**

* A **Cluster** is a group of machines which can run commands in cells.
* Check the upper left corner of your notebook to see if it is **Attached** or **Detached**.
* If **Detached**, click on the right arrow and select a cluster to attach your notebook to.
  * If there is no running cluster, create one as described in the [Welcome to Databricks](/#workspace/databricks_guide/00 Welcome to Databricks) guide.
%md
***
#### **Cells** are units that make up notebooks

Cells each have a type, including **scala**, **python**, **sql**, **R**, **markdown**, **filesystem**, and **shell**.

* While cells default to the type of the Notebook, other cell types are supported as well.
* This cell is in **markdown** and is used for documentation. [Markdown](http://en.wikipedia.org/wiki/Markdown) is a simple text formatting syntax.

***
%md
***
### **Create** and **Edit** a New Markdown Cell in this Notebook

* When you mouse between cells, a + sign will pop up in the center that you can click on to create a new cell.
* Type **``%md Hello, world!``** into your new cell (**``%md``** indicates the cell is markdown).
* Click out of the cell to see the cell contents update.

***
%md
### **Running a cell in your notebook.**

* #### Press **Shift+Enter** when in the cell to **run** it and proceed to the next cell.
  * The cell's contents should update.
  * **NOTE:** Cells are not automatically re-run each time you open the notebook.
    * Instead, previous results from running a cell are saved and displayed.
* #### Alternatively, press **Ctrl+Enter** when in a cell to **run** it, but not proceed to the next cell.
%md
**You Try Now!**

Just double-click the cell below, modify the text following ``%md`` and press **Ctrl+Enter** to evaluate it and see its rendered markdown output.

```
> %md Hello, world!
```
%md
***
#### **Markdown Cell Tips**

* To change a non-markdown cell to markdown, add **%md** to the very start of the cell.
* After updating the contents of a markdown cell, click out of the cell to update its formatted contents.
* To edit an existing markdown cell, **double-click** the cell.

Learn more about markdown:

* [https://guides.github.com/features/mastering-markdown/](https://guides.github.com/features/mastering-markdown/)

Note that there are flavours, or minor variants and enhancements, of markdown, including those specific to Databricks, GitHub, [pandoc](https://pandoc.org/MANUAL.html), etc.

To stay future-proof, remain in the syntactic zone of *pure markdown* (at the intersection of the various flavours) as much as possible, and go with [pandoc](https://pandoc.org/MANUAL.html)-compatible style if choices are necessary.

***
%md
***
### Run a **Scala Cell**

* Run the following Scala cell.
  * Note: No special indicator (such as ``%md``) is needed to create a Scala cell in a Scala notebook.
  * You know it is a Scala notebook because of the ``(Scala)`` appended to the name of this notebook.
* Make sure the cell contents update before moving on.
  * Press **Shift+Enter** when in the cell to run it and proceed to the next cell.
    * The cell's contents should update.
  * Alternatively, press **Ctrl+Enter** when in a cell to **run** it, but not proceed to the next cell.
* Characters following ``//`` are comments in Scala.

***
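%md
A minimal sketch of a Scala cell to practise on, in case you want something to paste into a new cell. The name `greeting` and its text are hypothetical and are not taken from the original notebook.

```scala
// Characters after // are comments and are ignored by Scala.
val greeting = "Hello from a Scala cell" // hypothetical value, for illustration only
println(greeting)                        // println prints its argument as a line
```

Press **Ctrl+Enter** to run the cell and stay in it, or **Shift+Enter** to run it and move on to the next cell.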
%md
## Scala Resources

You will not be learning Scala systematically and thoroughly in this course. You will learn *to use* Scala by doing various Spark jobs.

If you are seriously interested in learning Scala properly, then there are various resources, including:

* [scala-lang.org](http://www.scala-lang.org/) is the **core Scala resource**.
  * [tour-of-scala](http://docs.scala-lang.org/tutorials/tour/tour-of-scala)
* MOOC
  * [Coursera: Functional Programming Principles in Scala](https://www.coursera.org/course/progfun)
* [Books](http://www.scala-lang.org/documentation/books.html)
  * [Programming in Scala, 1st Edition, Free Online Reading](http://www.artima.com/pins1ed/)

The main sources for the following content are (you are encouraged to read them for more background):

* [Martin Odersky's Scala by example](http://www.scala-lang.org/docu/files/ScalaByExample.pdf)
* [Scala crash course by Holden Karau](http://lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course.pdf)
* [Darren's brief introduction to scala and breeze for statistical computing](https://darrenjw.wordpress.com/2013/12/30/brief-introduction-to-scala-and-breeze-for-statistical-computing/)
%md
# Introduction to Scala

## What is Scala?

"Scala smoothly integrates object-oriented and functional programming. It is designed to express common programming patterns in a concise, elegant, and type-safe way." by Martin Odersky.

* High-level language for the Java Virtual Machine (JVM)
* Object oriented + functional programming
* Statically typed
* Comparable in speed to Java
* Type inference saves us from having to write explicit types most of the time
* Interoperates with Java
  * Can use any Java class (inherit from, etc.)
  * Can be called from Java code

## Why Scala?

* Spark was originally written in Scala, which allows concise function syntax and interactive use
* Spark APIs for other languages include:
  * Java API for standalone use
  * Python API added to reach a wider user community of programmers
  * R API added more recently to reach a wider community of data analysts
  * Unfortunately, the Python and R APIs generally lag behind Spark's native Scala API (for example, GraphX is only available in Scala currently).
* See Darren Wilkinson's 11 reasons for [scala as a platform for statistical computing and data science](https://darrenjw.wordpress.com/2013/12/23/scala-as-a-platform-for-statistical-computing-and-data-science/). It is embedded in-place below for your convenience.
//%run "/scalable-data-science/xtraResources/support/sdsFunctions"
// This allows easy embedding of publicly available information into any other notebook.
// When viewing in git-book just ignore this block - you may have to manually chase the URL in frameIt("URL").
// Example usage:
// displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Topics_in_LDA",250))
def frameIt(u: String, h: Int): String = {
  """<iframe src="""" + u + """" width="95%" height="""" + h + """" sandbox>
  <p>
    <a href="http://spark.apache.org/docs/latest/index.html">
      Fallback link for browsers that, unlikely, don't support frames
    </a>
  </p>
</iframe>"""
}
displayHTML(frameIt("https://darrenjw.wordpress.com/2013/12/23/scala-as-a-platform-for-statistical-computing-and-data-science/",500))
%md
# Let's get our hands dirty in Scala

We will go through the following programming concepts and tasks:

* Assignments
* Methods and Tab-completion
* Functions in Scala
* Collections in Scala
* Scala Closures for Functional Programming and MapReduce

**Remark**: You need to take a computer science course (from Coursera, for example) to properly learn Scala. Here, we will learn to use Scala by example to accomplish our data science tasks at hand.
%md
Scala is statically typed, but it uses built-in type inference machinery to automatically figure out that ``x`` is an integer or ``Int`` type as follows.

Let's declare a value ``x`` to be ``Int`` 5 next without explicitly using ``Int``.
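%md
A minimal sketch of the declaration described above, which you could paste into a Scala cell. No type annotation is written, yet the compiler infers ``Int`` from the literal ``5``.

```scala
// No explicit type: Scala infers that x is an Int from the integer literal.
val x = 5
```

When the cell is run, the notebook reports the inferred type alongside the value.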
%md
Let's declare ``x`` as a ``Double`` or double-precision floating-point type using a decimal such as ``5.0`` (a digit has to follow the decimal point!)
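%md
As a sketch of the same idea with a decimal literal, the following declaration is inferred as ``Double`` rather than ``Int``:

```scala
// The decimal point (followed by a digit) makes the literal a Double.
val x = 5.0
```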
%md
Alternatively, we can assign ``x`` as a ``Double`` explicitly. Note that the decimal point is not needed in this case due to explicit typing as ``Double``.
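%md
A short sketch of the explicit form: the type annotation ``: Double`` tells the compiler to widen the integer literal, so the stored value is ``5.0``.

```scala
// Explicit type annotation: the Int literal 5 is widened to the Double 5.0.
val x: Double = 5
```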
%md
Next, note that labels need to be declared on first use. We have declared `x` to be a `val`, which is short for *value*. This makes `x` immutable (cannot be changed).

Thus, `x` cannot simply be re-assigned, as the following code illustrates in the resulting error: `... error: reassignment to val`.
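%md
A minimal sketch of this behaviour. The re-assignment is left commented out so that the cell still runs; uncomment it to see the compile error. The name `y` is hypothetical and only illustrates binding a new result to a new label instead of re-assigning.

```scala
val x = 5      // x is an immutable value
// x = 10      // uncommenting this line fails to compile: error: reassignment to val
val y = x + 5  // bind the new result to a new label instead of changing x
```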
%md
You can place the cursor after ``.`` following a declared object and find out the methods available for it as shown in the image below.

**You Try** doing this next.
%md
For example,

* scroll down to ``contains`` and double-click on it.
* This should lead to ``s.contains`` in your cell.
* Now add an argument String to see if ``s`` contains the argument, for example, try:
  * ``s.contains("f")``
  * ``s.contains("")`` and
  * ``s.contains("i")``
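%md
The three calls above can be sketched as follows. The declaration of `s` is not shown in the text, so the value `"hi"` is an assumption made only for illustration; the result names are hypothetical too.

```scala
val s = "hi"                      // assumed value of s, for illustration only

val hasF     = s.contains("f")    // false: "hi" has no "f"
val hasEmpty = s.contains("")     // true: every string contains the empty string
val hasI     = s.contains("i")    // true: "hi" contains "i"
```

With a different `s`, the first and last results may of course differ, while `s.contains("")` stays `true` for any string.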
SDS-2.x, Scalable Data Engineering Science