This notebook is from the Databricks documentation: https://docs.databricks.com/_static/notebooks/package-cells.html
As you know, we eventually need to package and deploy our models, say using sbt.
However, it is nice to be in a notebook environment to prototype, build intuition, and create better pipelines.
Using package cells we can get the best of both worlds, to an extent.
NOTE: This is not applicable to Zeppelin notes.
Package Cells
Package cells are special cells that get compiled when executed. These cells have no visibility with respect to the rest of the notebook; you may think of them as separate Scala files.
This means that only class and object definitions may go inside such a cell. You may not have any variable or function definitions lying around by themselves; we will see an example of a cell that will not work further below.
If you wish to use custom classes and/or objects defined within notebooks reliably in Spark, and across notebook sessions, you must use package cells to define those classes.
Unless you use package cells to define your classes, you may also come across obscure bugs like the following:
// We define a class
case class TestKey(id: Long, str: String)
defined class TestKey
// we use that class as a key in the group by
val rdd = sc.parallelize(Array((TestKey(1L, "abd"), "dss"), (TestKey(2L, "ggs"), "dse"), (TestKey(1L, "abd"), "qrf")))
rdd.groupByKey().collect
rdd: org.apache.spark.rdd.RDD[(TestKey, String)] = ParallelCollectionRDD[121] at parallelize at command-2971213210276613:2
res0: Array[(TestKey, Iterable[String])] = Array((TestKey(2,ggs),CompactBuffer(dse)), (TestKey(1,abd),CompactBuffer(dss)), (TestKey(1,abd),CompactBuffer(qrf)))
What went wrong above? Even though we have two elements for the key TestKey(1L, "abd"), they behaved as two different keys, resulting in:
Array[(TestKey, Iterable[String])] = Array(
(TestKey(2,ggs),CompactBuffer(dse)),
(TestKey(1,abd),CompactBuffer(dss)),
(TestKey(1,abd),CompactBuffer(qrf)))
Once we define our case class within a package cell, we will not face this issue.
package com.databricks.example
case class TestKey(id: Long, str: String)
Warning: classes defined within packages cannot be redefined without a cluster restart.
Compilation successful.
import com.databricks.example
val rdd = sc.parallelize(Array(
  (example.TestKey(1L, "abd"), "dss"), (example.TestKey(2L, "ggs"), "dse"), (example.TestKey(1L, "abd"), "qrf")))
rdd.groupByKey().collect
import com.databricks.example
rdd: org.apache.spark.rdd.RDD[(com.databricks.example.TestKey, String)] = ParallelCollectionRDD[123] at parallelize at command-2971213210276616:3
res2: Array[(com.databricks.example.TestKey, Iterable[String])] = Array((TestKey(1,abd),CompactBuffer(dss, qrf)), (TestKey(2,ggs),CompactBuffer(dse)))
As you can see above, the group by now works as expected, grouping the two elements (dss, qrf) under the single key TestKey(1,abd).
These cells behave as individual source files, therefore only classes and objects can be defined inside them. The following cell will not work:
package x.y.z
val aNumber = 5 // won't work
def functionThatWillNotWork(a: Int): Int = a + 1
The following cell is the way to go.
package x.y.z
object Utils {
  val aNumber = 5 // works!
  def functionThatWillWork(a: Int): Int = a + 1
}
Warning: classes defined within packages cannot be redefined without a cluster restart.
Compilation successful.
import x.y.z.Utils
Utils.functionThatWillWork(Utils.aNumber)
import x.y.z.Utils
res9: Int = 6
Why did we get the warning: classes defined within packages cannot be redefined without a cluster restart?
Classes compiled in package cells get dynamically injected into Spark's classloader. Currently it is not possible to remove classes from Spark's classloader. Any classes that you define and compile take precedence in the classloader, therefore once you recompile (redefine) a class, the new definition will not be visible to your application without a cluster restart.
Well, that kind of defeats the purpose of iterative notebook development if I have to restart the cluster, right? In that case, you may simply rename the package to something like x.y.z2 during development/fast iteration and fix the name once everything works, as sketched below.
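For instance, a minimal sketch of this workaround, reusing the Utils object from above (x.y.z2 here is just a throwaway package name used while iterating):
package x.y.z2

object Utils {
  // same definitions as before, now compiled under a fresh package name
  val aNumber = 5
  def functionThatWillWork(a: Int): Int = a + 1
}
In the calling cell you would then import x.y.z2.Utils instead of x.y.z.Utils; once everything works, rename the package back to x.y.z and restart the cluster once so the final name compiles cleanly.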
One thing to remember with package cells is that they have no visibility into the notebook environment.
- The SparkContext will not be defined as sc.
- The SQLContext will not be defined as sqlContext.
- Did you import a package in a separate cell? Those imports will not be available in the package cell and have to be remade.
- Variables imported through %run cells will not be available.
A package cell is really a standalone file that just looks like a cell in a notebook. This means that any function that uses something defined in a separate cell needs to take that value as a parameter, or the class needs to receive it through its constructor.
package x.y.zpackage
import org.apache.spark.SparkContext
case class IntArray(values: Array[Int])
class MyClass(sc: SparkContext) {
  def sparkSum(array: IntArray): Int = {
    sc.parallelize(array.values).reduce(_ + _)
  }
}

object MyClass {
  def sparkSum(sc: SparkContext, array: IntArray): Int = {
    sc.parallelize(array.values).reduce(_ + _)
  }
}
Warning: classes defined within packages cannot be redefined without a cluster restart.
Compilation successful.
import x.y.zpackage._
val array = IntArray(Array(1, 2, 3, 4, 5))
val myClass = new MyClass(sc)
myClass.sparkSum(array)
import x.y.zpackage._
array: x.y.zpackage.IntArray = IntArray([I@5bb7ffc8)
myClass: x.y.zpackage.MyClass = x.y.zpackage.MyClass@5124c9d2
res11: Int = 15
MyClass.sparkSum(sc, array)
res12: Int = 15
Build Packages Locally
Although package cells are quite handy for quick prototyping in notebook environments, it is better to develop packages locally on your laptop and then upload the packaged or assembled jar file into Databricks to use the classes and methods developed in the package.
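As a rough sketch, a minimal build.sbt for packaging such classes with sbt might look like the following (the project name and version numbers are illustrative assumptions, not taken from this notebook):
name := "example-spark-package"
version := "0.1.0"
scalaVersion := "2.12.15"
// Spark is provided by the cluster at runtime, so mark the dependency as "provided"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.3.0" % "provided"
Running sbt package then produces a jar under target/scala-2.12/ that you can upload to Databricks as a library and attach to your cluster.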
For examples of how to build Scala Spark packages and libraries using mvn or sbt, see:
- using mvn and Scala:
- using sbt and Scala:
- using sbt and Scala while allowing interaction with other language libraries plus terraformed infrastructure: