006a_PipedRDD(Scala)


ScaDaMaLe Course site and book

Piped RDDs and Bayesian AB Testing

Here we will first take excerpts, with minor modifications, from the end of Chapter 12, Resilient Distributed Datasets (RDDs), of Spark: The Definitive Guide.

Next, we will do Bayesian AB Testing using PipedRDDs.

First, we create the toy RDDs as in The Definitive Guide:

From a Local Collection

To create an RDD from a collection, you will need to use the parallelize method on a SparkContext (within a SparkSession). This turns a single-node collection into a parallel collection. When creating this parallel collection, you can also explicitly state the number of partitions into which you would like to distribute this array. In this case, we are creating two partitions:

// in Scala
val myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
val words = spark.sparkContext.parallelize(myCollection, 2)
%python
# in Python
myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple"\
  .split(" ")
words = spark.sparkContext.parallelize(myCollection, 2)
words
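As a quick sanity check (not from the book), the standard RDD method getNumPartitions confirms that the collection was indeed split into the two partitions we asked for:

// in Scala: confirm the number of partitions we requested above
words.getNumPartitions // should return 2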

glom from The Definitive Guide

glom is an interesting function that takes every partition in your dataset and converts it to an array. This can be useful if you’re going to collect the data to the driver and want to have an array for each partition. However, this can cause serious stability issues because if you have large partitions or a large number of partitions, it’s easy to crash the driver.

Let's use glom to see how our words are distributed among the two partitions we used explicitly.

words.glom.collect 
%python
words.glom().collect()

Checkpointing from The Definitive Guide

One feature not available in the DataFrame API is the concept of checkpointing. Checkpointing is the act of saving an RDD to disk so that future references to this RDD point to those intermediate partitions on disk rather than recomputing the RDD from its original source. This is similar to caching except that it’s not stored in memory, only disk. This can be helpful when performing iterative computation, similar to the use cases for caching:

Let's create a directory in dbfs:/// to checkpoint RDDs into in the sequel. The %fs mkdirs /path_to_dir magic below is a shortcut for creating a directory in dbfs:///.

%fs
mkdirs /datasets/ScaDaMaLe/checkpointing/
spark.sparkContext.setCheckpointDir("dbfs:///datasets/ScaDaMaLe/checkpointing")
words.checkpoint()

Now, when we reference this RDD, it will derive from the checkpoint instead of the source data. This can be a helpful optimization.
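Note that checkpoint() is lazy: the RDD is only written to the checkpoint directory once an action runs on it. A minimal sketch to trigger and then verify the checkpoint, using the standard RDD methods isCheckpointed and getCheckpointFile:

// in Scala: an action triggers the actual write to the checkpoint directory
words.count()
words.isCheckpointed    // should now be true
words.getCheckpointFile // Some(dbfs:/datasets/ScaDaMaLe/checkpointing/...)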

YouTry

Here are just some more words, in haha_words, with \n, the End-Of-Line (EOL) character, embedded in-place.
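As a rough sketch of what such a haha_words RDD could look like (the exact strings below are only an assumption, not the notebook's), we can parallelize a few strings that contain \n in-place and peek at the partitions with glom:

// in Scala: a hypothetical haha_words-style RDD whose elements contain "\n" in-place
val haha_words = spark.sparkContext.parallelize(Seq("ha\nha", "he\nhe\nhe", "ho"), 2)
haha_words.glom.collect // inspect how the strings, EOLs included, are laid out per partition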