%md
Here we will first take excerpts with minor modifications from the end of **Chapter 12. Resilient Distributed Datasets (RDDs)** of *Spark: The Definitive Guide*:
- https://learning.oreilly.com/library/view/spark-the-definitive/9781491912201/ch12.html
Next, we will do Bayesian AB Testing using PipedRDDs.
%md
First, we create the toy RDDs as in *The Definitive Guide*:
> **From a Local Collection**
> To create an RDD from a collection, you will need to use the parallelize method on a SparkContext (within a SparkSession). This turns a single node collection into a parallel collection. When creating this parallel collection, you can also explicitly state the number of partitions into which you would like to distribute this array. In this case, we are creating two partitions:
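A minimal sketch in Scala following the book's example (the sentence and the partition count of 2 are illustrative):

```scala
// A local Scala collection: the words of a sentence
val myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple"
  .split(" ")

// Distribute the local collection as an RDD with 2 partitions
val words = spark.sparkContext.parallelize(myCollection, 2)
```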
%md
> **glom** from *The Definitive Guide*
> `glom` is an interesting function that takes every partition in your dataset and converts them to arrays. This can be useful if you’re going to collect the data to the driver and want to have an array for each partition. However, this can cause serious stability issues because if you have large partitions or a large number of partitions, it’s simple to crash the driver.
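For instance, glomming and collecting a two-partition RDD yields one array per partition on the driver (a minimal sketch; the tiny two-element collection is illustrative):

```scala
// Two elements across two partitions become two one-element arrays on the driver
spark.sparkContext.parallelize(Seq("Hello", "World"), 2).glom().collect()
// res: Array[Array[String]] = Array(Array(Hello), Array(World))
```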
%md
> **Checkpointing** from *The Definitive Guide*
> One feature not available in the DataFrame API is the concept of checkpointing. Checkpointing is the act of saving an RDD to disk so that future references to this RDD point to those intermediate partitions on disk rather than recomputing the RDD from its original source. This is similar to caching except that it’s not stored in memory, only disk. This can be helpful when performing iterative computation, similar to the use cases for caching:
Let's create a directory in `dbfs:///` to checkpoint RDDs into in the sequel. The following `%fs mkdirs /path_to_dir` magic command is a shortcut for creating a directory in `dbfs:///`.
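As a minimal sketch, assuming a hypothetical checkpoint directory `dbfs:/tmp/scadamale/checkpointing` has been created with the `%fs mkdirs` shortcut above:

```scala
// Point the SparkContext at the checkpoint directory (hypothetical path)
spark.sparkContext.setCheckpointDir("dbfs:/tmp/scadamale/checkpointing")

// words is the RDD from the parallelize example above;
// checkpoint() only marks the RDD, the save to disk happens at the next action
words.checkpoint()
words.count() // triggers computation and materializes the checkpoint on disk
```

After the action runs, future references to `words` read the checkpointed partitions from disk instead of recomputing the RDD from its source.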