008_DiamondsPipeline_01ETLEDA(Scala)

Archived YouTube video of this live unedited lab-lecture:

Diamonds ML Pipeline Workflow - DataFrame ETL and EDA Part

These are the Spark SQL parts focused on the extract-transform-load (ETL) and exploratory-data-analysis (EDA) stages of an end-to-end example of a Machine Learning (ML) workflow.

Why are we using DataFrames? Because of the announcement in the Spark MLlib Main Guide for Spark 2.2 (https://spark.apache.org/docs/latest/ml-guide.html) that the "DataFrame-based API is primary API".

This notebook is a scalarific break-down of the pythonic 'Diamonds ML Pipeline Workflow' from the Databricks Guide.

We will see this example again in the sequel

For this example, we analyze the Diamonds dataset from the R Datasets hosted on DBC.

Later on, we will use the DecisionTree algorithm to predict the price of a diamond from its characteristics.

Here is an outline of our pipeline:

  • Step 1. Load data: Load data as DataFrame
  • Step 2. Understand the data: Compute statistics and create visualizations to get a better understanding of the data.
  • Step 3. Hold out data: Split the data randomly into training and test sets. We will not look at the test data until after learning.
  • Step 4. On the training dataset:
    • Extract features: We will index categorical (String-valued) features so that DecisionTree can handle them.
    • Learn a model: Run DecisionTree to learn how to predict a diamond's price from a description of the diamond.
    • Tune the model: Tune the tree depth (complexity) using the training data. (This process is also called model selection.)
  • Step 5. Evaluate the model: Now look at the test dataset. Compare the initial model with the tuned model to see the benefit of tuning parameters.
  • Step 6. Understand the model: We will examine the learned model and results to gain further insight.

In this notebook, we will only cover Step 1 and Step 2 above. The other steps will be revisited in the sequel.
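Before diving into Steps 1 and 2, here is a hedged sketch of how Steps 3-6 of the outline might look in Spark's DataFrame-based ML API. This is only a preview under assumed column names from the Diamonds dataset (and it indexes only the cut column, for brevity); the actual code appears in the sequel.

```scala
// A sketch of Steps 3-6, assuming diamondsDF from Step 1 below
// and the Spark 2.x DataFrame-based ML API.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.DecisionTreeRegressor
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Step 3: hold out a test set; we do not look at it until evaluation.
val Array(trainingDF, testDF) = diamondsDF.randomSplit(Array(0.7, 0.3), seed = 12345L)

// Step 4: index a categorical (String) column so DecisionTree can handle it,
// assemble a feature vector, and learn a tree predicting price.
val cutIndexer = new StringIndexer().setInputCol("cut").setOutputCol("cutIndexed")
val assembler = new VectorAssembler()
  .setInputCols(Array("carat", "cutIndexed", "depth", "table", "x", "y", "z"))
  .setOutputCol("features")
val dt = new DecisionTreeRegressor().setLabelCol("price").setFeaturesCol("features")
val pipeline = new Pipeline().setStages(Array(cutIndexer, assembler, dt))
val model = pipeline.fit(trainingDF)

// Step 5: evaluate on the held-out test set using root-mean-squared error.
val rmse = new RegressionEvaluator().setLabelCol("price").setMetricName("rmse")
  .evaluate(model.transform(testDF))
```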

Step 1. Load data as DataFrame

This section loads a dataset as a DataFrame and examines a few rows of it to understand the schema.

For more info, see the DB guide on importing data.

// We'll use the Diamonds dataset from the R datasets hosted on DBC.
val diamondsFilePath = "dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"
diamondsFilePath: String = dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv
sc.textFile(diamondsFilePath).take(2) // looks like a csv file as it should
res2: Array[String] = Array("","carat","cut","color","clarity","depth","table","price","x","y","z", "1",0.23,"Ideal","E","SI2",61.5,55,326,3.95,3.98,2.43)
val diamondsRawDF = sqlContext.read    // we can use sqlContext instead of SparkSession for backwards compatibility to 1.x
    .format("com.databricks.spark.csv") // use spark.csv package
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    //.option("delimiter", ",") // Specify the delimiter as comma or ',' DEFAULT
    .load(diamondsFilePath)
diamondsRawDF: org.apache.spark.sql.DataFrame = [_c0: int, carat: double ... 9 more fields]
//There are 10 columns.  We will try to predict the price of diamonds, treating the other 9 columns as features.
diamondsRawDF.printSchema()
root
 |-- _c0: integer (nullable = true)
 |-- carat: double (nullable = true)
 |-- cut: string (nullable = true)
 |-- color: string (nullable = true)
 |-- clarity: string (nullable = true)
 |-- depth: double (nullable = true)
 |-- table: double (nullable = true)
 |-- price: integer (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)

Note: (nullable = true) simply means that the value in that column is allowed to be null.
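To make the nullable flag concrete, here is a hypothetical hand-written schema (not used in this notebook) where nullability is set explicitly per field, instead of being inferred:

```scala
import org.apache.spark.sql.types.{StructType, StructField, DoubleType, StringType}

// A hypothetical explicit schema: nullable = false forbids nulls in a column,
// while nullable = true (what inferSchema produces) allows them.
val exampleSchema = StructType(Seq(
  StructField("carat", DoubleType, nullable = true),
  StructField("cut", StringType, nullable = true),
  StructField("price", DoubleType, nullable = false) // no nulls allowed here
))
exampleSchema.printTreeString() // prints a root |-- ... listing like printSchema
```

Such a schema could be passed to a reader via .schema(exampleSchema) instead of .option("inferSchema", "true"), which also avoids the extra pass over the data that inference requires.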

Let us count the number of rows in diamondsRawDF.

diamondsRawDF.count() // Ctrl+Enter
res4: Long = 53940

So there are 53940 records or rows in the DataFrame.

Use the show(n) method to see the first n rows (the default is 20) of the DataFrame, as follows:

diamondsRawDF.show(10)
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|  1| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
|  2| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
|  3| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
|  4| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
|  5| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
|  6| 0.24|Very Good|    J|   VVS2| 62.8| 57.0|  336|3.94|3.96|2.48|
|  7| 0.24|Very Good|    I|   VVS1| 62.3| 57.0|  336|3.95|3.98|2.47|
|  8| 0.26|Very Good|    H|    SI1| 61.9| 55.0|  337|4.07|4.11|2.53|
|  9| 0.22|     Fair|    E|    VS2| 65.1| 61.0|  337|3.87|3.78|2.49|
| 10| 0.23|Very Good|    H|    VS1| 59.4| 61.0|  338| 4.0|4.05|2.39|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
only showing top 10 rows

If you look at the schema of diamondsRawDF you will see that the automatic schema inference of the sqlContext.read method has cast the values in the column price as integer.

To clean up:

  • let's recast the column price as double for downstream ML tasks later and
  • let's also get rid of the first column of row indices.
import org.apache.spark.sql.types.DoubleType
// convert the price column from int to double so we can model, fit and predict in the downstream ML task
val diamondsDF = diamondsRawDF.select($"carat", $"cut", $"color", $"clarity", $"depth", $"table",$"price".cast(DoubleType).as("price"), $"x", $"y", $"z")
diamondsDF.cache() // let's cache it for reuse
diamondsDF.printSchema // print schema
root
 |-- carat: double (nullable = true)
 |-- cut: string (nullable = true)
 |-- color: string (nullable = true)
 |-- clarity: string (nullable = true)
 |-- depth: double (nullable = true)
 |-- table: double (nullable = true)
 |-- price: double (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)
import org.apache.spark.sql.types.DoubleType
diamondsDF: org.apache.spark.sql.DataFrame = [carat: double, cut: string ... 8 more fields]
diamondsDF.show(10,false) // notice that price column has Double values that end in '.0' now
+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|carat|cut      |color|clarity|depth|table|price|x   |y   |z   |
+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|0.23 |Ideal    |E    |SI2    |61.5 |55.0 |326.0|3.95|3.98|2.43|
|0.21 |Premium  |E    |SI1    |59.8 |61.0 |326.0|3.89|3.84|2.31|
|0.23 |Good     |E    |VS1    |56.9 |65.0 |327.0|4.05|4.07|2.31|
|0.29 |Premium  |I    |VS2    |62.4 |58.0 |334.0|4.2 |4.23|2.63|
|0.31 |Good     |J    |SI2    |63.3 |58.0 |335.0|4.34|4.35|2.75|
|0.24 |Very Good|J    |VVS2   |62.8 |57.0 |336.0|3.94|3.96|2.48|
|0.24 |Very Good|I    |VVS1   |62.3 |57.0 |336.0|3.95|3.98|2.47|
|0.26 |Very Good|H    |SI1    |61.9 |55.0 |337.0|4.07|4.11|2.53|
|0.22 |Fair     |E    |VS2    |65.1 |61.0 |337.0|3.87|3.78|2.49|
|0.23 |Very Good|H    |VS1    |59.4 |61.0 |338.0|4.0 |4.05|2.39|
+-----+---------+-----+-------+-----+-----+-----+----+----+----+
only showing top 10 rows
//View DataFrame in databricks
// note this 'display' is a databricks notebook specific command that is quite powerful for visual interaction with the data
// other notebooks like zeppelin have similar commands for interactive visualisation
display(diamondsDF) 
(Interactive table output elided: display(diamondsDF) renders the first 1000 rows of the DataFrame as a sortable, chartable table in the notebook.)

Showing the first 1000 rows.

Step 2. Understand the data

Let's examine the data to get a better understanding of what is there. We only examine a couple of features (columns), but it gives an idea of the type of exploration you might do to understand a new dataset.
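One standard first exploration step is summary statistics on the numeric columns. A minimal sketch, assuming diamondsDF from Step 1:

```scala
// describe computes count, mean, stddev, min and max for the named columns
diamondsDF.describe("depth", "x", "y", "z").show()
```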

For more examples of using Databricks's visualizations (even across languages) see https://docs.databricks.com/user-guide/visualizations/index.html NOW.

We can see that we have a mix of

  • categorical features (cut, color, clarity) and
  • continuous features (depth, x, y, z).
Let's first look at the categorical features.

You can also select one or more individual columns using the DataFrame API.

Let us select the column cut from diamondsDF and create a new DataFrame called cutsDF and then display it as follows:

val cutsDF = diamondsDF.select("cut") // Shift+Enter
cutsDF: org.apache.spark.sql.DataFrame = [cut: string]
cutsDF.show(10) // Ctrl+Enter
+---------+
|      cut|
+---------+
|    Ideal|
|  Premium|
|     Good|
|  Premium|
|     Good|
|Very Good|
|Very Good|
|Very Good|
|     Fair|
|Very Good|
+---------+
only showing top 10 rows

Let us use distinct to find the distinct types of cuts in the dataset.

// View distinct diamond cuts in dataset
val cutsDistinctDF = diamondsDF.select("cut").distinct()
cutsDistinctDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [cut: string]
cutsDistinctDF.show()
+---------+
|      cut|
+---------+
|  Premium|
|    Ideal|
|     Good|
|     Fair|
|Very Good|
+---------+

Clearly, there are just 5 kinds of cuts.
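Beyond distinct, we can also count how many diamonds fall into each kind of cut using groupBy. A small sketch, assuming diamondsDF as above:

```scala
// Frequency of each categorical cut value, most common first
diamondsDF.groupBy("cut").count().orderBy($"count".desc).show()
```

This kind of frequency table is often more informative than distinct alone, since it reveals class imbalance in a categorical feature before any modelling is done.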