008_DiamondsPipeline_01ETLEDA(Scala)

Diamonds ML Pipeline Workflow - DataFrame ETL and EDA Part

This is the Spark SQL part, focused on the extract-transform-load (ETL) and exploratory-data-analysis (EDA) stages of an end-to-end example of a Machine Learning (ML) workflow.

Why are we using DataFrames? Because of the announcement in the Spark MLlib Main Guide for Spark 2.2 (https://spark.apache.org/docs/latest/ml-guide.html) that the "DataFrame-based API is primary API".

This notebook is a scalarific break-down of the pythonic 'Diamonds ML Pipeline Workflow' from the Databricks Guide.

We will see this example again in the sequel.

For this example, we analyze the Diamonds dataset from the R Datasets hosted on DBC.

Later on, we will use the DecisionTree algorithm to predict the price of a diamond from its characteristics.

Here is an outline of our pipeline:

  • Step 1. Load data: Load data as DataFrame
  • Step 2. Understand the data: Compute statistics and create visualizations to get a better understanding of the data.
  • Step 3. Hold out data: Split the data randomly into training and test sets. We will not look at the test data until after learning.
  • Step 4. On the training dataset:
    • Extract features: We will index categorical (String-valued) features so that DecisionTree can handle them.
    • Learn a model: Run DecisionTree to learn how to predict a diamond's price from a description of the diamond.
    • Tune the model: Tune the tree depth (complexity) using the training data. (This process is also called model selection.)
  • Step 5. Evaluate the model: Now look at the test dataset. Compare the initial model with the tuned model to see the benefit of tuning parameters.
  • Step 6. Understand the model: We will examine the learned model and results to gain further insight.

In this notebook, we will only cover Step 1 and Step 2 above. The other steps will be revisited in the sequel.

Step 1. Load data as DataFrame

This section loads a dataset as a DataFrame and examines a few rows of it to understand the schema.

For more info, see the DB guide on importing data.

// We'll use the Diamonds dataset from the R datasets hosted on DBC.
val diamondsFilePath = "dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"
diamondsFilePath: String = dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv
sc.textFile(diamondsFilePath).take(2) // looks like a csv file as it should
res46: Array[String] = Array("","carat","cut","color","clarity","depth","table","price","x","y","z", "1",0.23,"Ideal","E","SI2",61.5,55,326,3.95,3.98,2.43)
val diamondsRawDF = sqlContext.read    // we can use sqlContext instead of SparkSession for backwards compatibility to 1.x
    .format("com.databricks.spark.csv") // use spark.csv package
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    //.option("delimiter", ",") // Specify the delimiter as comma or ',' DEFAULT
    .load(diamondsFilePath)
diamondsRawDF: org.apache.spark.sql.DataFrame = [_c0: int, carat: double ... 9 more fields]
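Aside: from Spark 2.x onwards the csv reader is built in, so the same load can also be done through the pre-made SparkSession spark, with no spark-csv package needed. A minimal sketch, assuming Spark 2.x (the name diamondsRawDF2 is just illustrative):

// Spark 2.x only: use the built-in csv data source via the pre-made SparkSession `spark`
val diamondsRawDF2 = spark.read
    .option("header", "true")      // use the first line as the header
    .option("inferSchema", "true") // automatically infer column types
    .csv(diamondsFilePath)         // same file path as above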
//There are 10 data columns (plus a row-index column _c0).  We will try to predict the price of diamonds, treating the other 9 columns as features.
diamondsRawDF.printSchema()
root
 |-- _c0: integer (nullable = true)
 |-- carat: double (nullable = true)
 |-- cut: string (nullable = true)
 |-- color: string (nullable = true)
 |-- clarity: string (nullable = true)
 |-- depth: double (nullable = true)
 |-- table: double (nullable = true)
 |-- price: integer (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)

Note: (nullable = true) simply means that the value is allowed to be null.

Let us count the number of rows in diamondsRawDF.

diamondsRawDF.count() // Ctrl+Enter
res49: Long = 53940

So there are 53940 records or rows in the DataFrame.

Use the show(n) method to see the first n (default is 20) rows of the DataFrame, as follows:

diamondsRawDF.show(10)
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|  1| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
|  2| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
|  3| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
|  4| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
|  5| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
|  6| 0.24|Very Good|    J|   VVS2| 62.8| 57.0|  336|3.94|3.96|2.48|
|  7| 0.24|Very Good|    I|   VVS1| 62.3| 57.0|  336|3.95|3.98|2.47|
|  8| 0.26|Very Good|    H|    SI1| 61.9| 55.0|  337|4.07|4.11|2.53|
|  9| 0.22|     Fair|    E|    VS2| 65.1| 61.0|  337|3.87|3.78|2.49|
| 10| 0.23|Very Good|    H|    VS1| 59.4| 61.0|  338| 4.0|4.05|2.39|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
only showing top 10 rows

If you look at the schema of diamondsRawDF you will see that the automatic schema inference of the sqlContext.read method has cast the values in the column price as integer.

To cleanup:

  • let's recast the column price as double for downstream ML tasks later and
  • let's also get rid of the first column of row indices.
import org.apache.spark.sql.types.DoubleType
// we will convert the price column from int to double so that we can model, fit and predict in the downstream ML task
val diamondsDF = diamondsRawDF.select($"carat", $"cut", $"color", $"clarity", $"depth", $"table",$"price".cast(DoubleType).as("price"), $"x", $"y", $"z")
diamondsDF.cache() // let's cache it for reuse
diamondsDF.printSchema // print schema
root
 |-- carat: double (nullable = true)
 |-- cut: string (nullable = true)
 |-- color: string (nullable = true)
 |-- clarity: string (nullable = true)
 |-- depth: double (nullable = true)
 |-- table: double (nullable = true)
 |-- price: double (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)

import org.apache.spark.sql.types.DoubleType
diamondsDF: org.apache.spark.sql.DataFrame = [carat: double, cut: string ... 8 more fields]
diamondsDF.show(10,false) // notice that price column has Double values that end in '.0' now
+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|carat|cut      |color|clarity|depth|table|price|x   |y   |z   |
+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|0.23 |Ideal    |E    |SI2    |61.5 |55.0 |326.0|3.95|3.98|2.43|
|0.21 |Premium  |E    |SI1    |59.8 |61.0 |326.0|3.89|3.84|2.31|
|0.23 |Good     |E    |VS1    |56.9 |65.0 |327.0|4.05|4.07|2.31|
|0.29 |Premium  |I    |VS2    |62.4 |58.0 |334.0|4.2 |4.23|2.63|
|0.31 |Good     |J    |SI2    |63.3 |58.0 |335.0|4.34|4.35|2.75|
|0.24 |Very Good|J    |VVS2   |62.8 |57.0 |336.0|3.94|3.96|2.48|
|0.24 |Very Good|I    |VVS1   |62.3 |57.0 |336.0|3.95|3.98|2.47|
|0.26 |Very Good|H    |SI1    |61.9 |55.0 |337.0|4.07|4.11|2.53|
|0.22 |Fair     |E    |VS2    |65.1 |61.0 |337.0|3.87|3.78|2.49|
|0.23 |Very Good|H    |VS1    |59.4 |61.0 |338.0|4.0 |4.05|2.39|
+-----+---------+-----+-------+-----+-----+-----+----+----+----+
only showing top 10 rows
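An equivalent way to do this clean-up, sketched with withColumn and drop instead of the select above (it produces the same columns, assuming the diamondsRawDF defined earlier; the name diamondsDFAlt is just illustrative):

// alternative sketch: cast price in place and drop the row-index column _c0
val diamondsDFAlt = diamondsRawDF
    .withColumn("price", $"price".cast(DoubleType)) // recast int -> double
    .drop("_c0")                                    // drop the first column of row indices
diamondsDFAlt.printSchema                           // same schema as diamondsDF above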
//View DataFrame in databricks
// note this 'display' is a databricks notebook specific command that is quite powerful for visual interaction with the data
// other notebooks like zeppelin have similar commands for interactive visualisation
display(diamondsDF) 
[display output elided: interactive table of diamondsDF with columns carat, cut, color, clarity, depth, table, price, x, y, z]

Showing the first 1000 rows.

Step 2. Understand the data

Let's examine the data to get a better understanding of what is there. We only examine a couple of features (columns), but it gives an idea of the type of exploration you might do to understand a new dataset.

For more examples of using Databricks's visualization (even across languages) see https://docs.databricks.com/user-guide/visualizations/index.html NOW.

We can see that we have a mix of

  • categorical features (cut, color, clarity) and
  • continuous features (depth, x, y, z).
Let's first look at the categorical features.

You can also select one or more individual columns using the so-called DataFrame API.

Let us select the column cut from diamondsDF and create a new DataFrame called cutsDF and then display it as follows:

val cutsDF = diamondsDF.select("cut") // Shift+Enter
cutsDF: org.apache.spark.sql.DataFrame = [cut: string]
cutsDF.show(10) // Ctrl+Enter
+---------+
|      cut|
+---------+
|    Ideal|
|  Premium|
|     Good|
|  Premium|
|     Good|
|Very Good|
|Very Good|
|Very Good|
|     Fair|
|Very Good|
+---------+
only showing top 10 rows

Let us use distinct to find the distinct types of cuts in the dataset.

// View distinct diamond cuts in dataset
val cutsDistinctDF = diamondsDF.select("cut").distinct()
cutsDistinctDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [cut: string]
cutsDistinctDF.show()
+---------+
|      cut|
+---------+
|  Premium|
|    Ideal|
|     Good|
|     Fair|
|Very Good|
+---------+

Clearly, there are just 5 kinds of cuts.

// View distinct diamond colors in dataset
val colorsDistinctDF = diamondsDF.select("color").distinct() //.collect()
colorsDistinctDF.show()
+-----+
|color|
+-----+
|    F|
|    E|
|    D|
|    J|
|    G|
|    I|
|    H|
+-----+

colorsDistinctDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [color: string]
// View distinct diamond clarities in dataset
val claritiesDistinctDF = diamondsDF.select("clarity").distinct() // .collect()
claritiesDistinctDF.show()
+-------+
|clarity|
+-------+
|   VVS2|
|    SI1|
|     IF|
|     I1|
|   VVS1|
|    VS2|
|    SI2|
|    VS1|
+-------+

claritiesDistinctDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [clarity: string]
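We can also count the number of distinct levels of all three categorical features in a single aggregation; a small sketch using countDistinct from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.countDistinct
// number of distinct levels per categorical feature, in one pass over the data
diamondsDF.agg(
    countDistinct("cut").as("cuts"),
    countDistinct("color").as("colors"),
    countDistinct("clarity").as("clarities")
  ).show() // expect 5 cuts, 7 colors and 8 clarities, as seen above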

We can examine the distribution of a particular feature by using display().

You Try!

  1. Click on the chart icon, then on Plot Options..., and set:
    • Value=<id>
    • Series groupings='cut'
    • and Aggregation=COUNT.
  2. You can also try this using the columns "color" and "clarity".
display(diamondsDF.select("cut"))
[display output elided: bar chart of diamond counts per cut (Ideal, Premium, Good, Very Good, Fair)]

Showing sample based on the first 1000 rows.

// come on do the same for color NOW!
// and clarity too...
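If you prefer a programmatic alternative to the plotting GUI, the same per-category counts can be computed directly; a minimal sketch using groupBy and count:

// counts per cut, largest first; swap in "color" or "clarity" to explore those too
diamondsDF.groupBy("cut").count().orderBy($"count".desc).show()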

You Try!

Now play around with display of the entire DF, choosing what you want in the GUI instead of via a .select(...) statement as earlier.

For instance, the following display(diamondsDF) shows the counts of the colors: in Plot Options choose a bar chart with Series groupings set to color, Value set to <id> and Aggregation set to COUNT. You can click on Plot Options to see these settings and change them as you wish by dragging and dropping.

 display(diamondsDF)
[display output elided: bar chart of diamond counts per color (D, E, F, G, H, I, J)]

Showing sample based on the first 1000 rows.

Now let's examine one of the continuous features as an example.

//Select: "Plot Options..." --> "Display type" --> "histogram plot" and choose to "Plot over all results" OTHERWISE you get the image from first 1000 rows only
display(diamondsDF.select("carat"))
[display output elided: density histogram of carat, right-skewed with a long tail of larger diamonds]

The above histogram of the diamonds' carat ratings shows that carats have a skewed distribution: Many diamonds are small, but there are a number of diamonds in the dataset which are much larger.

  • Extremely skewed distributions can cause problems for some algorithms (e.g., Linear Regression).
  • However, Decision Trees handle skewed distributions very naturally.
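To quantify the skew without a plot, we can look at a few approximate quantiles of carat; a sketch using approxQuantile from DataFrameStatFunctions (the last argument is the allowed relative error of the approximation):

// approximate 50th, 90th and 99th percentiles of carat
val caratQuantiles = diamondsDF.stat.approxQuantile("carat", Array(0.5, 0.9, 0.99), 0.01)
println(s"median = ${caratQuantiles(0)}, p90 = ${caratQuantiles(1)}, p99 = ${caratQuantiles(2)}")
// a median well below the upper percentiles is consistent with the long right tail seen above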

Note: When you call display to create a histogram like that above, it will plot using a subsample from the dataset (for efficiency), but you can plot using the full dataset by selecting "Plot over all results". For our dataset, the two plots can actually look very different due to the long-tailed distribution.

We will not examine the label distribution for now. It can be helpful to examine the label distribution, but it is best to do so only on the training set, not on the test set which we will hold out for evaluation. These steps will be seen in the sequel.
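The hold-out itself (Step 3 in the outline) is left to the sequel, but as a preview it is a one-liner; a sketch with a fixed seed so the split is reproducible (the 70/30 weights are just an illustrative choice):

// randomly split into a training set and a test set that we will not look at until evaluation
val Array(trainingDF, testDF) = diamondsDF.randomSplit(Array(0.7, 0.3), seed = 12345)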

You Try! Of course knock yourself out visually exploring the dataset more...

display(diamondsDF.select("cut","carat"))
[display output elided: plot of carat grouped by cut (Fair, Good, Ideal, Premium, Very Good)]

Try a scatter plot to see pairwise scatter plots of the continuous features.

display(diamondsDF) //Ctrl+Enter 
[display output elided: pairwise scatter plots of the numeric columns (e.g. x, y, z, depth, table, price)]

Showing sample based on the first 1000 rows.

Note that columns of type string are not in the scatter plot!
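Instead of only eyeballing the scatter plots, we can also compute a pairwise statistic directly; a sketch of the Pearson correlation between two of the continuous columns, using DataFrameStatFunctions:

// Pearson correlation between carat and price
val caratPriceCorr = diamondsDF.stat.corr("carat", "price")
println(s"corr(carat, price) = $caratPriceCorr")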

diamondsDF.printSchema // Ctrl+Enter
root
 |-- carat: double (nullable = true)
 |-- cut: string (nullable = true)
 |-- color: string (nullable = true)
 |-- clarity: string (nullable = true)
 |-- depth: double (nullable = true)
 |-- table: double (nullable = true)
 |-- price: double (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)

Let us run through some basic interactive SQL queries next.

  • HiveQL supports the =, <, >, <=, >= and != operators. It also supports the LIKE operator for fuzzy matching of Strings
  • Enclose Strings in single quotes
  • Multiple conditions can be combined using and and or
  • Enclose conditions in () for precedence
  • ...
  • ...

Why do I need to learn interactive SQL queries?

Such queries in the widely known declarative SQL language can help us explore the data and thereby inform the modeling process!!!

Using the DataFrame API, we can apply a filter after select to transform the DataFrame diamondsDF into the new DataFrame diamondsDColoredDF.

Below, $ is an alias for column: $"color" refers to the column named color.
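Concretely, $"color" is just shorthand (via the implicits already in scope in the notebook) for the column named color; the same filter can be written with the col function. A small sketch of the equivalence:

import org.apache.spark.sql.functions.col
// these two expressions build the same filtered DataFrame
diamondsDF.filter($"color" === "D")
diamondsDF.filter(col("color") === "D")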

Let us select the columns named carat, color and price where the color value is equal to D.

val diamondsDColoredDF = diamondsDF.select("carat", "color", "price").filter($"color" === "D") // Shift+Enter
diamondsDColoredDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [carat: double, color: string ... 1 more field]
diamondsDColoredDF.show(10) // Ctrl+Enter
+-----+-----+-----+
|carat|color|price|
+-----+-----+-----+
| 0.23|    D|357.0|
| 0.23|    D|402.0|
| 0.26|    D|403.0|
| 0.26|    D|403.0|
| 0.26|    D|403.0|
| 0.22|    D|404.0|
|  0.3|    D|552.0|
|  0.3|    D|552.0|
|  0.3|    D|552.0|
| 0.24|    D|553.0|
+-----+-----+-----+
only showing top 10 rows

As you can see all the colors are now 'D'. But to really confirm this we can do the following for fun:

diamondsDColoredDF.select("color").distinct().show
+-----+
|color|
+-----+
|    D|
+-----+

Let's try to do the same in SQL for those who know SQL from before.

First we need to see if the table is registered (not just the DataFrame), and if not we need to register our DataFrame as a temporary table.

sqlContext.tables.show() // Ctrl+Enter to see available tables
+--------+--------------------+-----------+
|database|           tableName|isTemporary|
+--------+--------------------+-----------+
| default|               adult|      false|
| default|    business_csv_csv|      false|
| default|       checkin_table|      false|
| default|            diamonds|      false|
| default|           inventory|      false|
| default|item_merchant_cat...|      false|
| default|      items_left_csv|      false|
| default|     logistic_detail|      false|
| default|    merchant_ratings|      false|
| default|          order_data|      false|
| default|  order_ids_left_csv|      false|
| default|          repeat_csv|      false|
| default|     review_2019_csv|      false|
| default|sample_logistic_t...|      false|
| default|    sentimentlex_csv|      false|
| default|        simple_range|      false|
| default|  social_media_usage|      false|
| default|            tip_json|      false|
| default|        tips_csv_csv|      false|
| default|           users_csv|      false|
+--------+--------------------+-----------+
only showing top 20 rows

Looks like diamonds is already there (if not just execute the following cell).

diamondsDF.createOrReplaceTempView("diamonds")
sqlContext.tables.show() // Ctrl+Enter to see available tables
+--------+--------------------+-----------+
|database|           tableName|isTemporary|
+--------+--------------------+-----------+
| default|               adult|      false|
| default|    business_csv_csv|      false|
| default|       checkin_table|      false|
| default|            diamonds|      false|
| default|           inventory|      false|
| default|item_merchant_cat...|      false|
| default|      items_left_csv|      false|
| default|     logistic_detail|      false|
| default|    merchant_ratings|      false|
| default|          order_data|      false|
| default|  order_ids_left_csv|      false|
| default|          repeat_csv|      false|
| default|     review_2019_csv|      false|
| default|sample_logistic_t...|      false|
| default|    sentimentlex_csv|      false|
| default|        simple_range|      false|
| default|  social_media_usage|      false|
| default|            tip_json|      false|
| default|        tips_csv_csv|      false|
| default|           users_csv|      false|
+--------+--------------------+-----------+
only showing top 20 rows
%sql -- Shift+Enter to do the same in SQL
select carat, color, price from diamonds where color='D'
[query output elided: rows of carat, color, price where color = 'D']

Showing the first 1000 rows.

Alternatively, one could just write the SQL statement in Scala to create a new DataFrame diamondsDColoredDF_FromTable from the table diamonds and display it, as follows:

val diamondsDColoredDF_FromTable = sqlContext.sql("select carat, color, price from diamonds where color='D'") // Shift+Enter
diamondsDColoredDF_FromTable: org.apache.spark.sql.DataFrame = [carat: double, color: string ... 1 more field]
// or if you like use upper case for SQL then this is equivalent
val diamondsDColoredDF_FromTable = sqlContext.sql("SELECT carat, color, price FROM diamonds WHERE color='D'") // Shift+Enter
diamondsDColoredDF_FromTable: org.apache.spark.sql.DataFrame = [carat: double, color: string ... 1 more field]
// from version 2.x onwards you can call from SparkSession, the pre-made spark in spark-shell or databricks notebook
val diamondsDColoredDF_FromTable = spark.sql("SELECT carat, color, price FROM diamonds WHERE color='D'") // Shift+Enter
diamondsDColoredDF_FromTable: org.apache.spark.sql.DataFrame = [carat: double, color: string ... 1 more field]
display(diamondsDColoredDF_FromTable) // Ctrl+Enter to see the same DF!
[display output elided: the same carat, color, price rows where color = 'D']

Showing the first 1000 rows.

// You can also use the familiar wildcard character '%' when matching Strings
display(spark.sql("SELECT * FROM diamonds WHERE clarity LIKE 'V%'"))
[display output elided: all columns for diamonds whose clarity starts with 'V']

Showing the first 1000 rows.
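The same wildcard match can be expressed through the DataFrame API; a sketch using the like method on a column:

// DataFrame API equivalent of: WHERE clarity LIKE 'V%'
display(diamondsDF.filter($"clarity".like("V%")))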

// Combining conditions
display(spark.sql("SELECT * FROM diamonds WHERE clarity LIKE 'V%' AND price > 10000"))
[display output elided: all columns for diamonds whose clarity starts with 'V' and price > 10000]

Showing the first 1000 rows.
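The combined condition also has a DataFrame API counterpart, using && (and || for OR) on columns; a sketch:

// DataFrame API equivalent of: WHERE clarity LIKE 'V%' AND price > 10000
display(diamondsDF.filter($"clarity".like("V%") && $"price" > 10000))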

// selecting a subset of fields
display(spark.sql("SELECT carat, clarity, price FROM diamonds WHERE color = 'D'"))
[display output elided: carat, clarity, price for diamonds with color = 'D']

Showing the first 1000 rows.

//renaming a field using as
display(spark.sql("SELECT carat AS carrot, clarity, price FROM diamonds"))
[display output elided: carrot (carat renamed), clarity, price for all diamonds]

Showing the first 1000 rows.
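Renaming has DataFrame API counterparts too, either with as on a selected column or with withColumnRenamed on the whole DataFrame; a sketch:

// equivalent of: SELECT carat AS carrot, clarity, price FROM diamonds
display(diamondsDF.select($"carat".as("carrot"), $"clarity", $"price"))
// or rename the column in place on the whole DataFrame
val diamondsRenamedDF = diamondsDF.withColumnRenamed("carat", "carrot")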

//sorting
display(spark.sql("SELECT carat, clarity, price FROM diamonds ORDER BY price DESC"))
[display output elided: carat, clarity, price ordered by price descending]

Showing the first 1000 rows.

diamondsDF.printSchema // since price is a double in the DF that was registered as the table, we can rely on the descending sort on doubles
root
 |-- carat: double (nullable = true)
 |-- cut: string (nullable = true)
 |-- color: string (nullable = true)
 |-- clarity: string (nullable = true)
 |-- depth: double (nullable = true)
 |-- table: double (nullable = true)
 |-- price: double (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)
// sort by multiple fields
display(spark.sql("SELECT carat, clarity, price FROM diamonds ORDER BY carat ASC, price DESC"))
[display output elided: carat, clarity, price ordered by carat ascending, then price descending]

Showing the first 1000 rows.
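The multi-field sort can likewise be written against the DataFrame directly, using orderBy with per-column asc/desc; a sketch:

// DataFrame API equivalent of: ORDER BY carat ASC, price DESC
display(diamondsDF.select("carat", "clarity", "price").orderBy($"carat".asc, $"price".desc))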