008_DiamondsPipeline_01ETLEDA(Scala)

Diamonds ML Pipeline Workflow - DataFrame ETL and EDA Part

This is the Spark SQL part, focused on the extract-transform-load (ETL) and exploratory-data-analysis (EDA) stages of an end-to-end example of a Machine Learning (ML) workflow.

Why are we using DataFrames? Because of the announcement in the Spark MLlib Main Guide for Spark 2.2 (https://spark.apache.org/docs/latest/ml-guide.html) that the "DataFrame-based API is primary API".

This notebook is a scalarific break-down of the pythonic 'Diamonds ML Pipeline Workflow' from the Databricks Guide.

We will see this example again in the sequel.

For this example, we analyze the Diamonds dataset from the R Datasets hosted on DBC.

Later on, we will use the DecisionTree algorithm to predict the price of a diamond from its characteristics.

Here is an outline of our pipeline:

  • Step 1. Load data: Load data as DataFrame
  • Step 2. Understand the data: Compute statistics and create visualizations to get a better understanding of the data.
  • Step 3. Hold out data: Split the data randomly into training and test sets. We will not look at the test data until after learning.
  • Step 4. On the training dataset:
    • Extract features: We will index categorical (String-valued) features so that DecisionTree can handle them.
    • Learn a model: Run DecisionTree to learn how to predict a diamond's price from a description of the diamond.
    • Tune the model: Tune the tree depth (complexity) using the training data. (This process is also called model selection.)
  • Step 5. Evaluate the model: Now look at the test dataset. Compare the initial model with the tuned model to see the benefit of tuning parameters.
  • Step 6. Understand the model: We will examine the learned model and results to gain further insight.

In this notebook, we will only cover Step 1 and Step 2 above. The other steps will be revisited in the sequel.

Step 1. Load data as DataFrame

This section loads a dataset as a DataFrame and examines a few rows of it to understand the schema.

For more info, see the DB guide on importing data.

// We'll use the Diamonds dataset from the R datasets hosted on DBC.
val diamondsFilePath = "dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"
diamondsFilePath: String = dbfs:/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv
sc.textFile(diamondsFilePath).take(2) // looks like a csv file as it should
res46: Array[String] = Array("","carat","cut","color","clarity","depth","table","price","x","y","z", "1",0.23,"Ideal","E","SI2",61.5,55,326,3.95,3.98,2.43)
val diamondsRawDF = sqlContext.read    // we can use sqlContext instead of SparkSession for backwards compatibility to 1.x
    .format("com.databricks.spark.csv") // use spark.csv package
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    //.option("delimiter", ",") // Specify the delimiter as comma or ',' DEFAULT
    .load(diamondsFilePath)
diamondsRawDF: org.apache.spark.sql.DataFrame = [_c0: int, carat: double ... 9 more fields]
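Aside: from Spark 2.x onwards the csv reader is built in, so the same load can also be done through the pre-made SparkSession spark, with no spark-csv package needed. A minimal sketch, assuming Spark 2.x (the name diamondsRawDF2 is just illustrative):

// Spark 2.x only: use the built-in csv data source via the pre-made SparkSession `spark`
val diamondsRawDF2 = spark.read
    .option("header", "true")      // use the first line as the header
    .option("inferSchema", "true") // automatically infer column types
    .csv(diamondsFilePath)         // same file path as above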
//There are 10 data columns (plus a row-index column _c0).  We will try to predict the price of diamonds, treating the other 9 columns as features.
diamondsRawDF.printSchema()
root
 |-- _c0: integer (nullable = true)
 |-- carat: double (nullable = true)
 |-- cut: string (nullable = true)
 |-- color: string (nullable = true)
 |-- clarity: string (nullable = true)
 |-- depth: double (nullable = true)
 |-- table: double (nullable = true)
 |-- price: integer (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)

Note: (nullable = true) simply means that the value is allowed to be null.

Let us count the number of rows in diamondsRawDF.

diamondsRawDF.count() // Ctrl+Enter
res49: Long = 53940

So there are 53940 records or rows in the DataFrame.

Use the show(n) method to see the first n (default is 20) rows of the DataFrame, as follows:

diamondsRawDF.show(10)
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|  1| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
|  2| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
|  3| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
|  4| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
|  5| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
|  6| 0.24|Very Good|    J|   VVS2| 62.8| 57.0|  336|3.94|3.96|2.48|
|  7| 0.24|Very Good|    I|   VVS1| 62.3| 57.0|  336|3.95|3.98|2.47|
|  8| 0.26|Very Good|    H|    SI1| 61.9| 55.0|  337|4.07|4.11|2.53|
|  9| 0.22|     Fair|    E|    VS2| 65.1| 61.0|  337|3.87|3.78|2.49|
| 10| 0.23|Very Good|    H|    VS1| 59.4| 61.0|  338| 4.0|4.05|2.39|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
only showing top 10 rows

If you look at the schema of diamondsRawDF you will see that the automatic schema inference of the sqlContext.read method has cast the values in the column price as integer.

To cleanup:

  • let's recast the column price as double for downstream ML tasks later and
  • let's also get rid of the first column of row indices.
import org.apache.spark.sql.types.DoubleType
// we will convert the price column from int to double so that we can model, fit and predict in the downstream ML task
val diamondsDF = diamondsRawDF.select($"carat", $"cut", $"color", $"clarity", $"depth", $"table",$"price".cast(DoubleType).as("price"), $"x", $"y", $"z")
diamondsDF.cache() // let's cache it for reuse
diamondsDF.printSchema // print schema
root
 |-- carat: double (nullable = true)
 |-- cut: string (nullable = true)
 |-- color: string (nullable = true)
 |-- clarity: string (nullable = true)
 |-- depth: double (nullable = true)
 |-- table: double (nullable = true)
 |-- price: double (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)

import org.apache.spark.sql.types.DoubleType
diamondsDF: org.apache.spark.sql.DataFrame = [carat: double, cut: string ... 8 more fields]
diamondsDF.show(10,false) // notice that price column has Double values that end in '.0' now
+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|carat|cut      |color|clarity|depth|table|price|x   |y   |z   |
+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|0.23 |Ideal    |E    |SI2    |61.5 |55.0 |326.0|3.95|3.98|2.43|
|0.21 |Premium  |E    |SI1    |59.8 |61.0 |326.0|3.89|3.84|2.31|
|0.23 |Good     |E    |VS1    |56.9 |65.0 |327.0|4.05|4.07|2.31|
|0.29 |Premium  |I    |VS2    |62.4 |58.0 |334.0|4.2 |4.23|2.63|
|0.31 |Good     |J    |SI2    |63.3 |58.0 |335.0|4.34|4.35|2.75|
|0.24 |Very Good|J    |VVS2   |62.8 |57.0 |336.0|3.94|3.96|2.48|
|0.24 |Very Good|I    |VVS1   |62.3 |57.0 |336.0|3.95|3.98|2.47|
|0.26 |Very Good|H    |SI1    |61.9 |55.0 |337.0|4.07|4.11|2.53|
|0.22 |Fair     |E    |VS2    |65.1 |61.0 |337.0|3.87|3.78|2.49|
|0.23 |Very Good|H    |VS1    |59.4 |61.0 |338.0|4.0 |4.05|2.39|
+-----+---------+-----+-------+-----+-----+-----+----+----+----+
only showing top 10 rows
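An equivalent way to do this clean-up, sketched with withColumn and drop instead of the select above (it produces the same columns, assuming the diamondsRawDF defined earlier; the name diamondsDFAlt is just illustrative):

// alternative sketch: cast price in place and drop the row-index column _c0
val diamondsDFAlt = diamondsRawDF
    .withColumn("price", $"price".cast(DoubleType)) // recast int -> double
    .drop("_c0")                                    // drop the first column of row indices
diamondsDFAlt.printSchema                           // same schema as diamondsDF above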
//View DataFrame in databricks
// note this 'display' is a databricks notebook specific command that is quite powerful for visual interaction with the data
// other notebooks like zeppelin have similar commands for interactive visualisation
display(diamondsDF) 
[display output elided: interactive table of diamondsDF with columns carat, cut, color, clarity, depth, table, price, x, y, z]

Showing the first 1000 rows.

Step 2. Understand the data

Let's examine the data to get a better understanding of what is there. We only examine a couple of features (columns), but it gives an idea of the type of exploration you might do to understand a new dataset.

For more examples of using Databricks's visualization (even across languages) see https://docs.databricks.com/user-guide/visualizations/index.html NOW.

We can see that we have a mix of

  • categorical features (cut, color, clarity) and
  • continuous features (depth, x, y, z).
Let's first look at the categorical features.

You can also select one or more individual columns using the so-called DataFrame API.

Let us select the column cut from diamondsDF and create a new DataFrame called cutsDF and then display it as follows:

val cutsDF = diamondsDF.select("cut") // Shift+Enter
cutsDF: org.apache.spark.sql.DataFrame = [cut: string]
cutsDF.show(10) // Ctrl+Enter
+---------+
|      cut|
+---------+
|    Ideal|
|  Premium|
|     Good|
|  Premium|
|     Good|
|Very Good|
|Very Good|
|Very Good|
|     Fair|
|Very Good|
+---------+
only showing top 10 rows

Let us use distinct to find the distinct types of cuts in the dataset.

// View distinct diamond cuts in dataset
val cutsDistinctDF = diamondsDF.select("cut").distinct()
cutsDistinctDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [cut: string]
cutsDistinctDF.show()
+---------+
|      cut|
+---------+
|  Premium|
|    Ideal|
|     Good|
|     Fair|
|Very Good|
+---------+

Clearly, there are just 5 kinds of cuts.

// View distinct diamond colors in dataset
val colorsDistinctDF = diamondsDF.select("color").distinct() //.collect()
colorsDistinctDF.show()
+-----+
|color|
+-----+
|    F|
|    E|
|    D|
|    J|
|    G|
|    I|
|    H|
+-----+

colorsDistinctDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [color: string]
// View distinct diamond clarities in dataset
val claritiesDistinctDF = diamondsDF.select("clarity").distinct() // .collect()
claritiesDistinctDF.show()
+-------+
|clarity|
+-------+
|   VVS2|
|    SI1|
|     IF|
|     I1|
|   VVS1|
|    VS2|
|    SI2|
|    VS1|
+-------+

claritiesDistinctDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [clarity: string]
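We can also count the number of distinct levels of all three categorical features in a single aggregation; a small sketch using countDistinct from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.countDistinct
// number of distinct levels per categorical feature, in one pass over the data
diamondsDF.agg(
    countDistinct("cut").as("cuts"),
    countDistinct("color").as("colors"),
    countDistinct("clarity").as("clarities")
  ).show() // expect 5 cuts, 7 colors and 8 clarities, as seen above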

We can examine the distribution of a particular feature by using display().

You Try!

  1. Click on the chart icon, then on Plot Options..., and set:
    • Value=<id>
    • Series groupings='cut'
    • and Aggregation=COUNT.
  2. You can also try this using the columns "color" and "clarity".
display(diamondsDF.select("cut"))
[display output elided: bar chart of diamond counts per cut (Ideal, Premium, Good, Very Good, Fair)]

Showing sample based on the first 1000 rows.

// come on do the same for color NOW!
// and clarity too...
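If you prefer a programmatic alternative to the plotting GUI, the same per-category counts can be computed directly; a minimal sketch using groupBy and count:

// counts per cut, largest first; swap in "color" or "clarity" to explore those too
diamondsDF.groupBy("cut").count().orderBy($"count".desc).show()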

You Try!

Now play around with display of the entire DF, choosing what you want in the GUI instead of via a .select(...) statement as earlier.

For instance, the following display(diamondsDF) shows the counts of the colors: in Plot Options choose a bar chart with Series groupings set to color, Value set to <id> and Aggregation set to COUNT. You can click on Plot Options to see these settings and change them as you wish by dragging and dropping.

 display(diamondsDF)
[display output elided: bar chart of diamond counts per color (D, E, F, G, H, I, J)]

Showing sample based on the first 1000 rows.

Now let's examine one of the continuous features as an example.

//Select: "Plot Options..." --> "Display type" --> "histogram plot" and choose to "Plot over all results" OTHERWISE you get the image from first 1000 rows only
display(diamondsDF.select("carat"))
[display output elided: density histogram of carat, right-skewed with a long tail of larger diamonds]

The above histogram of the diamonds' carat ratings shows that carats have a skewed distribution: Many diamonds are small, but there are a number of diamonds in the dataset which are much larger.

  • Extremely skewed distributions can cause problems for some algorithms (e.g., Linear Regression).
  • However, Decision Trees handle skewed distributions very naturally.
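To quantify the skew without a plot, we can look at a few approximate quantiles of carat; a sketch using approxQuantile from DataFrameStatFunctions (the last argument is the allowed relative error of the approximation):

// approximate 50th, 90th and 99th percentiles of carat
val caratQuantiles = diamondsDF.stat.approxQuantile("carat", Array(0.5, 0.9, 0.99), 0.01)
println(s"median = ${caratQuantiles(0)}, p90 = ${caratQuantiles(1)}, p99 = ${caratQuantiles(2)}")
// a median well below the upper percentiles is consistent with the long right tail seen above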

Note: When you call display to create a histogram like that above, it will plot using a subsample from the dataset (for efficiency), but you can plot using the full dataset by selecting "Plot over all results". For our dataset, the two plots can actually look very different due to the long-tailed distribution.

We will not examine the label distribution for now. It can be helpful to examine the label distribution, but it is best to do so only on the training set, not on the test set which we will hold out for evaluation. These steps will be seen in the sequel.
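The hold-out itself (Step 3 in the outline) is left to the sequel, but as a preview it is a one-liner; a sketch with a fixed seed so the split is reproducible (the 70/30 weights are just an illustrative choice):

// randomly split into a training set and a test set that we will not look at until evaluation
val Array(trainingDF, testDF) = diamondsDF.randomSplit(Array(0.7, 0.3), seed = 12345)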

You Try! Of course knock yourself out visually exploring the dataset more...

display(diamondsDF.select("cut","carat"))
[display output elided: plot of carat grouped by cut (Fair, Good, Ideal, Premium, Very Good)]

Try a scatter plot to see pairwise scatter plots of the continuous features.

display(diamondsDF) //Ctrl+Enter 
[display output elided: pairwise scatter plots of the numeric columns (e.g. x, y, z, depth, table, price)]

Showing sample based on the first 1000 rows.

Note that columns of type string are not in the scatter plot!
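Instead of only eyeballing the scatter plots, we can also compute a pairwise statistic directly; a sketch of the Pearson correlation between two of the continuous columns, using DataFrameStatFunctions:

// Pearson correlation between carat and price
val caratPriceCorr = diamondsDF.stat.corr("carat", "price")
println(s"corr(carat, price) = $caratPriceCorr")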

diamondsDF.printSchema // Ctrl+Enter
root
 |-- carat: double (nullable = true)
 |-- cut: string (nullable = true)
 |-- color: string (nullable = true)
 |-- clarity: string (nullable = true)
 |-- depth: double (nullable = true)
 |-- table: double (nullable = true)
 |-- price: double (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)

Let us run through some basic interactive SQL queries next.

  • HiveQL supports the =, <, >, <=, >= and != operators. It also supports the LIKE operator for fuzzy matching of Strings
  • Enclose Strings in single quotes
  • Multiple conditions can be combined using and and or
  • Enclose conditions in () for precedence
  • ...
  • ...

Why do I need to learn interactive SQL queries?

Such queries in the widely known declarative SQL language can help us explore the data and thereby inform the modeling process!!!

Using the DataFrame API, we can apply a filter after select to transform the DataFrame diamondsDF into the new DataFrame diamondsDColoredDF.

Below, $ is an alias for column: $"color" refers to the column named color.
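Concretely, $"color" is just shorthand (via the implicits already in scope in the notebook) for the column named color; the same filter can be written with the col function. A small sketch of the equivalence:

import org.apache.spark.sql.functions.col
// these two expressions build the same filtered DataFrame
diamondsDF.filter($"color" === "D")
diamondsDF.filter(col("color") === "D")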

Let us select the columns named carat, color and price where the color value is equal to D.

val diamondsDColoredDF = diamondsDF.select("carat", "color", "price").filter($"color" === "D") // Shift+Enter
diamondsDColoredDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [carat: double, color: string ... 1 more field]
diamondsDColoredDF.show(10) // Ctrl+Enter
+-----+-----+-----+
|carat|color|price|
+-----+-----+-----+
| 0.23|    D|357.0|
| 0.23|    D|402.0|
| 0.26|    D|403.0|
| 0.26|    D|403.0|
| 0.26|    D|403.0|
| 0.22|    D|404.0|
|  0.3|    D|552.0|
|  0.3|    D|552.0|
|  0.3|    D|552.0|
| 0.24|    D|553.0|
+-----+-----+-----+
only showing top 10 rows

As you can see all the colors are now 'D'. But to really confirm this we can do the following for fun:

diamondsDColoredDF.select("color").distinct().show
+-----+
|color|
+-----+
|    D|
+-----+

Let's try to do the same in SQL for those who know SQL from before.

First we need to see if the table is registered (not just the DataFrame), and if not we need to register our DataFrame as a temporary table.

sqlContext.tables.show() // Ctrl+Enter to see available tables
+--------+--------------------+-----------+
|database|           tableName|isTemporary|
+--------+--------------------+-----------+
| default|               adult|      false|
| default|    business_csv_csv|      false|
| default|       checkin_table|      false|
| default|            diamonds|      false|
| default|           inventory|      false|
| default|item_merchant_cat...|      false|
| default|      items_left_csv|      false|
| default|     logistic_detail|      false|
| default|    merchant_ratings|      false|
| default|          order_data|      false|
| default|  order_ids_left_csv|      false|
| default|          repeat_csv|      false|
| default|     review_2019_csv|      false|
| default|sample_logistic_t...|      false|
| default|    sentimentlex_csv|      false|
| default|        simple_range|      false|
| default|  social_media_usage|      false|
| default|            tip_json|      false|
| default|        tips_csv_csv|      false|
| default|           users_csv|      false|
+--------+--------------------+-----------+
only showing top 20 rows

Looks like diamonds is already there (if not just execute the following cell).

diamondsDF.createOrReplaceTempView("diamonds")
sqlContext.tables.show() // Ctrl+Enter to see available tables
+--------+--------------------+-----------+
|database|           tableName|isTemporary|
+--------+--------------------+-----------+
| default|               adult|      false|
| default|    business_csv_csv|      false|
| default|       checkin_table|      false|
| default|            diamonds|      false|
| default|           inventory|      false|
| default|item_merchant_cat...|      false|
| default|      items_left_csv|      false|
| default|     logistic_detail|      false|
| default|    merchant_ratings|      false|
| default|          order_data|      false|
| default|  order_ids_left_csv|      false|
| default|          repeat_csv|      false|
| default|     review_2019_csv|      false|
| default|sample_logistic_t...|      false|
| default|    sentimentlex_csv|      false|
| default|        simple_range|      false|
| default|  social_media_usage|      false|
| default|            tip_json|      false|
| default|        tips_csv_csv|      false|
| default|           users_csv|      false|
+--------+--------------------+-----------+
only showing top 20 rows
%sql -- Shift+Enter to do the same in SQL
select carat, color, price from diamonds where color='D'
[query output elided: rows of carat, color, price where color = 'D']

Showing the first 1000 rows.

Alternatively, one could just write the SQL statement in Scala to create a new DataFrame diamondsDColoredDF_FromTable from the table diamonds and display it, as follows:

val diamondsDColoredDF_FromTable = sqlContext.sql("select carat, color, price from diamonds where color='D'") // Shift+Enter
diamondsDColoredDF_FromTable: org.apache.spark.sql.DataFrame = [carat: double, color: string ... 1 more field]
// or if you like use upper case for SQL then this is equivalent
val diamondsDColoredDF_FromTable = sqlContext.sql("SELECT carat, color, price FROM diamonds WHERE color='D'") // Shift+Enter
diamondsDColoredDF_FromTable: org.apache.spark.sql.DataFrame = [carat: double, color: string ... 1 more field]
// from version 2.x onwards you can call from SparkSession, the pre-made spark in spark-shell or databricks notebook
val diamondsDColoredDF_FromTable = spark.sql("SELECT carat, color, price FROM diamonds WHERE color='D'") // Shift+Enter
diamondsDColoredDF_FromTable: org.apache.spark.sql.DataFrame = [carat: double, color: string ... 1 more field]
display(diamondsDColoredDF_FromTable) // Ctrl+Enter to see the same DF!
[display output elided: the same carat, color, price rows where color = 'D']

Showing the first 1000 rows.

// You can also use the familiar wildcard character '%' when matching Strings
display(spark.sql("SELECT * FROM diamonds WHERE clarity LIKE 'V%'"))
[display output elided: all columns for diamonds whose clarity starts with 'V']

Showing the first 1000 rows.
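The same wildcard match can be expressed through the DataFrame API; a sketch using the like method on a column:

// DataFrame API equivalent of: WHERE clarity LIKE 'V%'
display(diamondsDF.filter($"clarity".like("V%")))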

// Combining conditions
display(spark.sql("SELECT * FROM diamonds WHERE clarity LIKE 'V%' AND price > 10000"))
[display output elided: all columns for diamonds whose clarity starts with 'V' and price > 10000]

Showing the first 1000 rows.
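The combined condition also has a DataFrame API counterpart, using && (and || for OR) on columns; a sketch:

// DataFrame API equivalent of: WHERE clarity LIKE 'V%' AND price > 10000
display(diamondsDF.filter($"clarity".like("V%") && $"price" > 10000))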

// selecting a subset of fields
display(spark.sql("SELECT carat, clarity, price FROM diamonds WHERE color = 'D'"))
[display output elided: carat, clarity, price for diamonds with color = 'D']

Showing the first 1000 rows.

//renaming a field using as
display(spark.sql("SELECT carat AS carrot, clarity, price FROM diamonds"))
[display output elided: carrot (carat renamed), clarity, price for all diamonds]

Showing the first 1000 rows.
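Renaming has DataFrame API counterparts too, either with as on a selected column or with withColumnRenamed on the whole DataFrame; a sketch:

// equivalent of: SELECT carat AS carrot, clarity, price FROM diamonds
display(diamondsDF.select($"carat".as("carrot"), $"clarity", $"price"))
// or rename the column in place on the whole DataFrame
val diamondsRenamedDF = diamondsDF.withColumnRenamed("carat", "carrot")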

//sorting
display(spark.sql("SELECT carat, clarity, price FROM diamonds ORDER BY price DESC"))
[display output elided: carat, clarity, price ordered by price descending]

Showing the first 1000 rows.

diamondsDF.printSchema // since price is a double in the DF that was registered as the table, we can rely on the descending sort on doubles
root
 |-- carat: double (nullable = true)
 |-- cut: string (nullable = true)
 |-- color: string (nullable = true)
 |-- clarity: string (nullable = true)
 |-- depth: double (nullable = true)
 |-- table: double (nullable = true)
 |-- price: double (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)
// sort by multiple fields
display(spark.sql("SELECT carat, clarity, price FROM diamonds ORDER BY carat ASC, price DESC"))
[display output elided: carat, clarity, price ordered by carat ascending, then price descending]

Showing the first 1000 rows.
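The multi-field sort can likewise be written against the DataFrame directly, using orderBy with per-column asc/desc; a sketch:

// DataFrame API equivalent of: ORDER BY carat ASC, price DESC
display(diamondsDF.select("carat", "clarity", "price").orderBy($"carat".asc, $"price".desc))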