res2: Int = 240
powerPlantRDD: org.apache.spark.rdd.RDD[String] = /databricks-datasets/power-plant/data/Sheet1.tsv MapPartitionsRDD[10575] at textFile at command-45638284503054:1
AT V AP RH PE
14.96 41.76 1024.07 73.17 463.26
25.18 62.96 1020.04 59.08 444.37
5.11 39.4 1012.16 92.14 488.56
20.86 57.32 1010.24 76.64 446.48
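The RDD above and the sample rows just printed would typically come from a cell along these lines (a sketch; the original cell is not shown, and only the path is taken from the output above):

// Sketch (assumed): read the raw TSV as an RDD of lines
val powerPlantRDD = sc.textFile("/databricks-datasets/power-plant/data/Sheet1.tsv")
// Peek at the header and the first few records
powerPlantRDD.take(5).foreach(println)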
powerPlantDF: org.apache.spark.sql.DataFrame = [AT: double, V: double ... 3 more fields]
root
|-- AT: double (nullable = true)
|-- V: double (nullable = true)
|-- AP: double (nullable = true)
|-- RH: double (nullable = true)
|-- PE: double (nullable = true)
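A cell roughly like the following (an assumption; the exact read options are not shown) would produce the powerPlantDF DataFrame and the schema printed above:

// Sketch (assumed): read the tab-separated file into a DataFrame with typed double columns
val powerPlantDF = spark.read
  .option("header", "true")
  .option("delimiter", "\t")
  .option("inferSchema", "true")
  .csv("/databricks-datasets/power-plant/data/Sheet1.tsv")
powerPlantDF.printSchema()
powerPlantDF.count()  // the count output below suggests 9568 rows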
res9: Long = 9568
+-----+-----+-------+-----+------+
| AT| V| AP| RH| PE|
+-----+-----+-------+-----+------+
|14.96|41.76|1024.07|73.17|463.26|
|25.18|62.96|1020.04|59.08|444.37|
| 5.11| 39.4|1012.16|92.14|488.56|
|20.86|57.32|1010.24|76.64|446.48|
|10.82| 37.5|1009.23|96.62| 473.9|
|26.27|59.44|1012.23|58.77|443.67|
|15.89|43.96|1014.02|75.24|467.35|
| 9.48|44.71|1019.12|66.43|478.42|
|14.64| 45.0|1021.78|41.25|475.98|
|11.74|43.56|1015.14|70.72| 477.5|
+-----+-----+-------+-----+------+
only showing top 10 rows
res12: Long = 9568
+--------+--------------------+-----------+
|database| tableName|isTemporary|
+--------+--------------------+-----------+
| default| adult| false|
| default| business_csv_csv| false|
| default| checkin_table| false|
| default| diamonds| false|
| default| inventory| false|
| default|item_merchant_cat...| false|
| default| items_left_csv| false|
| default| logistic_detail| false|
| default| merchant_ratings| false|
| default| order_data| false|
| default| order_ids_left_csv| false|
| default| repeat_csv| false|
| default| review_2019_csv| false|
| default|sample_logistic_t...| false|
| default| sentimentlex_csv| false|
| default| simple_range| false|
| default| social_media_usage| false|
| default| tip_json| false|
| default| tips_csv_csv| false|
| default| users_csv| false|
+--------+--------------------+-----------+
+------------------------+--------+-----------+---------+-----------+
|name |database|description|tableType|isTemporary|
+------------------------+--------+-----------+---------+-----------+
|adult |default |null |EXTERNAL |false |
|business_csv_csv |default |null |EXTERNAL |false |
|checkin_table |default |null |MANAGED |false |
|diamonds |default |null |EXTERNAL |false |
|inventory |default |null |MANAGED |false |
|item_merchant_categories|default |null |MANAGED |false |
|items_left_csv |default |null |EXTERNAL |false |
|logistic_detail |default |null |MANAGED |false |
|merchant_ratings |default |null |MANAGED |false |
|order_data |default |null |MANAGED |false |
|order_ids_left_csv |default |null |EXTERNAL |false |
|repeat_csv |default |null |MANAGED |false |
|review_2019_csv |default |null |EXTERNAL |false |
|sample_logistic_table |default |null |EXTERNAL |false |
|sentimentlex_csv |default |null |EXTERNAL |false |
|simple_range |default |null |MANAGED |false |
|social_media_usage |default |null |MANAGED |false |
|tip_json |default |null |EXTERNAL |false |
|tips_csv_csv |default |null |EXTERNAL |false |
|users_csv |default |null |EXTERNAL |false |
+------------------------+--------+-----------+---------+-----------+
+---------+---------------------+--------------------------------------+
|name |description |locationUri |
+---------+---------------------+--------------------------------------+
|db_ad_gcs| |dbfs:/user/hive/warehouse/db_ad_gcs.db|
|default |Default Hive database|dbfs:/user/hive/warehouse |
+---------+---------------------+--------------------------------------+
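The table and database listings above would typically come from the Spark 2.x Catalog API, roughly as follows (a sketch, not the original cells):

// Sketch (assumed): inspect the metastore via the Catalog API
spark.catalog.listTables().show(false)      // table names, types and databases
spark.catalog.listDatabases().show(false)   // databases and their locations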
+--------+--------------------+-----------+
|database| tableName|isTemporary|
+--------+--------------------+-----------+
| default| adult| false|
| default| business_csv_csv| false|
| default| checkin_table| false|
| default| diamonds| false|
| default| inventory| false|
| default|item_merchant_cat...| false|
| default| items_left_csv| false|
| default| logistic_detail| false|
| default| merchant_ratings| false|
| default| order_data| false|
| default| order_ids_left_csv| false|
| default| repeat_csv| false|
| default| review_2019_csv| false|
| default|sample_logistic_t...| false|
| default| sentimentlex_csv| false|
| default| simple_range| false|
| default| social_media_usage| false|
| default| tip_json| false|
| default| tips_csv_csv| false|
| default| users_csv| false|
+--------+--------------------+-----------+
only showing top 20 rows
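The next cell reads from a table called power_plant_table, which is not created anywhere in the output shown here; one plausible way it could have been registered (an assumption) is:

// Sketch (assumed): expose the DataFrame under the table name used in the next cell
powerPlantDF.createOrReplaceTempView("power_plant_table")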
import org.apache.spark.ml.feature.VectorAssembler

// make a DataFrame called dataset from the table
val dataset = sqlContext.table("power_plant_table")

val vectorizer = new VectorAssembler()
  .setInputCols(Array("AT", "V", "AP", "RH"))
  .setOutputCol("features")
import org.apache.spark.ml.feature.VectorAssembler
dataset: org.apache.spark.sql.DataFrame = [AT: double, V: double ... 3 more fields]
vectorizer: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_a6baa233f655
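As a quick sanity check (not part of the original output), the assembler can be applied directly; this sketch assumes the dataset defined above:

// Sketch: VectorAssembler packs the four predictor columns into a single 'features' vector
vectorizer.transform(dataset).select("features", "PE").show(3, false)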
// First let's hold out 20% of our data for testing and leave 80% for training
var Array(split20, split80) = dataset.randomSplit(Array(0.20, 0.80), 1800009193L)
split20: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [AT: double, V: double ... 3 more fields]
split80: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [AT: double, V: double ... 3 more fields]
// Let's cache these datasets for performance
val testSet = split20.cache()
val trainingSet = split80.cache()
testSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [AT: double, V: double ... 3 more fields]
trainingSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [AT: double, V: double ... 3 more fields]
// ***** LINEAR REGRESSION MODEL *****
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.regression.LinearRegressionModel
import org.apache.spark.ml.Pipeline

// Let's initialize our linear regression learner
val lr = new LinearRegression()
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.regression.LinearRegressionModel
import org.apache.spark.ml.Pipeline
lr: org.apache.spark.ml.regression.LinearRegression = linReg_ba1ada380272
// We use explainParams() to list the parameters we can set
lr.explainParams()
res34: String =
aggregationDepth: suggested depth for treeAggregate (>= 2) (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0)
epsilon: The shape parameter to control the amount of robustness. Must be > 1.0. (default: 1.35)
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
loss: The loss function to be optimized. Supported options: squaredError, huber. (Default squaredError) (default: squaredError)
maxIter: maximum number of iterations (>= 0) (default: 100)
predictionCol: prediction column name (default: prediction)
regParam: regularization parameter (>= 0) (default: 0.0)
solver: The solver algorithm for optimization. Supported options: auto, normal, l-bfgs. (Default auto) (default: auto)
standardization: whether to standardize the training features before fitting the model (default: true)
tol: the convergence tolerance for iterative algorithms (>= 0) (default: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0 (undefined)
// Now we set the parameters for the method
lr.setPredictionCol("Predicted_PE")
  .setLabelCol("PE")
  .setMaxIter(100)
  .setRegParam(0.1)

// We will use the new spark.ml pipeline API. If you have worked with scikit-learn this will be very familiar.
val lrPipeline = new Pipeline()
lrPipeline.setStages(Array(vectorizer, lr))

// Let's first train on the training set to see what we get
val lrModel = lrPipeline.fit(trainingSet)
lrPipeline: org.apache.spark.ml.Pipeline = pipeline_97dfd6f4e2e5
lrModel: org.apache.spark.ml.PipelineModel = pipeline_97dfd6f4e2e5
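The next cell refers to intercept and coefficentFeaturePairs, which are defined in a cell not shown in this output; one plausible reconstruction from the fitted pipeline (an assumption) is:

// Sketch (assumed): pull the fitted LinearRegressionModel out of the last pipeline stage
val lrm = lrModel.stages.last.asInstanceOf[LinearRegressionModel]
val intercept = lrm.intercept
// Pair each learned weight with its input feature name, as an RDD so it can be sorted by key
val coefficentFeaturePairs = sc.parallelize(lrm.coefficients.toArray.zip(vectorizer.getInputCols))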
// Now let's sort the coefficients from the largest to the smallest
var equation = s"y = $intercept "
//var variables = Array
coefficentFeaturePairs.sortByKey().collect().foreach({
  case (weight, feature) => {
    val symbol = if (weight > 0) "+" else "-"
    val absWeight = Math.abs(weight)
    equation += (s" $symbol (${absWeight} * ${feature})")
  }
})
equation: String = y = 427.9139822165837 - (1.9083064919040942 * AT) - (0.25381293007161654 * V) - (0.1474651301033126 * RH) + (0.08739350304730673 * AP)
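The evaluation cell below uses a predictionsAndLabels DataFrame that is not defined in the output shown; it would plausibly be the fitted pipeline applied to the held-out test set, along these lines:

// Sketch (assumed): score the test set; the pipeline adds the Predicted_PE column
val predictionsAndLabels = lrModel.transform(testSet)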
// Now let's compute some evaluation metrics against our test dataset
import org.apache.spark.mllib.evaluation.RegressionMetrics

val metrics = new RegressionMetrics(
  predictionsAndLabels.select("Predicted_PE", "PE").rdd
    .map(r => (r(0).asInstanceOf[Double], r(1).asInstanceOf[Double]))
)
import org.apache.spark.mllib.evaluation.RegressionMetrics
metrics: org.apache.spark.mllib.evaluation.RegressionMetrics = org.apache.spark.mllib.evaluation.RegressionMetrics@7290c3ef
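The next cell interpolates an rmse value; RegressionMetrics exposes it (and related metrics) directly, for example:

// Root mean squared error used to scale the residuals below
val rmse = metrics.rootMeanSquaredError
// Other metrics available on the same object
val explainedVariance = metrics.explainedVariance
val r2 = metrics.r2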
// Compute the residual error for each prediction and scale it by the RMSE,
// then register the result as a temporary table Power_Plant_RMSE_Evaluation
predictionsAndLabels.selectExpr(
  "PE",
  "Predicted_PE",
  "PE - Predicted_PE AS Residual_Error",
  s"(PE - Predicted_PE) / $rmse AS Within_RSME"
).createOrReplaceTempView("Power_Plant_RMSE_Evaluation")
%sql
SELECT
  CASE WHEN Within_RSME <= 1.0 AND Within_RSME >= -1.0 THEN 1
       WHEN Within_RSME <= 2.0 AND Within_RSME >= -2.0 THEN 2
       ELSE 3
  END AS RSME_Multiple,
  COUNT(*) AS count
FROM Power_Plant_RMSE_Evaluation
GROUP BY
  CASE WHEN Within_RSME <= 1.0 AND Within_RSME >= -1.0 THEN 1
       WHEN Within_RSME <= 2.0 AND Within_RSME >= -2.0 THEN 2
       ELSE 3
  END
// Let's set up our evaluator class to judge the model based on the best root mean squared error
import org.apache.spark.ml.evaluation.RegressionEvaluator

val regEval = new RegressionEvaluator()
regEval.setLabelCol("PE")
  .setPredictionCol("Predicted_PE")
  .setMetricName("rmse")
regEval: org.apache.spark.ml.evaluation.RegressionEvaluator = regEval_7e20af62a956
res44: regEval.type = regEval_7e20af62a956
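With the evaluator configured, a simple usage (a sketch, not shown in the original output) is to score the test-set predictions:

// Sketch: RMSE of the pipeline model on the held-out test set
val testRMSE = regEval.evaluate(lrModel.transform(testSet))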
SDS-2.x, Scalable Data Engineering Science