009_PowerPlantPipeline_01ETLEDA(Scala)


ScaDaMaLe Course site and book

Power Plant ML Pipeline Application - DataFrame Part

This is the Spark SQL part of an end-to-end example of using a number of different machine learning algorithms to solve a supervised regression problem.

This is a break-down of the Power Plant ML Pipeline Application from Databricks.

This will be a recurring example in the sequel.

Table of Contents
  • Step 1: Business Understanding
  • Step 2: Load Your Data
  • Step 3: Explore Your Data
  • Step 4: Visualize Your Data
  • Step 5: Data Preparation
  • Step 6: Data Modeling
  • Step 7: Tuning and Evaluation
  • Step 8: Deployment

We are trying to predict power output given a set of readings from various sensors in a gas-fired power generation plant. Power generation is a complex process, and understanding and predicting power output is an important element in managing a plant and its connection to the power grid.

  • Given this business problem, we need to translate it to a Machine Learning task (actually a Statistical Machine Learning task).
  • The ML task here is regression since the label (or target) we will be trying to predict takes a continuous numeric value.
    • Note: if the labels took values from a finite discrete set, such as Spam/Not-Spam or Good/Bad/Ugly, then the ML task would be classification.

Today, we will only cover Steps 1, 2, 3 and 4 above. You need introductions to linear algebra, stochastic gradient descent and decision trees before we can accomplish the applied ML task with some intuitive understanding. If you can't wait for ML, then check out the Spark MLlib Programming Guide for coming attractions!

The example data is provided by UCI at UCI Machine Learning Repository Combined Cycle Power Plant Data Set

You can read the background on the UCI page, but in summary:

  • we have collected a number of readings from sensors at a gas-fired power plant (also called a peaker plant) and
  • want to use those sensor readings to predict how much power the plant will generate a couple of weeks from now.
  • Again, today we will just focus on Steps 1-4 above that pertain to DataFrames.

More information about Peaker or Peaking Power Plants can be found on Wikipedia https://en.wikipedia.org/wiki/Peaking_power_plant.

displayHTML(frameIt("https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant",500))
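The frameIt helper used above is defined in a hidden cell of the notebook. A minimal sketch of such a helper, assuming it simply wraps a URL in an HTML iframe of a given height (the exact markup in the hidden cell may differ), could look like this:

// hypothetical sketch of the hidden frameIt helper: wrap a URL in an iframe of height h pixels
def frameIt(u: String, h: Int): String = {
  s"""<iframe src="$u" width="95%" height="$h" sandbox>
        <p><a href="$u">Fallback link</a></p>
      </iframe>"""
}

displayHTML then renders the returned HTML string inline in the notebook.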
sc.version.replace(".", "").toInt
res2: Int = 301
// a good habit: check that the code is being run on an appropriate version of Spark - we will actually need Spark 2.0+ later, since we use the SparkSession object spark down the road...
require(sc.version.replace(".", "").toInt >= 140, "Spark 1.4.0+ is required to run this notebook. Please attach it to a Spark 1.4.0+ cluster.")

Step 1: Business Understanding

The first step in any machine learning task is to understand the business need.

As described in the overview we are trying to predict power output given a set of readings from various sensors in a gas-fired power generation plant.

The problem is a regression problem since the label (or target) we are trying to predict is numeric.

Step 2: Load Your Data

Now that we understand what we are trying to do, we need to load our data and describe it, explore it and verify it.

The data has already been downloaded as the following five tab-separated values (TSV) files.

display(dbutils.fs.ls("/databricks-datasets/power-plant/data")) // Ctrl+Enter
 
      path                                                     name         size
  1   dbfs:/databricks-datasets/power-plant/data/Sheet1.tsv    Sheet1.tsv   308693
  2   dbfs:/databricks-datasets/power-plant/data/Sheet2.tsv    Sheet2.tsv   308693
  3   dbfs:/databricks-datasets/power-plant/data/Sheet3.tsv    Sheet3.tsv   308693
  4   dbfs:/databricks-datasets/power-plant/data/Sheet4.tsv    Sheet4.tsv   308693
  5   dbfs:/databricks-datasets/power-plant/data/Sheet5.tsv    Sheet5.tsv   308693

Showing all 5 rows.
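
Before parsing anything, one quick way to peek at the raw contents of one of these files is the Databricks file-system utility dbutils.fs.head, which returns the first few bytes of a file as a string. This is a small sketch (not a cell from the original notebook); the byte count 500 is just an illustrative choice.

// peek at the first 500 bytes of Sheet1.tsv to see the header and a few raw records
dbutils.fs.head("/databricks-datasets/power-plant/data/Sheet1.tsv", 500)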

Now let us load the data from the tab-separated values (TSV) text file into an RDD[String] using the familiar textFile method.

val powerPlantRDD = sc.textFile("/databricks-datasets/power-plant/data/Sheet1.tsv") // Ctrl+Enter
powerPlantRDD: org.apache.spark.rdd.RDD[String] = /databricks-datasets/power-plant/data/Sheet1.tsv MapPartitionsRDD[187] at textFile at command-685894176422961:1
powerPlantRDD.take(5).foreach(println) // Ctrl+Enter to print first 5 lines
AT V AP RH PE
14.96 41.76 1024.07 73.17 463.26
25.18 62.96 1020.04 59.08 444.37
5.11 39.4 1012.16 92.14 488.56
20.86 57.32 1010.24 76.64 446.48
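
Before handing the parsing over to the DataFrame reader below, here is a minimal sketch (not part of the original notebook) of how one could parse this RDD[String] by hand: drop the header line, split each remaining record on the tab delimiter, and map the fields into a case class. The case class name PowerPlantRow is just an illustrative choice.

// hypothetical hand-rolled parsing of the RDD[String]
case class PowerPlantRow(at: Double, v: Double, ap: Double, rh: Double, pe: Double)

val header = powerPlantRDD.first()                 // the first line holds the column names
val powerPlantRowsRDD = powerPlantRDD
  .filter(line => line != header)                  // drop the header line
  .map(line => line.split("\t"))                   // split each record on the tab delimiter
  .map(f => PowerPlantRow(f(0).toDouble, f(1).toDouble, f(2).toDouble, f(3).toDouble, f(4).toDouble))

powerPlantRowsRDD.take(3).foreach(println)         // peek at a few parsed records

The DataFrame reader used next does all of this (plus schema inference) for us.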
// let us make sure we are using Spark 2.2.0 or later - we need at least Spark 2.0 to use the SparkSession entry point (the spark object) below
require(sc.version.replace(".", "").toInt >= 220, "Spark 2.2.0+ is required to run this notebook. Please attach it to a Spark 2.2.0+ cluster.")
// this reads the tsv file and turns it into a dataframe
val powerPlantDF = spark.read // use 'sqlContext.read' instead on older Spark versions (1.3+); see the 008_ notebook
    .format("csv") // use the CSV data source (built into Spark 2.x; the spark-csv package on Spark 1.x)
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .option("delimiter", "\t") // Specify the delimiter as Tab or '\t'
    .load("/databricks-datasets/power-plant/data/Sheet1.tsv")
powerPlantDF: org.apache.spark.sql.DataFrame = [AT: double, V: double ... 3 more fields]
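
As a quick sanity check (a small sketch, not a cell from the original notebook), we can print the inferred schema and peek at a few rows. Per the UCI page, AT is the ambient temperature, V the exhaust vacuum, AP the ambient pressure, RH the relative humidity, and PE the net hourly electrical energy output, which is the label we want to predict.

powerPlantDF.printSchema()   // all five columns should have been inferred as double
powerPlantDF.show(5)         // first few rows: the features AT, V, AP, RH and the label PE
powerPlantDF.count()         // number of records read from Sheet1.tsv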