%md
This is the Spark SQL part of an end-to-end example that uses a number of different machine learning algorithms to solve a supervised regression problem.
This is a breakdown of the *Power Plant ML Pipeline Application* from Databricks.
**This will be a recurring example in the sequel.**
- **Step 1: Business Understanding**
- **Step 2: Load Your Data**
- **Step 3: Explore Your Data**
- **Step 4: Visualize Your Data**
- *Step 5: Data Preparation*
- *Step 6: Data Modeling*
- *Step 7: Tuning and Evaluation*
- *Step 8: Deployment*
*We are trying to predict power output given a set of readings from various sensors in a gas-fired power generation plant. Power generation is a complex process, and understanding and predicting power output is an important element in managing a plant and its connection to the power grid.*
* Given this business problem, we need to translate it to a Machine Learning task (actually a *Statistical* Machine Learning task).
* The ML task here is *regression*, since the label (or target) we are trying to predict takes a *continuous numeric* value.
* Note: if the labels took values from a finite discrete set, then the ML task would be *classification*.
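
The distinction between the two tasks can be sketched in a few lines of plain Python. This is only an illustrative heuristic, not part of the original application: the function name `ml_task_for` and the example label values are made up here, and the rule "all labels are floats" is an assumption standing in for "continuous numeric".

```python
def ml_task_for(labels):
    """Guess the ML task from a collection of labels.

    Heuristic assumption (for illustration only): labels that are all
    Python floats are treated as continuous, so the task is regression;
    anything else (strings, booleans, integer class codes) is treated
    as a finite discrete set, so the task is classification.
    """
    labels = list(labels)
    if not labels:
        raise ValueError("need at least one label")
    if all(isinstance(y, float) for y in labels):
        return "regression"
    return "classification"


# Made-up label values, used only to exercise the function.
continuous_labels = [450.0, 460.5, 471.25]
discrete_labels = ["class_a", "class_b", "class_a"]

regression_task = ml_task_for(continuous_labels)
classification_task = ml_task_for(discrete_labels)
```

In the power plant problem the label is the power output, a continuous quantity, so the first case is the one that applies.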
**Today, we will only cover Steps 1, 2, 3 and 4 above**. You need introductions to linear algebra, stochastic gradient descent and decision trees before you can accomplish the **applied ML task** with some intuitive understanding. If you can't wait for ML, then **check out the [Spark MLlib Programming Guide](https://spark.apache.org/docs/latest/mllib-guide.html) for coming attractions!**
The example data is provided by UCI at the [UCI Machine Learning Repository Combined Cycle Power Plant Data Set](https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant).
You can read the background on the UCI page, but in summary:
* we have collected a number of readings from sensors at a gas-fired power plant (also called a peaker plant) and
* want to use those sensor readings to predict how much power the plant will generate a couple of weeks from now.
* Again, today we will just focus on Steps 1-4 above that pertain to DataFrames.
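
As a rough sketch of how such sensor readings map onto a supervised regression problem, the snippet below splits a small CSV into feature vectors and a numeric label using only the Python standard library. The column names (`AT`, `V`, `AP`, `RH` as features, `PE` as the power output label) are an assumption not stated in the text above and should be checked against the UCI page; the two rows are made-up placeholder values, not real readings; and `split_features_and_label` is a hypothetical helper.

```python
import csv
import io

# Assumed column names (verify against the UCI data set description).
FEATURE_COLUMNS = ["AT", "V", "AP", "RH"]
LABEL_COLUMN = "PE"

# Made-up placeholder rows, NOT real readings from the data set.
SAMPLE_CSV = """AT,V,AP,RH,PE
10.0,40.0,1010.0,80.0,480.0
25.0,60.0,1012.0,60.0,445.0
"""


def split_features_and_label(csv_text):
    """Parse CSV text into (features, labels) for a regression task."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Each feature vector holds the sensor readings for one row.
        features.append([float(row[name]) for name in FEATURE_COLUMNS])
        # The label is the continuous value we want to predict.
        labels.append(float(row[LABEL_COLUMN]))
    return features, labels


features, labels = split_features_and_label(SAMPLE_CSV)
```

In the notebook itself the same split would be expressed on a Spark DataFrame rather than on Python lists; the plain-Python version is used here only to keep the sketch self-contained.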
More information about Peaker or Peaking Power Plants can be found on Wikipedia [https://en.wikipedia.org/wiki/Peaking_power_plant](https://en.wikipedia.org/wiki/Peaking_power_plant).
ScaDaMaLe Course site and book