ScaDaMaLe Course site and book

Introduction to Machine Learning

Some very useful resources that we will draw on for Statistical Learning, Data Mining, and Machine Learning:

In January 2014, Stanford University professors Trevor Hastie and Rob Tibshirani (authors of the legendary Elements of Statistical Learning textbook) taught an online course based on their newest textbook, An Introduction to Statistical Learning with Applications in R (ISLR).
https://www.dataschool.io/15-hours-of-expert-machine-learning-videos/
free PDF of the ISLR book: http://www-bcf.usc.edu/~gareth/ISL/
A more theoretically sound book with interesting applications is Elements of Statistical Learning by the Stanford Gang of 3 (Hastie, Tibshirani and Friedman):
- free PDF of the 10th printing: http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
- Solutions: http://waxworksmath.com/Authors/GM/Hastie/WriteUp/weatherwaxepsteinhastiesolutions_manual.pdf
A great series of books on Probabilistic ML by Kevin P. Murphy: https://probml.github.io/pml-book/

Deep Learning is currently (2022) a popular method in Machine Learning.

Note: We will focus on intuition here and on the distributed ML Pipeline in action, as most of you already have some exposure to ML concepts and methods.

Summary of Machine Learning at a High Level

A rough definition of machine learning:
- constructing and studying algorithms that learn from and make predictions on data.
This broad area involves tools and ideas from various domains, including:
- computer science,
- probability and statistics,
- optimization,
- linear algebra,
- logic,
- etc.
Common examples of ML include:
- facial recognition,
- link prediction,
- text or document classification, e.g.:
  - spam detection,
- protein structure prediction,
- teaching computers to play games (Go!)

Some common terminology

We use spam detection as a running example.

The data points we learn from are called observations:
- they are items or entities used for:
  - learning or
  - evaluation.
In the context of spam detection,
- emails are our observations.
Features are attributes used to represent an observation.
- Features are typically numeric.
- In spam detection, they can be:
  - the length,
  - the date, or
  - the presence or absence of keywords in emails.
Labels are values or categories assigned to observations.
- In spam detection, the label is:
  - whether an email is defined as spam or not spam.
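To make the terminology concrete, here is a minimal sketch of how one email (an observation) could be turned into numeric features with a label attached. The `Email` class, the `KEYWORDS` list, and the toy emails are all hypothetical and made up for illustration; encoding the date as a day of the week is just one possible choice.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Email:
    """One observation: a raw email (hypothetical structure for illustration)."""
    text: str
    sent_on: date


# Hypothetical keyword list; a real spam filter would learn or curate this.
KEYWORDS = ["free", "winner", "urgent"]


def extract_features(email: Email) -> List[float]:
    """Map one observation to a numeric feature vector."""
    words = email.text.lower().split()
    length = float(len(email.text))            # the length, in characters
    weekday = float(email.sent_on.weekday())   # the date, encoded as day of week (0 = Monday)
    # presence (1.0) or absence (0.0) of each keyword
    keyword_flags = [1.0 if k in words else 0.0 for k in KEYWORDS]
    return [length, weekday] + keyword_flags


# Toy labelled observations (made-up data): label 1 = spam, 0 = not spam.
observations = [
    (Email("You are a winner claim your free prize", date(2022, 1, 3)), 1),
    (Email("Meeting moved to Thursday at ten", date(2022, 1, 4)), 0),
]

features = [extract_features(email) for email, _ in observations]
labels = [label for _, label in observations]
```

Each entry of `features` is one observation's feature vector, and the entry of `labels` at the same position is its label.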
Training and test data sets are the observations that we use to train and evaluate a learning algorithm.
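As a rough illustration of the training/test distinction, the sketch below shuffles a small list of labelled observations and holds out a fraction for evaluation. The function name `train_test_split`, its defaults, and the toy data are assumptions made for this example, not part of any particular library; Spark DataFrames offer `randomSplit` for a similar purpose.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(
    data: Sequence[T], test_fraction: float = 0.25, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the observations and hold out a fraction for evaluation."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


# Toy labelled observations (made-up): (email text, label) with 1 = spam.
labelled_emails = [(f"email {i}", i % 2) for i in range(8)]
train, test = train_test_split(labelled_emails, test_fraction=0.25)
```

The learning algorithm is fitted on `train` only; `test` is kept aside so that the evaluation reflects performance on observations the algorithm has not seen.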
Pop-Quiz
- What is the difference between supervised and unsupervised learning?

If you are interested, watch this video later (12:12) for a Stats@Stanford Hastie-Tibshirani perspective on supervised and unsupervised learning, from https://www.dataschool.io/15-hours-of-expert-machine-learning-videos/ (an effective way to brush up on ML if you are rusty).

ML Pipelines

Expected Reading

Here we will use ML Pipelines to do machine learning at scale.

See https://spark.apache.org/docs/latest/ml-pipeline.html for a quick overview (about 10 minutes of reading).

Read this section for an overview:

https://learning.oreilly.com/library/view/spark-the-definitive/9781491912201/ch24.html#high-level-mllib-concepts
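The readings above describe pipelines in terms of transformers (stages that map a dataset to a new dataset) and estimators (stages that are fitted on data to produce a transformer). The following is a plain-Python toy sketch of that idea only. It does not use Spark, and the classes `Tokenizer`, `KeywordEstimator`, `KeywordModel`, `Pipeline`, and `PipelineModel` are simplified stand-ins written for this example, with rows modelled as dictionaries rather than DataFrame rows.

```python
from typing import Dict, List

Row = Dict[str, object]  # toy stand-in for a DataFrame row


class Tokenizer:
    """Transformer: adds a 'words' column computed from 'text'; needs no fitting."""

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, "words": str(r["text"]).lower().split()} for r in rows]


class KeywordModel:
    """Transformer produced by fitting KeywordEstimator."""

    def __init__(self, spam_words: set):
        self.spam_words = spam_words

    def transform(self, rows: List[Row]) -> List[Row]:
        return [
            {**r, "prediction": int(any(w in self.spam_words for w in r["words"]))}
            for r in rows
        ]


class KeywordEstimator:
    """Estimator: learns the words that occur only in spam training rows."""

    def fit(self, rows: List[Row]) -> KeywordModel:
        spam, ham = set(), set()
        for r in rows:
            (spam if r["label"] == 1 else ham).update(r["words"])
        return KeywordModel(spam - ham)


class PipelineModel:
    """A fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages: list):
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains stages; fitting turns each estimator into a transformer."""

    def __init__(self, stages: list):
        self.stages = stages

    def fit(self, rows: List[Row]) -> PipelineModel:
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator -> fitted model
            rows = stage.transform(rows)  # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


# Made-up training data: label 1 = spam, 0 = not spam.
train = [
    {"text": "Free prize winner", "label": 1},
    {"text": "Lunch at noon", "label": 0},
]
model = Pipeline([Tokenizer(), KeywordEstimator()]).fit(train)
predictions = model.transform(
    [{"text": "claim your free prize"}, {"text": "lunch tomorrow"}]
)
```

The point of the design is that the same sequence of stages is applied at training time and at prediction time, so feature extraction cannot drift between the two.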
