ScaDaMaLe Course site and book

Introduction to Machine Learning

Some very useful resources we will draw on for Statistical Learning, Data Mining, and Machine Learning:

Deep Learning is currently (2022) a popular method in Machine Learning.

Note: We will focus on intuition here, and on the distributed ML Pipeline in action, as most of you already have some exposure to ML concepts and methods.

Summary of Machine Learning at a High Level

  • A rough definition of machine learning.
    • constructing and studying algorithms that learn from and make predictions on data.
  • This broad area involves tools and ideas from various domains, including:
    • computer science,
    • probability and statistics,
    • optimization,
    • linear algebra,
    • logic,
    • etc.
  • Common examples of ML include:
    • facial recognition,
    • link prediction,
    • text or document classification, e.g.:
      • spam detection,
    • protein structure prediction,
    • teaching computers to play games (Go!).
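
To make the rough definition above concrete, here is a minimal sketch of an algorithm that learns from data and then makes predictions on new data. It fits a straight line by ordinary least squares; the function names `fit_line` and `predict` and the toy numbers are assumptions made up for illustration, not something taken from the course material.

```python
from statistics import mean


def fit_line(xs, ys):
    """Learn a slope and intercept from data by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Use the learned model to predict y for a new x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical toy data, roughly following y = 2x + 1 with some noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

model = fit_line(xs, ys)       # "learn from data"
print(predict(model, 5.0))     # "make a prediction on new data"
```

The same two-step shape (fit on data, then predict) carries over to far more elaborate methods; only the model and the fitting procedure change.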

Some common terminology

We use spam detection as a running example.
  • the data points we learn from are called observations:

    • they are items or entities used for:
      • learning or
      • evaluation.
  • in the context of spam detection,

    • emails are our observations.
    • Features are attributes used to represent an observation.
    • Features are typically numeric,
      • and in spam detection, they can be:
      • the length,
      • the date, or
      • the presence or absence of keywords in emails.
    • Labels are values or categories assigned to observations.
      • and in spam detection, the label records:
      • whether an email is spam or not spam.
  • Training and test data sets are the observations that we use to train and evaluate a learning algorithm.

  • Pop-Quiz

    • What is the difference between supervised and unsupervised learning?

If you are interested, watch this later (12:12) for a Stats@Stanford Hastie-Tibshirani Perspective on Supervised and Unsupervised Learning from https://www.dataschool.io/15-hours-of-expert-machine-learning-videos/ - an effective way to brush up on ML if you are rusty.
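
The terminology above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the keyword list, the example emails, and the helper names `extract_features` and `train_test_split` are all hypothetical, and the date feature is left out to keep the sketch short.

```python
import random

# Hypothetical keywords whose presence or absence we use as features.
KEYWORDS = ("free", "winner", "prize")


def extract_features(email_text):
    """Represent one observation (an email) as a numeric feature vector:
    [length in characters, one 0/1 flag per keyword]."""
    words = email_text.lower().split()
    length = float(len(email_text))
    keyword_flags = [1.0 if k in words else 0.0 for k in KEYWORDS]
    return [length] + keyword_flags


def train_test_split(observations, test_fraction=0.25, seed=0):
    """Shuffle the observations and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = observations[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical observations: (email text, label), with 1 = spam, 0 = not spam.
emails = [
    ("You are a winner claim your free prize", 1),
    ("Meeting moved to 3pm tomorrow", 0),
    ("Free prize inside click now", 1),
    ("Lecture notes for week two attached", 0),
    ("Winner winner claim your reward", 1),
    ("Can you review my pull request", 0),
    ("Get your free trial today", 1),
    ("Lunch on Friday", 0),
]

# Each observation becomes a (features, label) pair.
dataset = [(extract_features(text), label) for text, label in emails]
train, test = train_test_split(dataset)
print(len(train), "training observations;", len(test), "test observations")
```

A learning algorithm would be fitted on `train` only, and `test` would be held back to evaluate how well it predicts labels for emails it has not seen.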

ML Pipelines

Expected Reading

Here we will use ML Pipelines to do machine learning at scale.

See https://spark.apache.org/docs/latest/ml-pipeline.html for a quick overview (about 10 minutes of reading).

Read this section for an overview: