// Databricks notebook source exported at Sun, 19 Jun 2016 08:36:55 UTC

Scalable Data Science

prepared by Raazesh Sainudiin and Sivanand Sivaram


The html source url of this databricks notebook and its recorded Uji (Dogen's Time-Being):

sds/uji/week4/06_MLIntro/011_IntroToML

Some very useful resources we will weave around for Statistical Learning, Data Mining, Machine Learning:

Note: This is an applied course in data science, and we will quickly move to doing things with data. For a deeper mathematical understanding, you will have to do the work yourself or take other courses. We will focus on intuition here and on the distributed ML Pipeline in action.

In the future, if there is enough interest, I may consider teaching a theoretical course in statistical learning for those with a background in:

  • Real Analysis,
  • Geometry,
  • Combinatorics and
  • Probability.

Such a course could be an expanded version of the following notes built on the classic works of Luc Devroye and the L1-School of Statistical Learning:

Machine Learning Introduction

ML Intro high-level by Ameet Talwalkar in BerkeleyX: CS190.1x Scalable Machine Learning

(watch now 4:14):

ML Intro high-level by Ameet Talwalkar in BerkeleyX: CS190.1x Scalable Machine Learning

Ameet’s Summary of Machine Learning at a High Level

  • A rough definition of machine learning:
    • constructing and studying methods that learn from and make predictions on data.
  • This broad area involves tools and ideas from various domains, including:
    • computer science,
    • probability and statistics,
    • optimization, and
    • linear algebra.
  • Common examples of ML include:
    • face recognition,
    • link prediction,
    • text or document classification, e.g.:
      • spam detection,
    • protein structure prediction, and
    • teaching computers to play games (Go!).
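
To make "methods that learn from and make predictions on data" concrete, here is a minimal sketch in plain Python. It is not taken from the lecture: the choice of a one-dimensional least-squares line, the function names `fit_line` and `predict`, and the toy data are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Learn from data: closed-form least-squares fit of y ≈ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Make a prediction for a new input using the learned parameters."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical toy data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

model = fit_line(xs, ys)          # the "learning" step
prediction = predict(model, 4.0)  # the "prediction" step on an unseen input
print(model, prediction)
```

The fit is an optimization problem (minimizing squared error) solved here in closed form, which hints at why optimization and linear algebra appear in the list of domains above.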

Some common terminology

Using spam detection as a running example:
  • The data points we learn from are called observations:
    • they are items or entities used for:
      • learning or
      • evaluation.
    • In the context of spam detection, emails are our observations.
  • Features are attributes used to represent an observation.
    • Features are typically numeric.
    • In spam detection, they can be:
      • the length,
      • the date, or
      • the presence or absence of keywords in emails.
  • Labels are values or categories assigned to observations.
    • In spam detection, the label is whether an email is spam or not spam.
  • Training and test data sets are the observations that we use to train and evaluate a learning algorithm. …
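
The terminology above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy emails, the keyword list `KEYWORDS`, and the helper names `featurize` and `train_test_split` are hypothetical and do not come from the lecture.

```python
import random

# Observations: hypothetical toy emails, each paired with a label
# (1 = spam, 0 = not spam).
EMAILS = [
    ("Win a free prize now", 1),
    ("Meeting moved to 3pm", 0),
    ("Free money, click here", 1),
    ("Lecture notes for week 4 attached", 0),
    ("Claim your free gift card", 1),
    ("Can you review my pull request?", 0),
]

# Hypothetical keywords, chosen for illustration only.
KEYWORDS = ["free", "prize", "click"]


def featurize(text):
    """Represent one observation as numeric features:
    [length in characters, then 1/0 for presence/absence of each keyword]."""
    lowered = text.lower()
    return [len(text)] + [1 if kw in lowered else 0 for kw in KEYWORDS]


def train_test_split(observations, n_test=2, seed=0):
    """Shuffle a copy of the observations and hold out n_test of them for evaluation."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Each item is (feature vector, label).
featurized = [(featurize(text), label) for text, label in EMAILS]
train, test = train_test_split(featurized)

print(featurize("Win a free prize now"))
print(len(train), "training observations,", len(test), "test observations")
```

A learning algorithm would be fit on `train` only, and `test` would be kept aside to evaluate how well it predicts labels for emails it has not seen.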

  • Pop-Quiz
    • What is the difference between supervised and unsupervised learning?

For a Stats@Stanford Hastie-Tibshirani Perspective on Supervised and Unsupervised Learning:

(watch later 12:12):

Supervised and Unsupervised Learning (12:12)

Typical Supervised Learning Pipeline by Ameet Talwalkar in BerkeleyX: CS190.1x Scalable Machine Learning

(watch now 2:07):

Typical Supervised Learning Pipeline by Ameet Talwalkar in BerkeleyX: CS190.1x Scalable Machine Learning

Take your own notes if you want ….

Sample Classification Pipeline (Spam Filter) by Ameet Talwalkar in BerkeleyX: CS190.1x Scalable Machine Learning

(watch later 7:48):

Sample Classification Pipeline (Spam Filter) by Ameet Talwalkar in BerkeleyX: CS190.1x Scalable Machine Learning
