00. Introduction

Inference Theory 1

©2018 Raazesh Sainudiin. Attribution 4.0 International (CC BY 4.0)

  1. Introduction
  2. What is SageMath and why are we using it?
  3. Interaction - learning/teaching style
  4. What can you expect to get out of this course?

Introduction

See Inference Theory 1.

What is SageMath and why are we using it?

We will be using Sage or SageMath for our hands-on work in this course. Sage is a free open-source mathematics software system licensed under the GPL. Sage can be used to study mathematics and statistics, including algebra, calculus, elementary to very advanced number theory, cryptography, commutative algebra, group theory, combinatorics, graph theory, exact linear algebra, optimization, interactive data visualization, randomized or Monte Carlo algorithms, scientific and statistical computing and much more. It combines various software packages into an integrative learning, teaching and research experience that is well suited for novice as well as professional researchers.

Sage is a set of software libraries built on top of Python, a widely used general purpose programming language. Sage greatly enhances Python's already mathematically friendly nature. Python is one of the languages used at Google, the US National Aeronautics and Space Administration (NASA), the US Jet Propulsion Laboratory (JPL), Industrial Light and Magic, YouTube, and other leading entities in industry and public sectors. Scientists, engineers, and mathematicians often find it well suited for their work. For a more thorough rationale for Sage, see Why Sage? and Success Stories, Testimonials and News Articles. Jump start your motivation by taking a Sage Feature Tour right now!

Interaction - learning/teaching style

This is an interactive Jupyter notebook with a SageMath interpreter and interactive means...

Videos

We will embed relevant videos in the notebook, such as those from Khan Academy or open MOOCs from Google, Facebook, academia, etc.

LaTeX

We will formally present mathematical and statistical concepts in the Notebook using LaTeX as follows:

$$ \sum_{i=1}^5 i = 1+2+3+4+5=15, \qquad \prod_{i=3}^6 i = 3 \times 4 \times 5 \times 6 = 360 $$

$$ \binom{n}{k}:= \frac{n!}{k!(n-k)!}, \qquad \lim_{x \to \infty}\exp{(-x)} = 0 $$

$$ \{\alpha, \beta, \gamma, \delta, \epsilon, \zeta, \mu,\theta, \vartheta, \phi, \varphi, \omega, \sigma, \varsigma,\Gamma, \Delta, \Theta, \Phi, \Omega\}, \qquad \forall x \in X, \quad \exists y \leq \epsilon, \ldots $$
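
The arithmetic in the first two displayed lines can be checked numerically. The following is a minimal sketch in plain Python (3.8 or later, standard library only); the values n = 5 and k = 2 are illustrative choices that do not appear in the text above.

```python
import math

# Sum and product from the first displayed line
total = sum(range(1, 6))            # 1+2+3+4+5
product = math.prod(range(3, 7))    # 3*4*5*6

# Binomial coefficient computed from its factorial definition
# (n and k are illustrative values, not from the notes)
n, k = 5, 2
binom = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

# exp(-x) shrinks toward 0 as x grows, in line with the stated limit
tail = [math.exp(-x) for x in (1, 10, 100)]

print(total, product, binom, tail)
```

The factorial-based value can be compared against the built-in `math.comb(n, k)`, which computes the same quantity.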

Interactive Visualizations

We will use interactive visualisations to convey concepts when possible. See the Taylor approximation below for a given order.

In [1]:
var('x')
x0  = 0
f   = sin(x)*e^(-x)
p   = plot(f,-1,5, thickness=2)
dot = point((x0,f(x=x0)),pointsize=80,rgbcolor=(1,0,0))
@interact
def _(order=[1..12]):
    ft = f.taylor(x,x0,order)
    pt = plot(ft,-1, 5, color='green', thickness=2)
    # raw strings keep the LaTeX backslashes from being read as Python escapes
    pretty_print(html(r'$f(x)\;=\;%s$' % latex(f)))
    pretty_print(html(r'$\hat{f}(x;%s)\;=\;%s+\mathcal{O}(x^{%s})$'
                      % (x0, latex(ft), order+1)))
    show(dot + p + pt, ymin = -.5, ymax = 1, figsize=[6,3])
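
To see what the interactive cell computes without the plotting machinery, here is a minimal sketch in plain Python of the same Taylor polynomial around x0 = 0. It is not how Sage's `taylor` method works internally; it relies on the identity sin(x)·exp(-x) = Im(exp((-1+i)x)), so the coefficient of x^n is Im((-1+i)^n)/n!. The helper names `taylor_coefficients` and `taylor_eval` are hypothetical, introduced only for this illustration.

```python
import math

def f(x):
    """The function plotted in the cell above."""
    return math.sin(x) * math.exp(-x)

def taylor_coefficients(order):
    """Coefficients c_0..c_order of sin(x)*exp(-x) expanded around 0.

    Uses sin(x)*exp(-x) = Im(exp((-1+i)x)), hence c_n = Im((-1+i)^n) / n!.
    """
    z = complex(-1, 1)
    return [(z ** n).imag / math.factorial(n) for n in range(order + 1)]

def taylor_eval(x, order):
    """Evaluate the Taylor polynomial of the given order at x."""
    return sum(c * x ** n for n, c in enumerate(taylor_coefficients(order)))

# The approximation error at a fixed point shrinks as the order grows
for order in (1, 3, 6, 12):
    print(order, abs(f(1.0) - taylor_eval(1.0, order)))
```

The first few coefficients give x - x^2 + x^3/3, which is what multiplying the series of sin(x) and exp(-x) by hand produces.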

Lab-Lecture Style of Teaching-Learning

We will write computer programs within code cells in the Notebook right after we learn the mathematical and statistical concepts.

Thus, there is a significant overlap between traditional lectures and labs in this course -- in fact these interactions are lab-lectures.

Live Data Explorations and Modeling

Let us visualize the CO2 data, fetched from US NOAA, and do a simple linear regression.

In [2]:
# Author: Marshall Hampton 
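try:
    import urllib2 as U            # Python 2 (older SageMath)
except ImportError:
    import urllib.request as U     # Python 3 (newer SageMath)
import scipy.stats as Stat
from IPython.display import HTML
co2data = U.urlopen(\
          'ftp://ftp.cmdl.noaa.gov/ccg/co2/trends/co2_mm_mlo.txt'\
                   ).readlines()
datalines = []
cdate = ''  # stays empty if the file has no 'Creation:' header line
for a_line in co2data:
    if not isinstance(a_line, str):
        a_line = a_line.decode('utf-8')  # Python 3 returns bytes
    if a_line.find('Creation:') != -1:
        cdate = a_line
    # keep only data rows; header and comment lines start with '#'
    if a_line[0] != '#':
        temp = a_line.replace('\n','').split(' ')
        temp = [float(q) for q in temp if q != '']
        datalines.append(temp)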
trdf = RealField(16)
@interact
def mauna_loa_co2(start_date = slider(1958,2010,1,1958), \
                  end_date = slider(1958, 2010,1,2009)):
    htmls1 = '<h3>CO2 monthly averages at Mauna Loa (interpolated),\
    from NOAA/ESRL data</h3>'
    htmls2 = '<h4>'+cdate+'</h4>'
    sel_data = [[q[2],q[4]] for q in datalines if start_date < \
                q[2] < end_date]
    c_max = max([q[1] for q in sel_data])
    c_min = min([q[1] for q in sel_data])
    slope, intercept, r, ttprob, stderr = Stat.linregress(sel_data)
    pretty_print(html(htmls1+htmls2+'<h4>Linear regression slope: '\
                      + str(trdf(slope))+ \
                      ' ppm/year; correlation coefficient: ' +\
                      str(trdf(r)) + '</h4>'))
    var('x,y')
    show(list_plot(sel_data, plotjoined=True, rgbcolor=(1,0,0)) 
                   + plot(slope*x+intercept,start_date,end_date), 
                      xmin = start_date, ymin = c_min-2, axes = True, \
                      xmax = end_date, ymax = c_max+3, \
                      frame = False, figsize=[8,3])
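
The cell above delegates the fit to `Stat.linregress`, which returns the slope, intercept and correlation coefficient among other values. As a rough sketch of what an ordinary least-squares fit computes, the following plain Python version works from the textbook formulas. The function name `linear_regression` is hypothetical, and the five data points are made up for illustration; they are not NOAA measurements.

```python
import math

def linear_regression(xs, ys):
    """Return (slope, intercept, r) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # centred sums of squares and cross-products
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)   # Pearson correlation coefficient
    return slope, intercept, r

# Made-up illustrative points (NOT the NOAA CO2 data)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
slope, intercept, r = linear_regression(xs, ys)
print(slope, intercept, r)
```

In the CO2 cell the x values are decimal years and the y values are concentrations in ppm, so the slope there is read in ppm/year.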

We will use publicly available resources generously!

In [3]:
def showURL(url, ht=500):
    """Return an IFrame of the url to show in notebook \
       with height ht"""
    from IPython.display import IFrame
    return IFrame(url, width='95%', height=ht) 
showURL('https://en.wikipedia.org/wiki/Number',400)

What can you expect to get out of this course?

While teaching SDS-2.2: Scalable Data Science from Atlantis, a fast-paced industrially aligned course in data science for research students at Uppsala University from various Departments in the Technical and Natural Sciences, I realized that the students varied significantly in their mathematical, statistical and computational backgrounds.

Most of the students of that course were able to learn and apply the methods and interpret the outputs of the models and methods on datasets. However, only those with a background in probability and statistics as well as computing were able to understand the models well enough to adapt them to the problem and dataset at hand - a crucial distinguishing skillset of a data scientist.

This course is nearly reverse-engineered from my experience in SDS-2.2 with the goal of making the mathematical, statistical and computational foundations reasonably strong for a data scientist who is fairly rusty on these interweaving foundations.

As summarised in the next section on Course Structure, this course is being designed to help you take your mathematical steps in the inferential direction in a computationally sound manner.

What is Data Science?

We will steer clear of academic/philosophical discussions on "what is data science?" and focus instead on the core skillset in mathematics, statistics and computing that is expected in a typical data science job today.

In [4]:
showURL("https://en.wikipedia.org/wiki/Data_science")

Course Structure

I would like to customize the course for you! So I would prefer to decide the content week by week, dynamically, based on interactions and feedback.

However, if you want some idea of the structure of the course and want to complete some assigned exercises, then take a look at Chapters 1-9, 11-14, 17-18 in CSEBook.pdf, one of my books in progress:

A Global Background and Context:

This is a mathematically more mature inference-theoretic variant of UC Berkeley's popular freshman course in data science, http://data8.org/, with the formula:

This course is aimed at covering the Syllabus of 1MS035: Inferensteori I for second-year undergraduate students of mathematics at Uppsala University, Uppsala, Sweden.

Scribed Black-Board Notes

One of your classmates has kindly agreed to allow me to make his hand-scribed notes available for the convenience of others in the class at the following link:

Summary

Thus, this course is being designed to help you take your mathematical steps in the inferential direction in a computationally sound manner.
