{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# 00. Introduction \n",
"## [Mathematical Statistical and Computational Foundations for Data Scientists](https://lamastex.github.io/scalable-data-science/360-in-525/2018/04/)\n",
"\n",
"©2018 Raazesh Sainudiin. [Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)\n",
"\n",
"1. Introduction\n",
"2. What is SageMath and why are we using it?\n",
"* Interaction - learning/teaching style\n",
"* What can you expect to get out of this course? \n",
"\n",
"\n",
"## Introduction\n",
"\n",
"This is a course over three full-day workshops (3 hp) on May 11, 18 and 25 2018. \n",
"\n",
"*Prerequisites:* current proficiency in high-school level mathematics (pre-calculus, geometry and algebra with some programming experience beyond Excel). \n",
"\n",
"*Target Audience:* any MSc or PhD student at UU who wants to understand the mathematical statistical foundations in the data scientistâ€™s computational toolbox. The approach will use formal mathematical communication of concepts starting from sets and logic, but with concomitant development of computer programming skills to algorithmically construct and implement the concepts. Topics will include: Sets, Maps, Functions, Modular Arithmetic, Axiomatic Probability, Conditional probability, Pseudo-random constructive understanding of random variables and structures including graphs, Statistics, Likelihood Principle, Bayes Rule, Decisions (parametric and non-parametric) including tests and estimators, Markov chains and their pseudorandom constructions, etc. \n",
"\n",
"This course is designed to help you take the first steps along the mathematical-statistical-computational path of being a data scientist.\n",
"\n",
"## What is SageMath and why are we using it?\n",
"\n",
"We will be using Sage or [SageMath](http://www.sagemath.org/) for our *hands-on* work in this course. Sage is a free open-source mathematics software system licensed under the GPL. Sage can be used to study mathematics and statistics, including algebra, calculus, elementary to very advanced number theory, cryptography, commutative algebra, group theory, combinatorics, graph theory, exact linear algebra, optimization, interactive data visualization, randomized or Monte Carlo algorithms, scientific and statistical computing and much more. It combines various software packages into an integrative learning, teaching and research experience that is well suited for novice as well as professional researchers.\n",
" \n",
"Sage is a set of software libraries built on top of [Python](http://www.python.org/), a widely used general purpose programming language. Sage greatly enhance Python's already mathematically friendly nature. It is one of the languages used at Google, US National Aeronautic and Space Administration (NASA), US Jet Propulsion Laboratory (JPL), Industrial Light and Magic, YouTube, and other leading entities in industry and public sectors. Scientists, engineers, and mathematicians often find it well suited for their work. Obtain a more thorough rationale for Sage from Why Sage? and Success Stories, Testimonials and News Articles. Jump start your motivation by taking a Sage Feature Tour right now!\n",
"\n",
"## Interaction - learning/teaching style\n",
"\n",
"This is an interactive jupyter notebook with SageMath interpreter and interactive means...\n",
"\n",
"#### Videos\n",
"We will embed relevant videos in the notebook, such as those from [The Khan Academy](http://www.khanacademy.org/) or open MOOCs from google, facebook, academia, etc.\n",
"\n",
"* [watch Google's Hal Varian's 'statistics is the dream' job speech](https://www.youtube.com/embed/D4FQsYTbLoI)\n",
"* [watch UC Berkeley Professor Michael Jordan's speech on 'The Data Science Revolution'](https://youtu.be/ggq7HiDO0OU)\n",
"\n",
"#### Latex\n",
"We will *formally present mathematical and statistical concepts* in the Notebook using Latex as follows:\n",
"\n",
"$$ \\sum_{i=1}^5 i = 1+2+3+4+5=15, \\qquad \\prod_{i=3}^6 i = 3 \\times 4 \\times 5 \\times 6 = 360 $$\n",
"\n",
"$$ \\binom{n}{k}:= \\frac{n!}{k!(n-k)!}, \\qquad \\lim_{x \\to \\infty}\\exp{(-x)} = 0 $$\n",
"\n",
"$$ \\{\\alpha, \\beta, \\gamma, \\delta, \\epsilon, \\zeta, \\mu,\\theta, \\vartheta, \\phi, \\varphi, \\omega, \\sigma, \\varsigma,\\Gamma, \\Delta, \\Theta, \\Phi, \\Omega\\}, \\qquad \\forall x \\in X, \\quad \\exists y \\leq \\epsilon, \\ldots $$\n",
"\n",
"#### Interactive Visualizations\n",
"We will use interactive visualisations to convey concepts when possible. See the Taylor approximation below for a given order."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "7d2b4ff175b047a6a7f2f4e5d928efef"
}
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"var('x')\n",
"x0 = 0\n",
"f = sin(x)*e^(-x)\n",
"p = plot(f,-1,5, thickness=2)\n",
"dot = point((x0,f(x=x0)),pointsize=80,rgbcolor=(1,0,0))\n",
"@interact\n",
"def _(order=[1..12]):\n",
" ft = f.taylor(x,x0,order)\n",
" pt = plot(ft,-1, 5, color='green', thickness=2)\n",
" pretty_print(html('$f(x)\\;=\\;%s$'%latex(f)))\n",
" pretty_print(html('$\\hat{f}(x;%s)\\;=\\;%s+\\mathcal{O}(x^{%s})$'%(x0,latex(ft),order+1)))\n",
" show(dot + p + pt, ymin = -.5, ymax = 1, figsize=[6,3])"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"#### Lab-Lecture Style of Teaching-Learning\n",
"\n",
"We will *write computer programs* within code cells in the Notebook right after we learn the mathematical and statistical concepts. \n",
"\n",
"Thus, there is a significant overlap between traditional lectures and labs in this course -- in fact these interactions are *lab-lectures*.\n",
"\n",
"#### Live Data Explorations and Modeling\n",
"Let us visualize the CO2 data, fetched from US NOAA, and do a simple linear regression."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0b8f2bc6fe244ce5a1f91b024d5a02ae"
}
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Author: Marshall Hampton \n",
"import urllib2 as U\n",
"import scipy.stats as Stat\n",
"from IPython.display import HTML\n",
"co2data = U.urlopen('ftp://ftp.cmdl.noaa.gov/ccg/co2/trends/co2_mm_mlo.txt').readlines()\n",
"datalines = []\n",
"for a_line in co2data:\n",
" if a_line.find('Creation:') != -1:\n",
" cdate = a_line\n",
" if a_line[0] != '#':\n",
" temp = a_line.replace('\\n','').split(' ')\n",
" temp = [float(q) for q in temp if q != '']\n",
" datalines.append(temp)\n",
"trdf = RealField(16)\n",
"@interact\n",
"def mauna_loa_co2(start_date = slider(1958,2010,1,1958), end_date = slider(1958, 2010,1,2009)):\n",
" htmls1 = '### CO2 monthly averages at Mauna Loa (interpolated), from NOAA/ESRL data

'\n",
" htmls2 = '#### '+cdate+'

'\n",
" sel_data = [[q[2],q[4]] for q in datalines if start_date < q[2] < end_date]\n",
" c_max = max([q[1] for q in sel_data])\n",
" c_min = min([q[1] for q in sel_data])\n",
" slope, intercept, r, ttprob, stderr = Stat.linregress(sel_data)\n",
" pretty_print(html(htmls1+htmls2+'#### Linear regression slope: ' + str(trdf(slope))\n",
" + ' ppm/year; correlation coefficient: ' + str(trdf(r)) + '

'))\n",
" var('x,y')\n",
" show(list_plot(sel_data, plotjoined=True, rgbcolor=(1,0,0)) \n",
" + plot(slope*x+intercept,start_date,end_date), \n",
" xmin = start_date, ymin = c_min-2, axes = True, xmax = end_date, ymax = c_max+3, \n",
" frame = False, figsize=[8,3])"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### We will use publicly available resources generously!"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"