Loading [MathJax]/jax/output/HTML-CSS/jax.js

12. Linear Regression

  • Regression
    • linear models and their least-squares estimators
    • assessing fit using diagnostic plots (residual analysis)
    • multiple linear regression - not covered in detail and won't be on exam
    • prediction; not covered - not covered and won't be on exam
    • prelude to statistical ML - not covered and won't be on exam
  • Introduction to R in SageMath Jupyter IPython Notebook - SageMath/R

Introduction

Regression is a method for studying the relationship between a response variable Y and a covariate X. The covariate is also called a feature or a predictor variable.

A simple way to summarise the relationship between X and Y is through the regression function r(x):

r(x)=E(Y|X=x)=yf(y|x)dy

Our objective is to estimate the regression function r(x) from data of the form:

(Y1,X1),(Y2,X2),,(Yn,Xn)IIDFX,Y

We assume that FX,Y, the joint distribution of X and Y, is parametric and r is linear.

Simple Linear Regression

The simple linear regression model is when Xi is real-valued (one-dimensional) and r(x) is assumed to be linear:

r(x)=β0+β1x,and V(Y|X=x)=σ2 is independent of x

Thus simple linear regression model is the following:

Yi=β0+β1Xi+ϵi, where, E(ϵi|Xi)=0 and V(ϵi|Xi)=σ2

The unknown parameters and their estimates in the model are:

  • the intercept β0 and its estimate ˆβ0,
  • the slope β1 and its estimate ˆβ1 and
  • the variance σ2 and its estimate ˆσ2

The fitted line is: ˆr(x)=ˆβ0+ˆβ1x

The fitted or predicted values are: ˆYi=ˆr(Xi)

The residuals are: ˆϵi=YiˆYi=Yi(ˆβ0+ˆβ1Xi)

The residual sum of squares or RSS, that measures how well the line fits the data, is defined by RSS=ni=1ˆϵ2i

The least squares estimates are the values ˆβ0 and ˆβ1 that minimise RSS and they are given by:

ˆβ1=ni=1(Xi¯Xn)(Yi¯Yn)ni=1(Xi¯Xn)2,ˆβ0=¯Ynˆβ1¯Xn,ˆσ2=(1n2)ni=1ˆϵ2i

Interactive Animations for Regression

Check out:

Least Squares and Maximum Likelihood

Suppose we add the assumption about the model's noise that

ϵi|XiNormal(0,σ2) i.e., Yi|XiNormal(μi,σ2), where μi=β0+β1Xi

Then, the likelihood function is:

ni=1f(Xi,Yi)=ni=1fX(Xi)fY|X(Yi|Xi)=ni=1fX(Xi)ni=1fY|X(Yi|Xi)=:Ln,XLn,Y|X

where, Ln,X:=ni=1fX(Xi) is the marginal likelihood of X1,,Xn that does not depend on the parameters (β0,β1,σ), and Ln,Y|X:=ni=1fY|X(Yi|Xi) is the conditional likelihood that does depend on the parameters. Therefore the likelihood function is given by the conditional likelihood:

L(β0,β1,σ)ni=1f(Xi,Yi)Ln,Y|X=ni=1fY|X(Yi|Xi)σnexp(12σ2ni=1(Yiμi)2)

and the conditional log-likelihood is:

l(β0,β1,σ)=nlog(σ)12σ2ni=1(Yiμi)2

To find the MLE of (β0,β1) we need to maximise (β0,β1,σ) for a given σ. From the above expresion it is clear that maximising the log-likelihood is equivalent to minimising the residual sum of squares or RSS given by

ni=1(Yiμi)2

Therefore, we have shown the following Theorem.

Theorem [MLE is LSE]

Under the assumption of normally distributed noise, the maximum likelihood estimator (MLE) is the least squares estimator (LSE).

We can maximise l(β0,β1,σ) over σ and obtain the MLE for σ as follows:

ˆσ2=1nni=1 ˆϵ2.

But it is more common in practise to use the unbiased estimator, with E(ˆσ2)=σ2, that we saw earlier for sample size n>2:

ˆσ2=(1n2)ni=1ˆϵ2i.

Properties of the Least Squares Estimator (LSE)

It's finally time to obtain the standard errors and limititng distribution of the least quares estimator (also the MLE).

In regression we are interested in the properties of the estimators conditional on the covariates

X1:n:=(X1,X2,,Xn)

Conditional Mean and Variance of LSE

Let ˆβT=(ˆβ0,ˆβ1)T denote the least squares estimators (which is also the MLE). Then

E(ˆβ|X1:n)=(β0β1)V(ˆβ|X1:n)=σ2ns2X(1nni=1X2i¯Xn¯Xn1)

where,

s2X=1nni=1(Xi¯Xn)2

Estimated Standard Errors

The estimated standard errors for ˆβ0 and ˆβ1, or more precisely, the estimated standard errors conditional on the covariates, are given by the square-root of the diagonal terms of the variance-covariance matrix V(ˆβ|X1:n) and substituting the estimate ˆσ for σ, as follows:

^se(ˆβ0):=^se(ˆβ0|X1:n)=ˆσsXnni=1X2in^se(ˆβ1):=^se(ˆβ0|X1:n)=ˆσsXn

Thus under appropriate modeling assumptions in simple leinear regression we have the following four properties.

Four Asymptotic Properties of the LSE

1. Asymptotic Consistency

As n, the LSE, i.e. ˆβ0 and ˆβ1, converges in probability to the parameters, i.e., β0,β1, generating the data (Y1,X1),(Y2,X2),,(Yn,Xn) as summarised below.

ˆβ0Pβ0 and ˆβ1Pβ1

2. Asymptotic Normality

As n, the LSE, i.e. ˆβ0 and ˆβ1, converges in distribution to the parameters, i.e., β0,β1, generating the data (Y1,X1),(Y2,X2),,(Yn,Xn) as summarised below.

ˆβ0β0^se(ˆβ0)dNormal(0,1) and ˆβ1β1^se(ˆβ1)dNormal(0,1)

3. Approximate 1α Confidence Interval

The 1α confidence interval for β0 and β1 that is obtained from the approximately normal distribution as n gets large is:

ˆβ0±zα/2^se(ˆβ0) and ˆβ1±zα/2^se(ˆβ1)

4. The Wald Test

Recall Wald test statistic for testing the null hypothesis with the null value β(0):

H0:β=β(0) versus H1:ββ(0)isW=(ˆββ(0))^se(ˆβ)

Thus the Wald test for testing H0:β1=0 versus H1:β10 is to reject H0 if |W|>zα/2 where W=ˆβ1^se(ˆβ1).

Implementing Simple Linear Regression from Scratch

Using the above formulas we can implement Python functions to calculate the least squares estimates, ˆβ0 and ˆβ1, that minimise RSS.

In [1]:
import numpy as np 
import matplotlib.pyplot as plt

def estimate_coefficients(x, y): 
    # size of the dataset  
    n = np.size(x) 
    # mean of x and y
    mean_x, mean_y = np.mean(x), np.mean(y) 
    # xy cross-deviation and xx deviation
    SS_xy = np.sum(y*x - n*mean_y*mean_x) 
    SS_xx = np.sum(x*x - n*mean_x*mean_x)  
    # calculating LSE of regression coefficients 
    b1_hat = SS_xy / SS_xx 
    b0_hat = mean_y - b1_hat*mean_x 
    sigma_hat2 = np.mean((y - (b0_hat + b1_hat * x))^2)
    if n>2:
        sigma_hat2 = sigma_hat2*n/(n-2)
    sigma_hat=np.sqrt(sigma_hat2)
    return(b0_hat, b1_hat, sigma_hat)

def standard_errors(x,y):
    n = np.size(x) 
    b0_hat,b1_hat,s_hat = estimate_coefficients(x,y)
    mean_x = np.mean(x)
    s2X = np.mean( (x-mean_x)^2 )
    se_b1 = s_hat/np.sqrt(s2X*n)
    se_b0 = se_b1*np.sqrt(np.mean(x^2))
    return (se_b0, se_b1)

def plot_regression_line(x, y, b): 
    # plotting the data points on a graph
    plt.scatter(x, y, color = "m",marker = "o", s = 10) 
    # predicted response vector 
    y_pred = b[0] + b[1]*x 
    # plotting the fitted regression line
    plt.plot(x, y_pred, color = "b")
    # putting generic labels for x and y axis
    plt.xlabel('x') 
    plt.ylabel('y') 
    # function to show plotted graph
    plt.show()

def SimpleLinearRegression(x,y): 
    # estimating coefficients 
    b = estimate_coefficients(x, y) 
    print("Estimated coefficients:\nb0_hat = {} \nb1_hat = {}\nsigma_hat = {}".format(b[0], b[1],b[2])) 
    # plotting fitted regression line  with data
    plot_regression_line(x, y, b)
In [2]:
# Datasets for x and y 
LSAT=np.array([576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594]) # LSAT data
GPA=np.array([3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 3.96]) # GPA data

SimpleLinearRegression(LSAT,GPA)
Estimated coefficients:
b0_hat = -0.000193932215061 
b1_hat = 0.00526687127757
sigma_hat = 0.287336087556

We can look at the residuals of the fitted line as follows.

In [31]:
predictedGPA = -0.000193932215061 + 0.00526687127757*LSAT
residuals = GPA - predictedGPA
plt.scatter(LSAT, residuals, color = "k",marker = "o", s = 10) 
plt.axhline()
# putting generic labels for x and y axis
plt.ylabel('$\epsilon_i$') 
plt.xlabel('LSAT') # draw a y=0 line
plt.show()
# in general we want residuals to be Normally distributes about 0 with the same variance

Residual Analysis

Looking at the residuals ϵi's in the above plot we can notice how just 4 of the 15 datapoints are abov 0. If ϵi were truly IID Normal(0,σ2), we would expect roughly the same number of points to be spread above and below zero, i.e., the x-axis, in an equally likely manner. Also, we would expect more points to be closer to zero and fewer points to be further away.

In conclusion, the residuals of the linear regression of LSAT and GPA do not look like they are normall distributed.

We could try different approaches to improve the model. For example, we could try to increase the sample size or standardise the scales by subtracting the sample mean and dividing by the the sample standard deviation for the x and y values separately and doing regression with the standardised data, etc.

The real wiki page has some simple examples of residual plots and they are useful for insights:

Examples of residual plots are shown in the following figure. (a) is a satisfactory plot with the residuals falling in a horizontal band with no systematic pattern. Such a plot indicates an appropriate regression model. (b) shows residuals falling in a funnel shape. Such a plot indicates increase in variance of residuals and the assumption of constant variance is violated here. Transformation on may be helpful in this case (see Transformations). If the residuals follow the pattern of (c) or (d), then this is an indication that the linear regression model is not adequate. Addition of higher order terms to the regression model or transformation on or may be required in such cases. A plot of residuals may also show a pattern as seen in (e), indicating that the residuals increase (or decrease) as the run order sequence or time progresses. This may be due to factors such as operator-learning or instrument-creep and should be investigated further.)

We can finally obtain 95% confidence intervals for the fitted parameters in the simple linear regression model and do a Wald test as follows.

In [4]:
b0_hat, b1_hat, s_hat = estimate_coefficients(LSAT,GPA)
se_b0,se_b1 = standard_errors(LSAT,GPA)
print "Estimated standard errors for beta_0_hat and beta_1_hat are:"
print se_b0,se_b1 
print "and the approximate 95% confidence intervals for beta_0_hat is:"
print "          [ ", b0_hat-2*se_b0," , ", b0_hat+2*se_b0, " ]"
print "and the approximate 95% confidence intervals for beta_1_hat is:"
print "          [ ", b1_hat-2*se_b1," , ", b1_hat+2*se_b1, " ]"
print "The Wald test for the null hypothesis H0 that beta_1 = 0 is:"
W = (b1_hat-0)/se_b1
if abs(W > 2):
    print "Reject H0 that beta_1=0 at alpha=0.05, since W = ",W
else:
    print "fail to reject H0 that beta_1=0 at alpha=0.05, since W = ",W
Estimated standard errors for beta_0_hat and beta_1_hat are:
1.1054305503607234 0.0018374136245919459
and the approximate 95% confidence intervals for beta_0_hat is:
          [  -2.2110550329365077  ,  2.2106671685063857  ]
and the approximate 95% confidence intervals for beta_1_hat is:
          [  0.0015920440283845134  ,  0.008941698526752296  ]
The Wald test for the null hypothesis H0 that beta_1 = 0 is:
Reject H0 that beta_1=0 at alpha=0.05, since W =  2.8664592485200906

Multiple Regression

This is just as simple, except we have more than one covariate

Now, let's suppose that the covariate, feature, predictor or dependent variable is a vector of length k. So our data for regression is of the following form:

(Y1,X1),(Y2,X2),,(Yi,Xi),,(Yn,Xn)

where, Xi is a vector of length k for the i-th observation or datapoint (Yi,Xi).

Xi=(Xi,1,Xi,2,,Xi,k).

Then the linear regression model is:

Yi=kj=0βjXi,j+ϵi, for i{1,2,,n}

where β0 is the intercept term with Xi,0=1 for each i{1,2,,n} and

E(ϵi|X1,i,X2,i,,Xk,i)=0.

We can denote the model using matrices and vectors more conveniently as follows:

Y=(Y1Y2Yn),X=(1X1,1X1,k1X2,1X2,k1Xn,1Xn,k),β=(β0β1βk),ϵ=(ϵ1ϵ2ϵn).

With XRn×(k+1), i.e., X being a n×(k+1) matrix, βR(k+1)×1, i.e., β being a a column vector with k+1 rows, and ϵRn×1, i.e., ϵ being a column vector with n rows, we obtain the multiple regression model:

Y=Xβ+ϵ.

Just as in the 1D case with k=1, the least sqaures estimate is as follows, under the assumption that XTX is invertible:

ˆβ=(XTX)1XTYV(ˆβ|X1:n)=σ2(XTX)1ˆβNormal(β,σ2(XTX)1).

The estimate of the regression function is:

ˆr(x)=kj=0ˆβjxj.

An unbiased estimate of σ2 is:

ˆσ2=(1nk)ni=1ˆϵ2i

where ˆϵ is the vector of residuals:

ˆϵ=XˆβY , i.e.,ˆϵ=(ˆϵ1ˆϵ2ˆϵn)=(1X1,1X1,k1X2,1X2,k1Xn,1Xn,k) (ˆβ0ˆβ1ˆβk) (Y1Y2Yn)

An approximate 1α confidence interval for βj is

ˆβj±zα/2^se(ˆβj)

where (^se(ˆβj))2 is the j-th diagonal entry of the matrix ˆσ2(XTX)1.

Solving Least Squares Using Numerical Linear Algebra Routine in scipy

We can use scipy.linalg.lstsq to get the least squares solution to our regression problems quite easily, including generalisation to multiple linear regression when the covariates are in more than 1 dimension.

Let us try to understand the code in the previous cell by learning how to do a least squares fit by setting up the right design matrix.

Example 1: Fitting a Line is Simple Linear Regression

In [5]:
from scipy.linalg import lstsq
import matplotlib.pyplot as plt
import numpy as np

# suppose we have the following data
x = np.array([1, 2.5, 3.5, 4, 5, 7, 8.5])
y = np.array([0.3, 1.1, 1.5, 2.0, 3.2, 6.6, 8.6])

#We want to fit a line of the form y = a + b*x to this data. We first form the 
#“design matrix” M, with a constant column of 1s and a column containing x
M1 = x[:, np.newaxis]^[0, 1]
M1
Out[5]:
array([[1. , 1. ],
       [1. , 2.5],
       [1. , 3.5],
       [1. , 4. ],
       [1. , 5. ],
       [1. , 7. ],
       [1. , 8.5]])
In [6]:
#We want to find the least-squares solution to 
#M1.dot(p) = y, where p is a vector with length 2 that holds the parameters a and b.
p, res, rnk, s = lstsq(M1, y)
p
Out[6]:
array([-1.93080357,  1.16875   ])
In [7]:
plt.plot(x, y, 'o', label='data')
xx = np.linspace(0, 9, 101)
yy = p[0] + p[1]*xx
plt.plot(xx, yy, label='least squares fit, $y = a + bx$')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.show()

Example 2: Fitting a Quadratic is also Simple Linear Regresssion

Suppose we want to fit a quadratic polynomial of the form y=a+bx2 to the same data. Then we first form the design matrix M2, with a constant column of 1s and a column containing x^2 as follows:

In [8]:
M2 = x[:, np.newaxis]^[0, 2]
M2
Out[8]:
array([[ 1.  ,  1.  ],
       [ 1.  ,  6.25],
       [ 1.  , 12.25],
       [ 1.  , 16.  ],
       [ 1.  , 25.  ],
       [ 1.  , 49.  ],
       [ 1.  , 72.25]])
In [9]:
# least square solution with M2
p, res, rnk, s = lstsq(M2, y)
plt.plot(x, y, 'o', label='data')
xx = np.linspace(0, 9, 101)
yy = p[0] + p[1]*xx^2
plt.plot(xx, yy, label='least squares fit, $y = a + bx$')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.show()

Example 3: Fitting a 3rd Order Polynomial is Multiple Linear Regresssion

Suppose we want to fit a degree-3 polynomial of the form y=β0+β1x+β2x2+β3x3 to the same data. Then we first form the design matrix M3, with a constant column of 1s with x^0 and three additional columns containing x^1, x^2 and x^3 as follows:

In [10]:
# Fitting a cubic polynolial is the same idea
M3 = x[:, np.newaxis]^[0, 1, 2, 3]
M3
Out[10]:
array([[  1.   ,   1.   ,   1.   ,   1.   ],
       [  1.   ,   2.5  ,   6.25 ,  15.625],
       [  1.   ,   3.5  ,  12.25 ,  42.875],
       [  1.   ,   4.   ,  16.   ,  64.   ],
       [  1.   ,   5.   ,  25.   , 125.   ],
       [  1.   ,   7.   ,  49.   , 343.   ],
       [  1.   ,   8.5  ,  72.25 , 614.125]])
In [11]:
p, res, rnk, s = lstsq(M3, y)
plt.plot(x, y, 'o', label='data')
xx = np.linspace(0, 9, 101)
yy = p[0] + p[1]*xx + p[2]*xx^2 + p[3]*xx^3
plt.plot(xx, yy, label='least squares fit, $y = a + bx$')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.show()

Sample Exam Problem 8

Using the lstsq method shown above, and data arrays x and y in the next cell that contain log light intensity and log surface temperature in a give range of measurements from nearby stars, compute the least squares estimates of β0 and β1 under the simple linear regression model with an intercept and a slope term. Make a plot similar to the one above with the data points and the fitted regression line.

In [ ]:
# Sample Exam Problem 8 
# do not change this import and data block ########################
from scipy.linalg import lstsq
import matplotlib.pyplot as plt
import numpy as np
logLightIntens_logSurfTemp=[(4.37,5.23),(4.56,5.74),
(4.26,4.93),(4.56,5.74),(4.30,5.19),(4.46,5.46),(3.84,4.65),(4.57,5.27),(4.26,5.57),(4.37,5.12),(3.49,5.73),
(4.43,5.45),(4.48,5.42),(4.01,4.05),(4.29,4.26),(4.42,4.58),(4.23,3.94),(4.42,4.18),(4.23,4.18),(3.49,5.89),
(4.29,4.38),(4.29,4.22),(4.42,4.42),(4.49,4.85),(4.38,5.02),(4.42,4.66),(4.29,4.66),(4.38,4.90),(4.22,4.39),
(3.48,6.05),(4.38,4.42),(4.56,5.10),(4.45,5.22),(3.49,6.29),(4.23,4.34),(4.62,5.62),(4.53,5.10),(4.45,5.22),
(4.53,5.18),(4.43,5.57),(4.38,4.62),(4.45,5.06),(4.50,5.34),(4.45,5.34),(4.55,5.54),(4.45,4.98),(4.42,4.50)]
CleanedlogLightIntens_logSurfTemp=\
np.array([yx for yx in logLightIntens_logSurfTemp if yx[1]<5.9 and yx[0]>4]) # data range constraint
x=CleanedlogLightIntens_logSurfTemp[:,1]
y=CleanedlogLightIntens_logSurfTemp[:,0]
########### end of import and data block ##########################

# Replace only ZZZ by the right values
M1 = ZZZ # design matrix M1
b, res, rnk, s = lstsq(ZZZ,ZZZ)
plt.plot(x, y, 'o', label='data')
xx = np.linspace(ZZZ, ZZZ, 101)
yy = ZZZ *xx
plt.plot(xx, yy, label='least squares fit')
plt.xlabel('log light intensity (X)')
plt.ylabel('log surface temperature (Y)')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.text(4, 4.7, r'$\widehat{r}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x, \quad \
\widehat{\beta}_0 = $ %(b0)0.3f , $\widehat{\beta}_1 = $ %(b1)0.3f' % {'b0': b[0], 'b1': b[1]} )
plt.show()
In [12]:
# Sample Exam Problem 8 Solution
logLightIntens_logSurfTemp=[(4.37,5.23),(4.56,5.74),
(4.26,4.93),(4.56,5.74),(4.30,5.19),(4.46,5.46),(3.84,4.65),(4.57,5.27),(4.26,5.57),(4.37,5.12),(3.49,5.73),
(4.43,5.45),(4.48,5.42),(4.01,4.05),(4.29,4.26),(4.42,4.58),(4.23,3.94),(4.42,4.18),(4.23,4.18),(3.49,5.89),
(4.29,4.38),(4.29,4.22),(4.42,4.42),(4.49,4.85),(4.38,5.02),(4.42,4.66),(4.29,4.66),(4.38,4.90),(4.22,4.39),
(3.48,6.05),(4.38,4.42),(4.56,5.10),(4.45,5.22),(3.49,6.29),(4.23,4.34),(4.62,5.62),(4.53,5.10),(4.45,5.22),
(4.53,5.18),(4.43,5.57),(4.38,4.62),(4.45,5.06),(4.50,5.34),(4.45,5.34),(4.55,5.54),(4.45,4.98),(4.42,4.50)]
CleanedlogLightIntens_logSurfTemp=\
np.array([yx for yx in logLightIntens_logSurfTemp if yx[1]<5.9 and yx[0]>4]) # data range constraint
x=CleanedlogLightIntens_logSurfTemp[:,1]
y=CleanedlogLightIntens_logSurfTemp[:,0]

from scipy.linalg import lstsq
import matplotlib.pyplot as plt
import numpy as np
M1 = x[:, np.newaxis]^[0, 1]
b, res, rnk, s = lstsq(M1, y)
plt.plot(x, y, 'o', label='data')
xx = np.linspace(3.9, 5.8, 101)
yy = b[0] + b[1]*xx
plt.plot(xx, yy, label='least squares fit')
plt.xlabel('log light intensity (X)')
plt.ylabel('log surface temperature (Y)')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.text(4, 4.7, r'$\widehat{r}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x, \quad \
\widehat{\beta}_0 = $ %(b0)0.3f , $\widehat{\beta}_1 = $ %(b1)0.3f' % {'b0': b[0], 'b1': b[1]} )
plt.show()

Assignment 3, PROBLEM 8

Maximum Points = 2

For the fitted regression model in the next cell get the residuals and plot them against the covariate [see Residual analysis section in latest 12.ipynb for the basic ideas conveyed in the last lecture]. How do the residuals compare to a Normal random variable centred at 0 with a constant variance (summarise in a sentence or two by double-clicking this cell and writing in between the two lines --- below)?



In [ ]:
logLightIntens_logSurfTemp=[(4.37,5.23),(4.56,5.74),
(4.26,4.93),(4.56,5.74),(4.30,5.19),(4.46,5.46),(3.84,4.65),(4.57,5.27),(4.26,5.57),(4.37,5.12),(3.49,5.73),
(4.43,5.45),(4.48,5.42),(4.01,4.05),(4.29,4.26),(4.42,4.58),(4.23,3.94),(4.42,4.18),(4.23,4.18),(3.49,5.89),
(4.29,4.38),(4.29,4.22),(4.42,4.42),(4.49,4.85),(4.38,5.02),(4.42,4.66),(4.29,4.66),(4.38,4.90),(4.22,4.39),
(3.48,6.05),(4.38,4.42),(4.56,5.10),(4.45,5.22),(3.49,6.29),(4.23,4.34),(4.62,5.62),(4.53,5.10),(4.45,5.22),
(4.53,5.18),(4.43,5.57),(4.38,4.62),(4.45,5.06),(4.50,5.34),(4.45,5.34),(4.55,5.54),(4.45,4.98),(4.42,4.50)]
CleanedlogLightIntens_logSurfTemp=\
np.array([yx for yx in logLightIntens_logSurfTemp if yx[1]<5.9 and yx[0]>4]) # data range constraint
x=CleanedlogLightIntens_logSurfTemp[:,1]
y=CleanedlogLightIntens_logSurfTemp[:,0]

from scipy.linalg import lstsq
import matplotlib.pyplot as plt
import numpy as np
M1 = x[:, np.newaxis]^[0, 1]
b, res, rnk, s = lstsq(M1, y)
plt.plot(x, y, 'o', label='data')
xx = np.linspace(3.9, 5.8, 101)
yy = b[0] + b[1]*xx
plt.plot(xx, yy, label='least squares fit')
plt.xlabel('log light intensity (X)')
plt.ylabel('log surface temperature (Y)')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.text(4, 4.7, r'$\widehat{r}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x, \quad \
\widehat{\beta}_0 = $ %(b0)0.3f , $\widehat{\beta}_1 = $ %(b1)0.3f' % {'b0': b[0], 'b1': b[1]} )
plt.show()

# Obtain the residuals and plot them (summarise in the markdown cell above)
XXX
XXX
XXX

Prediction

Let's consider the 1D setting for simplicity of notation. Suppose we have estimated a regression model: ˆr(x)=ˆβ0+ˆβ1x from data (X1,Y1),(X2,Y2),,(Xn,Yn).

Now suppose we observe the value X=x of the covariate of a new observarion but do not observe the response Y and want to predict it. An estimate of Y is

ˆY=ˆβ0+ˆβ1x.

By the formula for the variance of the sum of two random variables:

V(ˆY)=V(ˆβ0+ˆβ1x)=V(ˆβ0)+x2V(ˆβ1)+2xCov(ˆβ0,ˆβ1)

We have all the needed terms to compute V(ˆY) from the earlier result on the conditional variance of the least squares estimate:

V(ˆβ|X1:n)=σ2ns2X(1nni=1X2i¯Xn¯Xn1)

The estimated standard error ^se(ˆY) is just V(ˆY) with ˆσ2 substituted in for σ2. An approximate 1alpha confidence interval for Y is called an approximate 1α prediction interval for Y and is given by

ˆY±zα/2ˆξn, where ˆξ2n=ˆσ2(ni=1(XiX)2nni=1(Xi¯X)2+1)

Multiple Regression on 2018 Swedish Election Data

If you are interested, you already have the basic skills to look at the data from Swedish election using these ideas.

Try to model, say the log of the number of district-level votes for the top two most voted parties.

You can introduce latitude of the district centres (if you have such information from geospatial database you could join), distance of the district to one of the four largest cities in Sweden, or the socio-economic indicators of the district for Swedish Central Statistical Bureau, etc., as covariates.

But this is a good project and beyond current scope (mainly due to time limitations).

Prelude to Statistical Machine Learning

Here, we just start you off on the path to more statistical modeling for purposes of prediction. Now statistical learning from the 1970s is needed to mathematically justify the methods.

The following is a teaser of what you will see in the first couple weeks of you course in 'statistical machine learning'.

Loss functions and gradient descent

[this header was adapted from some notes by Benny Avelin]

In the above example with linear regression we wanted to minimize the vertical distance between the fitted line and the data, this vertical distance is a prime example of a loss function. In general when we are faced with a regression problem we want a way of measure how good our model is, this quantity that we want to minimise is called the loss function and its expectation (over different sample data-points is called the risk). The mathematical statistical justification for this approach towards minimising expected loss or risk is called empirical risk minimisation, as we will see in the sequel in more detail.

Let us circle back to linear regression once again. The way the np.argmin method searched for the minimum of:

L(a,b)=Ni=1(yifa,b(xi))2

was by simply evaluating L(a,b) for each value of a in the array prop_a with our guessed values for a and picking the a that minimised L(a,b). Recall we fixed b in the search.

np.argmin? # see the docstring for np.argmin and other functions/methods we are using throughout if you need to know right away.

This approaching of evaluating the loss at a set of parameter values quickly becomes infeasible when the dimension of the problem is larger than 1.

Even if we just have two guess for each dimension of the parameter space with d dimensions, then we will need to evaluate the loss at 2d parameter values. When d=10,100,1000 the number of evaluation points become 1024, 1.268e30,1.072e301, respectively.

Often in big-data settings, the number of dimensions for the regression problem can easily extend over a few thousands. Thus, we need a systematic way to find the optimal parameters, i.e., the parameters that minimise the loss function.

The iterative solution is called gradient descent and it goes like this:

  • Initialise: Let us start with some initial parameters, say in our linear regression example (a,b)=(0,0), say at iteration i=0.
  • Update: then we construct an update rule like the following to update the parameter values at i+1 from those at iteration i:
    • ai=ai1ldLda(ai1,bi1)
    • bi=bi1ldLdb(ai1,bi1)
    • where l>0 is called the learning rate.
  • Stop: Finally we stop when a stopping rule like the following is satisfied: ((L(ai+1,bi+1)L(ai,bi))2)<τ, where, τ is some tolerance threshold that says we are close enough to the minimum value found by our iteration.

Introduction to R in SageMath Jupyter IPython Notebook

  1. How to run R commands in SageMath
    • doing linear regression regression using R's builtin lm (linear model) package in SageMath/R
    • installing non-builtin packages, loading libraries and data

Running R in SageMath is "easy as":

  • Use %%r to denote that the Code cell is of language R

First note that SageMath/Python and R kernels will be available in the SageMath Jupyter notebook.

In [14]:
# this is x and y available as numpy arrays in SageMath/Python
logLightIntens_logSurfTemp=[(4.37,5.23),(4.56,5.74),
(4.26,4.93),(4.56,5.74),(4.30,5.19),(4.46,5.46),(3.84,4.65),(4.57,5.27),(4.26,5.57),(4.37,5.12),(3.49,5.73),
(4.43,5.45),(4.48,5.42),(4.01,4.05),(4.29,4.26),(4.42,4.58),(4.23,3.94),(4.42,4.18),(4.23,4.18),(3.49,5.89),
(4.29,4.38),(4.29,4.22),(4.42,4.42),(4.49,4.85),(4.38,5.02),(4.42,4.66),(4.29,4.66),(4.38,4.90),(4.22,4.39),
(3.48,6.05),(4.38,4.42),(4.56,5.10),(4.45,5.22),(3.49,6.29),(4.23,4.34),(4.62,5.62),(4.53,5.10),(4.45,5.22),
(4.53,5.18),(4.43,5.57),(4.38,4.62),(4.45,5.06),(4.50,5.34),(4.45,5.34),(4.55,5.54),(4.45,4.98),(4.42,4.50)]
CleanedlogLightIntens_logSurfTemp=\
np.array([yx for yx in logLightIntens_logSurfTemp if yx[1]<5.9 and yx[0]>4]) # data range constraint
x=CleanedlogLightIntens_logSurfTemp[:,1]
y=CleanedlogLightIntens_logSurfTemp[:,0]
print x
print y
[5.23 5.74 4.93 5.74 5.19 5.46 5.27 5.57 5.12 5.45 5.42 4.05 4.26 4.58
 3.94 4.18 4.18 4.38 4.22 4.42 4.85 5.02 4.66 4.66 4.9  4.39 4.42 5.1
 5.22 4.34 5.62 5.1  5.22 5.18 5.57 4.62 5.06 5.34 5.34 5.54 4.98 4.5 ]
[4.37 4.56 4.26 4.56 4.3  4.46 4.57 4.26 4.37 4.43 4.48 4.01 4.29 4.42
 4.23 4.42 4.23 4.29 4.29 4.42 4.49 4.38 4.42 4.29 4.38 4.22 4.38 4.56
 4.45 4.23 4.62 4.53 4.45 4.53 4.43 4.38 4.45 4.5  4.45 4.55 4.45 4.42]

Assigning to x and y in SageMath/R

We use the assignment operator, <-, in R, as follows:

In [15]:
%%r
x <- c(5.23,  5.74,  4.93,  5.74,  5.19,  5.46,  5.27,  5.57,  5.12,
         5.45,  5.42,  4.05,  4.26,  4.58,  3.94,  4.18,  4.18,  4.38,
         4.22,  4.42,  4.85,  5.02,  4.66,  4.66,  4.9 ,  4.39,  4.42,
         5.1 ,  5.22,  4.34,  5.62,  5.1 ,  5.22,  5.18,  5.57,  4.62,
         5.06,  5.34,  5.34,  5.54,  4.98,  4.5)
y <- c(4.37,  4.56,  4.26,  4.56,  4.3 ,  4.46,  4.57,  4.26,  4.37,
         4.43,  4.48,  4.01,  4.29,  4.42,  4.23,  4.42,  4.23,  4.29,
         4.29,  4.42,  4.49,  4.38,  4.42,  4.29,  4.38,  4.22,  4.38,
         4.56,  4.45,  4.23,  4.62,  4.53,  4.45,  4.53,  4.43,  4.38,
         4.45,  4.5 ,  4.45,  4.55,  4.45,  4.42)
 [1] 4.37 4.56 4.26 4.56 4.30 4.46 4.57 4.26 4.37 4.43 4.48 4.01 4.29 4.42 4.23
[16] 4.42 4.23 4.29 4.29 4.42 4.49 4.38 4.42 4.29 4.38 4.22 4.38 4.56 4.45 4.23
[31] 4.62 4.53 4.45 4.53 4.43 4.38 4.45 4.50 4.45 4.55 4.45 4.42

Doing Linear Regression in SameMath/R

In [16]:
%%r
linearRegressionModel <- lm(formula = y ~ x + I(x^2))

summary(linearRegressionModel)
Call:
lm(formula = y ~ x + I(x^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.22916 -0.05145  0.01121  0.06263  0.16072 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.87480    1.44471   1.298    0.202
x            0.87443    0.59759   1.463    0.151
I(x^2)      -0.07272    0.06128  -1.187    0.243

Residual standard error: 0.09108 on 39 degrees of freedom
Multiple R-squared:  0.484,	Adjusted R-squared:  0.4576 
F-statistic: 18.29 on 2 and 39 DF,  p-value: 2.491e-06

Running R in SageMath is "easy as":

Sometimes you need additional R packages.

  • Installing R packages with install.packages(...)

Note: Once a package is installed on a particular machine using install.packages("wantedpackage") then you only need to load that library using library(wantedpackage) when you are using the same machine.

Additional Packages

One often needs several additional packages to run certain desired R commands. Let's get some such packages.

In other words, you don't have to install packages that are already installed and thus can be automatically found by R in the default location it will be installed at. In the case below, you can see where the package was installed from the following line:

  • Installing package into ‘/some_path_to_where_the_package_is_installed’
In [ ]:
%%r
# there will be further dependencies, you may need to recursively install...
#install.packages("Flury")
#library(Flury)
#data(dead.beetles)

SageMath/R docs

For example, you can find in the docs more systematic/programmatic way to assign SageMath/Python objects to SageMath/R objects.