Applied Statistics

1MS926, Spring 2019, Uppsala University

©2019 Raazesh Sainudiin. Attribution 4.0 International (CC BY 4.0)

Assignment 3 for Course 1MS926

Fill in your Personal Number, make sure you pass the # ... Test cells, and submit by email from your official uu.se student email account to raazesh.sainudiin @ math.uu.se with the subject line 1MS926 Assignment 3. You can submit multiple times before the deadline and your highest score will be used.

In [ ]:
# Enter your 12 digit personal number here and evaluate this cell
MyPersonalNumber = 'YYYYMMDDXXXX'

#tests
assert(isinstance(MyPersonalNumber, basestring))
assert(MyPersonalNumber.isdigit())
assert(len(MyPersonalNumber)==12)

Assignment 3, PROBLEM 1

Maximum Points = 1

Using the steps in the Sample Exam Problem 1 with Solution in notebook 09.ipynb, derive the maximum likelihood estimate for $n$ IID samples from a random variable with the following probability density function: $$ f(x; \lambda) = \frac{1}{24} \lambda^5 x^4 \exp(-\lambda x), \qquad \text{ where, } \lambda>0, x > 0 $$

You can solve for the MLe by hand (using pencil and paper or key-strokes). Present your solution as the return value of a function called MLeForAssignment3Problem1(x), where x is a list of $n$ input data points.
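As a quick reminder of the general recipe (a sketch of the standard steps only, not the worked answer for this density): for IID samples $x_1,\ldots,x_n$ the log likelihood is

$$ l_n(\lambda) = \sum_{i=1}^n \log f(x_i; \lambda), $$

and the MLe $\widehat{\lambda}_n$ is found by solving $\frac{\partial l_n(\lambda)}{\partial \lambda} = 0$ for $\lambda$ (and checking that the second derivative is negative there, so the stationary point is indeed a maximum).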

In [ ]:
# do not change the name of the function, just replace XXX with the appropriate expressions for the MLe
def MLeForAssignment3Problem1(x):
    '''write comment of what this function does'''
    XXX 
    XXX
    return XXX

Assignment 3, PROBLEM 2

Maximum Points = 2

Joshua Fenemore collected data in 2007 on waiting times at the Orbiter bus-stop close to University of Canterbury, Christchurch, New Zealand. The sampled waiting times at the bus-stop, i.e., the inter-arrival times between consecutive buses, were recorded to the nearest minute and are given in the list sampleWaitingTimes below.

Your task is to assume that the inter-arrival times between Orbiter buses are independent and identically distributed according to an $Exponential(\lambda)$ random variable and to write a generic function:

  • called mleOfExponetialLambdaRVFromIIDSamples(samples),
  • where samples is a list of data points assumed to be drawn from an IID $Exponential(\lambda)$ random variable,
  • such that the function returns the maximum likelihood estimate or MLe $\widehat{\lambda}_n$ of the unknown rate parameter $\lambda$,
  • finally get the MLe for Joshua's data by calling your function on his samples as follows:
    • mleOfRateParameterForOrbiterWaitingTimes = mleOfExponetialLambdaRVFromIIDSamples(sampleWaitingTimes)

(NOTE: The MLe for this model was already derived "above", i.e., in 09.ipynb, so you can directly use this expression to complete your task).
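If it helps, here is a minimal sketch (with a hypothetical list toyData, not Joshua's sample) of two equivalent ways to compute a sample mean $\overline{x}_n$ from a list of data points in Python; the names toyData, toySampleMean and toySampleMeanNp are made up for illustration.

import numpy as np

toyData = [2.0, 5.0, 1.0, 4.0]                     # hypothetical data points, NOT the Orbiter sample
toySampleMean = float(sum(toyData)) / len(toyData) # pure-Python sample mean
toySampleMeanNp = np.mean(np.array(toyData))       # the same sample mean via numpy
print(toySampleMean)   # 3.0
print(toySampleMeanNp) # 3.0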

In [245]:
# do not change the sampled data-points in sampleWaitingTimes
sampleWaitingTimes=[8,3,7,18,18,3,7,9,9,25,0,0,25,6,10,0,10,8,16,9,1,5,16,6,4,1,3,21,0,28,3,8,6,6,11,\
                    8,10,15,0,8,7,11,10,9,12,13,8,10,11,8,7,11,5,9,11,14,13,5,8,9,12,10,13,6,11,13,0,\
                    0,11,1,9,5,14,16,2,10,21,1,14,2,10,24,6,1,14,14,0,14,4,11,15,0,10,2,13,2,22,10,5,\
                    6,13,1,13,10,11,4,7,9,12,8,16,15,14,5,10,12,9,8,0,5,13,13,6,8,4,13,15,7,11,6,23,1]
# do not change the next line, but just replace XXX - 1 POINT
# in the body of the function to complete your task
def mleOfExponetialLambdaRVFromIIDSamples(samples):
    '''XXX'''
    XXX
    XXX
    return XXX

# You should not change anything in the line below - 1 POINT
mleOfRateParameterForOrbiterWaitingTimes = mleOfExponetialLambdaRVFromIIDSamples(sampleWaitingTimes)

Assignment 3, PROBLEM 3

Maximum Points = 2

Use Bounded 1D Optimisation to find the maximum likelihood estimate for the IID $Bernoulli(\theta)$ experiment using data in the array dataSamplesCoinToss below.

HINT: First, study the solution to Sample Exam Problem 2.
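To see the scipy calling pattern in isolation, here is a minimal sketch of Bounded 1D Optimisation on a made-up quadratic (toyObjective and its minimiser at $0.3$ are purely illustrative, not the Bernoulli likelihood):

from scipy import optimize

# hypothetical smooth objective with its minimum at theta = 0.3
def toyObjective(theta):
    return (theta - 0.3)**2

toyResult = optimize.minimize_scalar(toyObjective, bounds=(0.0, 1.0), method='bounded')
print(toyResult.x)   # approximately 0.3
print(toyResult.fun) # approximately 0.0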

In [ ]:
import numpy as np
from scipy import optimize
# do not change next line - dataSamplesCoinToss is the sampled data-points
dataSamplesCoinToss= np.array([0,0,1,1,1,1,1,0,0,1,1,0,0,1,1,0,0])

# finding MLE for IID Bernoulli(theta) RV
# do not change the name of the next function - just replace XXX
def negLogLklOBernoulliIIDSamples(paramLam):
    '''XXX'''
    XXX
    XXX
    return XXX

theta_initial=XXX

# do NOT change the next two lines
boundedBernoulliResult = optimize.minimize_scalar(negLogLklOBernoulliIIDSamples, theta_initial, \
                                                  bounds=(0.001, 0.99), method='bounded')
boundedBernoulliResult

Assignment 3, PROBLEM 4

Maximum Points = 2

Use the Multi-dimensional Constrained Optimisation example above (in 09.ipynb) to numerically find the MLe for the mean and variance parameter based on normallySimulatedDataSamples, an array obtained by a specific simulation of $30$ IID samples from the $Normal(10,2)$ random variable.

Recall that $Normal(\mu, \sigma^2)$ RV has the probability density function given by:

$$ f(x ;\mu, \sigma) = \displaystyle\frac{1}{\sigma\sqrt{2\pi}}\exp\left(\frac{-1}{2\sigma^2}(x-\mu)^2\right) $$

The two parameters, $\mu \in \mathbb{R} := (-\infty,\infty)$ and $\sigma \in (0,\infty)$, are sometimes referred to as the location and scale parameters.

You know that the log likelihood function for $n$ IID samples from a Normal RV with parameters $\mu$ and $\sigma$ is simply $\sum_{i=1}^n \log(f(x_i; \mu,\sigma))$, by the IID assumption.

NOTE: When setting bounding boxes for $\mu$ and $\sigma$, try to start with some guesses like $[-20,20]$ and $[0.1,5.0]$ and make them larger if the solution is at the boundary. Making the left bounding-point for $\sigma$ too close to $0.0$ will cause division-by-zero warnings. Other numerical instabilities can happen in such iterative numerical solutions to the MLe. You need to be patient and learn by trial-and-error. You will see the mathematical theory in more detail in a future course in scientific computing/optimisation, so don't worry too much now beyond learning to use these tools for our problems.
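Purely to illustrate the scipy calling pattern (a sketch with a made-up two-parameter objective, not the Normal negative log likelihood you need), bounded multi-dimensional minimisation looks like this:

import numpy as np
from scipy import optimize

# hypothetical 2-parameter objective with its minimum at (1.0, 2.0)
def toyObjective2D(parameters):
    a = parameters[0]
    b = parameters[1]
    return (a - 1.0)**2 + (b - 2.0)**2

toy_bounding_box = ((-5.0, 5.0), (0.1, 5.0)) # one (lower, upper) pair per parameter
toy_initial_arguments = np.array([0.0, 1.0]) # starting point inside the box
toyResult2D = optimize.minimize(toyObjective2D, toy_initial_arguments, bounds=toy_bounding_box)
print(toyResult2D.x) # approximately [1.0, 2.0]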

In [ ]:
import numpy as np
from scipy import optimize
# do NOT change the next three lines
np.random.seed(123456) # set seed
# simulate 30 IID samples drawn from Normal(10,2)RV
normallySimulatedDataSamples = np.random.normal(10,2,30) 

# define the negative log likelihood function you want to minimise by editing XXX
def negLogLklOfIIDNormalSamples(parameters):
    '''return the -log(likelihood) of normallySimulatedDataSamples with mean and var parameters'''
    mu_param=parameters[0]
    sigma_param=parameters[1]
    XXX
    XXX # add more or less lines as you need
    return XXX 

# you should only change XXX below and not anything else
parameter_bounding_box=((XXX, XXX), (XXX, XXX)) # specify the constraints for each parameter - some guess work...
initial_arguments = np.array([XXX, XXX]) # point in 2D to initialise the minimize algorithm
result_Ass3Prob4 = optimize.minimize(XXX, initial_arguments, bounds=parameter_bounding_box, XXX) 
# call the minimize method above finally! you need to play a bit to get initial conditions and bounding box ok
result_Ass3Prob4

Assignment 3, PROBLEM 5

Maximum Points = 3

Obtain the 95% CI based on the asymptotic normality of the MLE for the mean parameter $\lambda^*$ from $n$ IID $Poisson(\lambda^*)$ trials.

Recall that a random variable $X \sim Poisson(\lambda)$ if its probability mass function is:

$$ f(x; \lambda) = \exp{(-\lambda)} \frac{\lambda^x}{x!}, \quad \lambda > 0, \quad x \in \{0,1,2,\ldots\} $$

The MLe is $\widehat{\lambda}_n = \overline{x}_n$, the sample mean.
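Recall the generic form of a $1-\alpha$ confidence interval based on the asymptotic normality of the MLe (you still need to work out the estimated standard error $\widehat{se}_n$ for the Poisson model):

$$ \widehat{\lambda}_n \pm z_{\alpha/2} \, \widehat{se}_n, \qquad \text{ with } z_{\alpha/2} = z_{0.025} \approx 1.96 \ \text{ for } \alpha = 0.05, $$

where $\widehat{se}_n = \sqrt{1 / I_n(\widehat{\lambda}_n)}$ and $I_n$ denotes the Fisher information of the $n$ IID samples.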

Work out your answer and express it in the next cell by replacing XXXs.

In [ ]:
# Only replace the XXX below, do not change the function names or parameters
samplePoissonCounts = np.array([0,5,11,5,6,8,9,0,1,14,2,4,4,11,2,12,10,5,6,1,7,9,8,0,5,7,11,6,0,1])

def Assignment3Problem5(poissonSamples):
    '''return the 95% confidence interval as a 2-tuple for the unknown mean parameter lambda* 
    from n IID Poisson(lambda*) trials in the input numpy array called poissonSamples'''
    XXX
    XXX
    XXX
    lower95CI=XXX
    upper95CI=XXX
    return (lower95CI,upper95CI)

# do NOT change anything below
lowerCISampleExamProblem5,upperCISampleExamProblem5 = Assignment3Problem5(samplePoissonCounts)
print "The 95% CI for lambda based on IID Poisson(lambda) data in samplePoissonCounts = "
print (lowerCISampleExamProblem5,upperCISampleExamProblem5)

Assignment 3, PROBLEM 6

Maximum Points = 3

For the Orbiter waiting time problem, assuming IID trials as follows:

$$\displaystyle{X_1,X_2,\ldots,X_{n} \overset{IID}{\sim} Exponential(\lambda^*)}$$

Your task is to perform a Wald Test of size $\alpha=0.05$ to try to reject the null hypothesis that the waiting time at the Orbiter bus-stop, i.e., the inter-arrival time between buses, is exactly $10$ minutes:

$$\displaystyle{H_0: \lambda^*=\lambda_0 \quad \text{ versus } \quad H_1: \lambda^* \neq \lambda_0, \qquad \text{ with }\lambda_0=0.1}$$
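As a reminder, the generic Wald statistic for a scalar parameter and the corresponding size-$\alpha$ test are (the model-specific estimated standard error is for you to work out):

$$ W = \frac{\widehat{\lambda}_n - \lambda_0}{\widehat{se}(\widehat{\lambda}_n)}, \qquad \text{ reject } H_0 \text{ if and only if } |W| > z_{\alpha/2} \approx 1.96 \ \text{ for } \alpha = 0.05 . $$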

Show your work by replacing the XXXs with the right expressions in the next cell.

In [ ]:
sampleWaitingTimes = np.array([8,3,7,18,18,3,7,9,9,25,0,0,25,6,10,0,10,8,16,9,1,5,16,6,4,1,3,21,0,28,3,8,6,6,11,\
                               8,10,15,0,8,7,11,10,9,12,13,8,10,11,8,7,11,5,9,11,14,13,5,8,9,12,10,13,6,11,13,0,\
                               0,11,1,9,5,14,16,2,10,21,1,14,2,10,24,6,1,14,14,0,14,4,11,15,0,10,2,13,2,22,10,5,\
                               6,13,1,13,10,11,4,7,9,12,8,16,15,14,5,10,12,9,8,0,5,13,13,6,8,4,13,15,7,11,6,23,1])

#test H0: lambda=0.1
## STEP 1: get the MLE thetaHat
lambdaHat=XXX # you need to use sampleWaitingTimes here!
print "mle lambdaHat = ",lambdaHat

## STEP 2: get the NullLambda or lambda0
NullLambda=XXX
print "Null value of lambda under H0 = ", NullLambda

## STEP 3: get estimated standard error
seLambda=XXX # see Sample Exam Problem 5 in 10.ipynb
print "estimated standard error",seLambda

# STEP 4: get Wald Statistic
W=XXX
print "Wald staatistic = ",W

# STEP 5: conduct the size alpha=0.05 Wald test
# do NOT change anything below
rejectNullAssignment3Problem6 = abs(W) > 2.0 # alpha=0.05, so z_{alpha/2} =1.96 approx=2.0
if (rejectNullAssignment3Problem6):
    print "we reject the null hypothesis that lambda0=0.1"
else:
    print "we fail to reject the null hypothesis that lambda0=0.1"

Assignment 3, PROBLEM 7

Maximum Points = 2

Repeat the three steps for performing a bootstrap, as done in the Median of inter-EQ Time example (notebook 11.ipynb), to find the plug-in estimate and 95% CI for the 99th-Percentile of the inter-EQ time in minutes.

You just need to evaluate the next REQUIRED-CELL and replace XXX with the right expressions in the following cell.

HINT: Median is the $50$-th Percentile.
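For instance (a sketch with a made-up array toyArray, just to recall the relevant numpy call), the 50-th Percentile, i.e., the median, of a dataset can be obtained as follows:

import numpy as np

toyArray = np.array([1.0, 2.0, 3.0, 4.0, 100.0]) # hypothetical data, NOT the inter-EQ times
print(np.percentile(toyArray, 50)) # the 50-th Percentile, i.e., the median -> 3.0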

In [39]:
# REQUIRED-CELL

# DO NOT MODIFY this cell 
# Evaluate this cell before trying this PROBLEM so that the required functions and variables are loaded
import numpy as np
## Be Patient! - This will take more time, about a minute or so
###############################################################################################
def getLonLatMagDepTimes(NZEQCsvFileName):
    '''returns longitude, latitude, magnitude, depth and the origin time as unix time
    for each observed earthquake in the csv file named NZEQCsvFileName'''
    from datetime import datetime
    import time
    from dateutil.parser import parse
    import numpy as np
    
    with open(NZEQCsvFileName) as f:
        reader = f.read() 
        dataList = reader.split('\n')
        
    myDataAccumulatorList =[]
    for data in dataList[1:-1]:
        dataRow = data.split(',')
        myTimeString = dataRow[2] # origintime
        # let's also grab longitude, latitude, magnitude, depth
        myDataString = [dataRow[4],dataRow[5],dataRow[6],dataRow[7]]
        try: 
            myTypedTime = time.mktime(parse(myTimeString).timetuple())
            myFloatData = [float(x) for x in myDataString]
            myFloatData.append(myTypedTime) # append the processed timestamp
            myDataAccumulatorList.append(myFloatData)
        except TypeError, e: # error handling for type incompatibilities
            print 'Error:  Error is ', e
    #return np.array(myDataAccumulatorList)
    return myDataAccumulatorList

myProcessedList = getLonLatMagDepTimes('data/earthquakes.csv')

def interQuakeTimes(quakeTimes):
    '''Return a list of inter-earthquake times in seconds from earthquake origin times.
    Date and time elements are expected to be in the 5th column of the array.
    NEEDS sorted quakeTimes data.'''
    import numpy as np
    retList = []
    if len(quakeTimes) > 1:
        retList = [quakeTimes[i]-quakeTimes[i-1] for i in range(1,len(quakeTimes))]
    #return np.array(retList)
    return retList

def makeBootstrappedConfidenceIntervalOfStatisticT(dataset, statT, alpha, B=100):
    '''make a bootstrapped 1-alpha confidence interval for ANY given statistic statT 
    from the dataset with B Bootstrap replications for 0 < alpha < 1, and 
    return lower CI, upper CI, bootstrapped_samples '''
    from random import randint # randint is predefined in SageMath; import it here so the cell also runs in plain Python
    n = len(dataset) # sample size of the original dataset
    bootstrappedStatisticTs=[] # list to store the statistic T from each bootstrapped data
    for b in range(B):
        #sample indices at random between 0 and len(iQMinutes)-1 to make the bootstrapped dataset
        randIndices=[randint(0,n-1) for i in range(n)] 
        bootstrappedDataset = dataset[randIndices] # resample with replacement from original dataset
        bootstrappedStatisticT = statT(bootstrappedDataset)
        bootstrappedStatisticTs.append(bootstrappedStatisticT)
    # now get the [2.5%, 97.5%] percentile-based CI
    alpaAsPercentage=alpha*100.0
    lowerBootstrap1MinusAlphaCIForStatisticT = np.percentile(bootstrappedStatisticTs,alpaAsPercentage/2)
    upperBootstrap1MinusAlphaCIForStatisticT = np.percentile(bootstrappedStatisticTs,100-alpaAsPercentage/2)

    #print "The inner (1 - "+str(alpha)+" ) percentile based Confidence Interval for the statistic T = "
    #print "[ "+str(lowerBootstrap1MinusAlphaCIForStatisticT)+" , " +str(upperBootstrap1MinusAlphaCIForStatisticT)+" ]"
    return (lowerBootstrap1MinusAlphaCIForStatisticT,upperBootstrap1MinusAlphaCIForStatisticT,\
            np.array(bootstrappedStatisticTs))

interQuakesSecs = interQuakeTimes(sorted([x[4] for x in myProcessedList]))
iQMinutes = np.array(interQuakesSecs)/60.0
###############################################################################################
In [ ]:
statT99thPercentile = lambda dataset : XXX #statistic of interest
alpha=XXX
B=1000 # number of bootstrap samples, reduce this to 100 while debugging and back to 1000 when done
# plug-in point estimate of the 99th-Percentile of inter-EQ Times
plugInEstimateOf99thPercentile = XXX 
# get the bootstrapped samples and build 1-alpha confidence interval
# do NOT change anything below
lowerCIT99P,upperCIT99P,bootValuesT99P = \
                      makeBootstrappedConfidenceIntervalOfStatisticT(iQMinutes, statT99thPercentile, alpha, B)
print "The Plug-in Point Estimate of the 99th-Percentile of inter-EQ Times = ", plugInEstimateOf99thPercentile
print "1-alpha Bootstrapped CI for the 99th-Percentile of inter-EQ Times = ",(lowerCIT99P,upperCIT99P)
print "         for alpha = ",alpha.n(digits=2)," and bootstrap replicates = ",B

Assignment 3, PROBLEM 8

Maximum Points = 2

For the fitted regression model in the next cell, obtain the residuals and plot them against the covariate [see the Residual analysis section in the latest 12.ipynb for the basic ideas conveyed in the last lecture]. How do the residuals compare to a Normal random variable centred at $0$ with a constant variance (summarise in a sentence or two by double-clicking this cell and writing in between the two lines --- below)?

---

---
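As a generic illustration of a residual plot (a sketch on made-up data with a made-up fitted line, not the stellar data below), the residuals are the observed responses minus the fitted values, plotted against the covariate:

import numpy as np
import matplotlib.pyplot as plt

# hypothetical covariate, responses and fitted coefficients (toy_b0, toy_b1)
toy_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
toy_b0, toy_b1 = 0.1, 2.0
toy_fitted = toy_b0 + toy_b1 * toy_x
toy_residuals = toy_y - toy_fitted     # observed minus fitted
plt.plot(toy_x, toy_residuals, 'o')
plt.axhline(0.0, color='k', alpha=0.5) # residuals should scatter around 0 with roughly constant spread
plt.xlabel('covariate')
plt.ylabel('residual')
plt.show()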
In [ ]:
logLightIntens_logSurfTemp=[(4.37,5.23),(4.56,5.74),
(4.26,4.93),(4.56,5.74),(4.30,5.19),(4.46,5.46),(3.84,4.65),(4.57,5.27),(4.26,5.57),(4.37,5.12),(3.49,5.73),
(4.43,5.45),(4.48,5.42),(4.01,4.05),(4.29,4.26),(4.42,4.58),(4.23,3.94),(4.42,4.18),(4.23,4.18),(3.49,5.89),
(4.29,4.38),(4.29,4.22),(4.42,4.42),(4.49,4.85),(4.38,5.02),(4.42,4.66),(4.29,4.66),(4.38,4.90),(4.22,4.39),
(3.48,6.05),(4.38,4.42),(4.56,5.10),(4.45,5.22),(3.49,6.29),(4.23,4.34),(4.62,5.62),(4.53,5.10),(4.45,5.22),
(4.53,5.18),(4.43,5.57),(4.38,4.62),(4.45,5.06),(4.50,5.34),(4.45,5.34),(4.55,5.54),(4.45,4.98),(4.42,4.50)]
CleanedlogLightIntens_logSurfTemp=\
np.array([yx for yx in logLightIntens_logSurfTemp if yx[1]<5.9 and yx[0]>4]) # data range constraint
x=CleanedlogLightIntens_logSurfTemp[:,1]
y=CleanedlogLightIntens_logSurfTemp[:,0]

from scipy.linalg import lstsq
import matplotlib.pyplot as plt
import numpy as np
M1 = x[:, np.newaxis]**[0, 1] # design matrix with columns x^0 (intercept) and x^1
b, res, rnk, s = lstsq(M1, y)
plt.plot(x, y, 'o', label='data')
xx = np.linspace(3.9, 5.8, 101)
yy = b[0] + b[1]*xx
plt.plot(xx, yy, label='least squares fit')
plt.xlabel('log light intensity (X)')
plt.ylabel('log surface temperature (Y)')
plt.legend(framealpha=1, shadow=True)
plt.grid(alpha=0.25)
plt.text(4, 4.7, r'$\widehat{r}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x, \quad \widehat{\beta}_0 = $ %(b0)0.3f , $\widehat{\beta}_1 = $ %(b1)0.3f' % {'b0': b[0], 'b1': b[1]} )
plt.show()

# Obtain the residuals and plot them (summarise in the markdown cell above)
XXX
XXX
XXX