Introduction to Data Science: A Comp-Math-Stat Approach ¶

YOIYUI001, Summer 2019¶

Assignment 2 for Course YOIYUI001¶

Fill in your Personal Number, make sure you pass the # ... Test cells and submit by email from your official uu.se student email account to raazesh.sainudiin @ math.uu.se with Subject line YOIYUI001 Assignment 2. You can submit multiple times before the deadline and your highest score will be used.

KEEP TRACK of and REPORT the said variables in each PROBLEM or up to now to help Raaz calibrate the exam fairly - your report will be used to justify exam-difficulty:

timeToCompleteThisProblemInMinutes an integer in minutes to report how much time it is taking you to complete this problem on your own (ok to discuss with your mates too; but not in exam)
didIAttendOrCatchUpFromMyMatesOnTheSecondLab is a Boolean True or False to indicate whether you attended or caught up on (from your class mates who attended) what was done in the second computer lab.
timeSpentOnLearningInThisCourseInHours an integer in hours to report how much time you have spent on this course up to now.
- This mainly applies to those who are making regular effort to keep up with the course materials, including attending and/or remotely catching up on lectures and labs, asking questions in class, constructively participating, etc.
- Those who have missed more than three of the last 4 lectures and the second lab and have not gone through the accompanying and co-evolving 0x.ipynb notebooks on their own where x is in $\{5,6,7,8\}$ should report 0 hours here.
- NOTE: 2/3 of the 5hp course is already done and you are supposed to have spent ((5*2/3)/1.5)*37=82.22 hours of full-time study on the course including attending lectures, labs, if you were not too busy with other commitments. You may reserve 2*8=16 hours during study week to revise for this course.

You should update (download most up to date versions of) the lecture notebooks 0x.ipynb notebooks before starting the Assignment.

# Enter your 12 digit personal number here and evaluate this cell
MyPersonalNumber = 'YYYYMMDDXXXX'

#tests
assert(isinstance(MyPersonalNumber, basestring))
assert(MyPersonalNumber.isdigit())
assert(len(MyPersonalNumber)==12)

# Report these variables so the exam can be calibrated fairly - your report will be used to justify exam-difficulty
didIAttendOrCatchUpFromMyMatesOnTheSecondLab = False # replace False by True, if so
timeSpentOnLearningInThisCourseInHours = 0 # replace 0 by a positive integer if it applies

Assignment 2, PROBLEM 0¶

Maximum Points = 1

We have seen how to implement a new iterator (just like a function) but with yield statement (just like return in a function). This model of computation is called continuation. This is very useful in combinatorics, especially when combined with recursion (Computational Mathematics with SageMath, SIAM, 2019, p. 346). Below is an iterator called generateWords(alphabet,L) that can generate all words of of a given length L on a given alphabet.

Your task is simple!

Just understand what the following iterator is doing from the comments in code and explanations earlier.
how we are computing the number of words of length L equalling 3 and then 23 using sum:
- via list comprehension
- via generator expression
You don't need to change any of the code in the next 4 cells, but just understand it.
Finally, try to explain by chosing the right answer below as to why the list comprhension is taking longer to compute than the generator expression as evident by the Wall time (see Wall Time, it's just the elapsed real time from the start to end of a computation).

%%time 
# time for list comprehension to compute the sum of [1,1,1,...,2^23]
sumFromListCom = sum( [ 1 for w in generateWords(['H','T'], 23) ]  )

will result in output:

CPU times: user 6.94 s, sys: 200 ms, total: 7.14 s
Wall time: 7.11 s

%%time 
# time for generator expression to compute the sum of [1,1,1,...,2^23]
sumFromGenEx = sum( ( 1 for w in generateWords(['H','T'], 23) )  )

will result in output:

CPU times: user 5.51 s, sys: 0 ns, total: 5.51 s
Wall time: 5.52 s

(you may have slightly different numbers for time and Wall time based on your machine details at the time of computation).

Multiple-choice Question:

Why is the Wall time for generator expression (genex) smaller that for the list comprehension (listcomp) here?

Answer Choices

A. genex if faster because the individual words are not allocated space in memory, i.e., materialised in memory
B. listcomp is slower because the list of all words is allocated space in memory
C. both A and B are true

choiceForProblem0 = 'X' # replace X by A, B or C

Local Test for Assignment 2, PROBLEM 0¶

Evaluate cell below to make sure your answer is valid. You should not modify anything in the cell below when evaluating it to do a local test of your solution. You may need to include and evaluate code snippets from lecture notebooks in cells above to make the local test work correctly sometimes (see error messages for clues). This is meant to help you become efficient at recalling materials covered in lectures that relate to this problem. Such local tests will generally not be available in the exam.

try:
    assert(choiceForProblem0 in ['A','B','C'])
    print "You have chosen one of the possible options. Hopefully, you are correct."
except AssertionError:
    print "Try again. you have to choose between 'A', 'B' and 'C'."

You have chosen one of the possible options. Hopefully, you are correct.

# This cell is to help you make the right choice between A, B and C
def generateWords(alphabet, L):
    if L == 0:
        yield []
    else:
        for word in generateWords(alphabet, L-1): # here is the recursion when we cann the iterator again on L-1
            for L in alphabet: 
                yield word + [L]

print [ w for w in generateWords(['H','T'], 3) ] # now call the iterator to find all words of length 3 in ['H','T']

print sum( [ 1 for w in generateWords(['H','T'], 3) ]  ) # these words can then be counted by list comprehension
print sum( ( 1 for w in generateWords(['H','T'], 3) )  ) # these words can then be counted by generator expression

print 'The number of words of length 3 from an alphabet of size 2 is 2^3 = ', 2^3 # the above sum`s makes sense

[['H', 'H', 'H'], ['H', 'H', 'T'], ['H', 'T', 'H'], ['H', 'T', 'T'], ['T', 'H', 'H'], ['T', 'H', 'T'], ['T', 'T', 'H'], ['T', 'T', 'T']]
8
8
The number of words of length 3 from an alphabet of size 2 is 2^3 =  8

Assignment 2, PROBLEM 1¶

Maximum Points = 1

Recall how we downloaded Pride and Prejudice and processed it as a String and split it by Chapters. These code snippets are at our disposal now - all we need to do is copy-paste the right set of cells from earlier into the cells below here to have the string from that Book for more refined processing.

Think about what algorithmic constructs and methods one will need to split each sentence by the English words it contains and then count the number of each distinct word.

Now that you have understood for loops, list comprehensions and anonymous functions, and can learn about the needed methods on strings for splitting (which you can search by adding a . after a srt and hitting the Tab button to look through existing methods and followed by ? for their docstrings), the dictionary data structure, and already seen how to count the number of ball labels, you are ready for this problem stated below. If you attended the lab then you have an advantage if you tried to work on this with some help from your instructors.

Problem: Process the English words in a text file, such as those in the book Pride and Prejudice by Jane Austin, and obtain the top K most frequent words that are longer than a given parameter wordLongerThan which can be any value in $\mathbb{Z}_+ := \{ 0, 1, 2, 3, 4, \ldots \}$, i.e., words that are longer than wordLongerThan many characters in length.

Your function must be generic and named as follows including input parameter order and names:

frequencyOftheKMostCommonWordsIn(thisTextFile, wordLongerThan, K)

This function must be capable of:

reading any available text file in the data/ directory that can be passed as the parameter thisTextFile
and return a dict type whose:
- key is the word whose character length is longer than the parameter wordlongerThan and
- value is the frequency of this word in the text file.
- Yor returned dict should only contain the top K most frequent words longer than wordLongerThan and be already sorted in descending order of in frequency.

Use the next cell to submit your answer and for rough-work use more cells as needed in order to copy-paste code snippets from earlier content to get this working. But please remove the cells for rough-work when done.

Note: that you may not import libraries that have not been introduced in the course so far.

# Report these variables so the exam can be calibrated fairly - your report will be used to justify exam-difficulty
timeToCompleteThisProblemInMinutes = 0 # replace 0 by a positive integer if it applies

# Do NOT change the name of the function and names of paramaters !

thisTextFile = 'data/someTextFilename' # try a text file in data/ directory
wordLongerThan = 0 # this can be any larger integer also
K = 20 # this can be any integer larger than 0 also

def frequencyOftheKMostCommonWordsIn(thisTextFile, wordLongerThan, K):
    '''explain what the function is supposed to do briefly'''
    # write the body of the function and replace 'None' with the correct return value
    # ...
    # ...
    return None

Assignment 2, PROBLEM 2¶

Maximum Points = 1

Recall the problem above on counting the number of votes by party across all of Sweden from the Swedish 2018 National Election Data.

Your task is to adapt the code snippets there and others we have encountered thus far to count the total number of votes by each district and return a list of Integers giving the number of votes for the top K districts with the most votes. Your function numberOfVotesInKMostVotedDistrictsInSE('data/final.csv', K) should work for any valid integer K.

Note: that you may not import libraries that have not been introduced in the course so far.

unzip issues: If you are unable to call unzip final.csv.zip on your windows laptop. You can either do it in the computer lab or do the following with internet access to download the large final.csv file from the internet:

%%sh
cd data

curl -O http://lamastex.org/datasets/public/elections/2018/sv/final.csv

Then you should have the needed data/final.csv file.

# Report these variables so the exam can be calibrated fairly - your report will be used to justify exam-difficulty
timeToCompleteThisProblemInMinutes = 0 # replace 0 by a positive integer if it applies

# Do NOT change the name of the function and names of paramaters !

K = 20 # this can be any integer larger than 0 also, change K and make sure your function works
filename = 'data/final.csv' # this has to be a csv file with the same structure as out final.csv

def numberOfVotesInKMostVotedDistrictsInSE(filename, K):
    '''explain what the function is supposed to do briefly'''
    # write the body of the function and replace 'None' with the correct return value
    # ...
    # ...
    return None

Local Test for Assignment 2, PROBLEM 2¶

Evaluate cell below to make sure your answer is valid. You should not modify anything in the cell below when evaluating it to do a local test of your solution. You may need to include and evaluate code snippets from lecture notebooks in cells above to make the local test work correctly sometimes (see error messages for clues). This is meant to help you become efficient at recalling materials covered in lectures that relate to this problem. Such local tests will generally not be available in the exam.

# test that your answer is indeed a probability by evaluating this cell after you replaced XXX above and evaluated it.
try:
    assert(numberOfVotesInKMostVotedDistrictsInSE('data/final.csv', 3) == [13435, 10625, 7910])
    assert(numberOfVotesInKMostVotedDistrictsInSE('data/final.csv', 1) == [13435])
    print("Your answer is correct for two test cases with K=3 and K=1. Hopefully it works correctly for any K")
except AssertionError:
    print("Try again! and make sure you are actually producing what is expected of you.")

Your answer is correct for two test cases with K=3 and K=1. Hopefully it works correctly for any K

Assignment 2, PROBLEM 3¶

Maximum Points = 1

A disadvantage of using list comprehension is that we cannot create a lot of random numbers as we will have to store the returned list. Since you know about generators your task is to use the following warm-up on generating natural numbers and write an iterator version called lcg of the function LinConGen we have been seeing thus far.

def naturals():
    '''define the countably infinite set of natural numbers using an iterator'''
    n = 1 # the first natural number 1
    while True: # an infinite while loop
        yield n # output n
        n = n + 1 # increment n by 1

# Example run - keep printing the natural numbers using the iterator until we hit 5
for n in naturals():
      print(n)
      if n >= 5:
          break

# printing next from our iterator
generateNaturals = naturals() # let's assign our iterator
print(generateNaturals.next())
print(generateNaturals.next())

1
2

zip(naturals(), ['a', 'b', 'c', 'd']) # the second list stops at 4 to give an enumeration that ends

[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]

# Here is the actual task 
# just replace XXX with the right values to make an iterator of function LinConGen
def lcg(m, a, c, x0):
    x = XXX
    while True:
        yield XXX
        x = XXX

Local Test for Assignment 2, PROBLEM 3¶

Evaluate cell below to make sure your answer is valid. You should not modify anything in the cell below when evaluating it to do a local test of your solution. You may need to include and evaluate code snippets from lecture notebooks in cells above to make the local test work correctly sometimes (see error messages for clues). This is meant to help you become efficient at recalling materials covered in lectures that relate to this problem. Such local tests will generally not be available in the exam.

def linConGen(m, a, c, x0, n):
    '''A linear congruential sequence generator.
    
    Param m is the integer modulus to use in the generator.
    Param a is the integer multiplier.
    Param c is the integer increment.
    Param x0 is the integer seed.
    Param n is the integer number of desired pseudo-random numbers.
    
    Returns a list of n pseudo-random integer modulo m numbers.'''
    
    x = x0 # the seed
    retValue = [Mod(x,m)]  # start the list with x=x0
    for i in range(2, n+1, 1):
        x = Mod(a * x + c, m) # the generator, using modular arithmetic
        retValue.append(x) # append the new x to the list
    return retValue
try:
    m, a, c, x0, n = 2147483648, 65539, 0, 1, 5010
    randuListComp = linConGen(m, a, c, x0, n)
    randuGenExp = [x for i,x in zip(xrange(n), lcg(m, a, c, x0))]
    assert(randuListComp == randuGenExp)
    print("It seems like you have passed a test. Hopefully you have the right answer.")
except AssertionError:
    print("Try again. You need the output of your iterator lcg to coincide with that of LinConGen")

It seems like you have passed a test. Hopefully you have the right answer.

Assignment 2, PROBLEM 4¶

Maximum Points = 1

First read and understand the following simple simulation (originally written by Jenny Harlow). Then you will modify the simulation to find the solution to this problem.

A Simple Simulation¶

We could use the samplers we have made to do a very simple simulation. Suppose the inter-arrival times, in minutes, of Orbiter buses at an Orbiter stop in Christchurch follows an $Exponential(\lambda = 0.1)$ distribution. Also suppose that this is quite a popular bus stop, and the arrival of people is very predictable: one new person will arrive in each whole minute. This means that the longer another bus takes in coming, the more people arrive to join the queue. Also suppose that the number of free seats available on any bus follows a $de\, Moivre(k=40)$ distribution, i.e, there are equally like to to be 1, or 2, or 3 ... or 40 spare seats. If there are more spare seats than people in the queue, everyone can get onto the bus and nobody is left waiting, but if there are not enough spare seats some people will be left waiting for the next bus. As they wait, more people arrive to join the queue....

This is not very realistic - we would want a better model for how many people arrive at the stop at least, and for the number of spare seats there will be on the bus. However, we are just using this as a simple example that you can do using the random variables you already know how to simulate samples from.

Try to code this example yourself, using our suggested steps. We have put our version the code into a cell below, but you will get more out of this example by trying to do it yourself first.

Suggested steps:¶

Get a list of 100 $Exponential(\lambda = 0.1)$ samples using the exponentialSamples function. Assign the list to a variable named something like busTimes. These are your 100 simulated bus inter-arrival times.
Choose a value for the number of people who will be waiting at the busstop when you start the simulation. Call this something like waiting.
Make a list called something like leftWaiting, which to begin with contains just the value assigned to waiting.
Make an empty list called something like boardBus.
Start a for loop which takes each element in busTimes in turn, i.e. each bus inter-arrival time, and within the for loop:
- Calculate the number of people arriving at the stop as the floor of the time taken for that bus to arrive (i.e., one person for each whole minute until the bus arrives)
- Add this to the number of people waiting (e.g., if the number of arrivals is assigned to a variable arrivals, then waiting = waiting + arrivals will increment the value assigned to the waiting variable by the value of arrivals).
- Simulate a value for the number of seats available on the bus as one simulation from a $de \, Moirve(k=40)$ RV (it may be easier to use deMoivreFInverse rather than deMoivreSample because you only need one value - remember that you will have to pass a simulated $u \thicksim Uniform(0,1)$ to deMoivreFInverse as well as the value of the parameter $k$).
- The number of people who can get on the bus is the minimum of the number of people waiting in the queue and the number of seats on the bus. Calculate this value and assign it to a variable called something like getOnBus.
- Append getOnBus to the list boardBus.
- Subtract getOnBus from the number of people waiting, waiting (e.g., waiting = waiting - getOnBus will decrement waiting by the number of people who get on the bus).
- Append the new value of waiting to the list leftWaiting.
That is the end of the for loop: you now have two lists, one for the number of people waiting at the stop and one for the number of people who can board each bus as it arrives.

Here is our code to do the bus stop simulation. Yours may be different - maybe it will be better!

You are expected to find the needed functions from the latest notebook this assignment came from and be able to answer this question. Unless you can do it in your head.

def busStopSimulation(buses, lam, seats):
    '''A Simple Simulation - see description above!'''
    BusTimes = exponentialSample(buses,lam)
    waiting = 0 # how many people are waiting at the start of the simulation
    BoardBus = [] # empty list
    LeftWaiting = [waiting] # list with just waiting in it
    for time in BusTimes: # for each bus inter-arrival time
        arrivals = floor(time) # people who arrive at the stop before the bus gets there
        waiting = waiting + arrivals # add them to the queue
        busSeats = deMoivreFInverse(random(), seats) # how many seats available on the bus
        getOnBus = min(waiting, busSeats) # how many people can get on the bus
        BoardBus.append(getOnBus) # add to the list
        waiting = waiting - getOnBus # take the people who board the bus out of the queue
        LeftWaiting.append(waiting) # add to the list
    return [LeftWaiting, BoardBus, BusTimes]

# let's simulate the people left waiting at the bus stop
set_random_seed(None) # replace None by a integer to fix seed and output of simulation
buses = 100
lam = 0.1
seats = 40
leftWaiting, boardBus, busTimes = busStopSimulation(buses, lam, seats)

print(leftWaiting) # look at the leftWaiting list

print(boardBus) # boad bus

print(busTimes)

[0, 0, 0, 0, 0, 8, 6, 14, 18, 17, 24, 3, 0, 0, 19, 14, 0, 0, 0, 0, 6, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 16, 13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 1, 0, 0, 0, 0, 0, 0, 0, 0, 32, 34, 33, 19, 0, 0, 12, 19, 19, 0, 0, 14, 0, 2, 0, 16, 0, 0, 17, 28, 64, 39, 28, 14, 0, 0, 4, 0, 0, 0, 0, 0, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[3, 3, 11, 13, 5, 12, 11, 3, 2, 23, 32, 4, 28, 1, 15, 18, 7, 6, 3, 3, 6, 15, 1, 2, 4, 26, 11, 2, 4, 3, 13, 8, 22, 13, 8, 1, 3, 5, 7, 21, 7, 16, 1, 1, 7, 3, 10, 11, 12, 4, 6, 4, 2, 3, 4, 8, 4, 1, 2, 27, 20, 8, 6, 11, 5, 21, 0, 23, 15, 11, 7, 25, 19, 0, 8, 6, 20, 28, 31, 14, 16, 2, 22, 15, 20, 1, 7, 10, 3, 23, 5, 0, 12, 7, 0, 0, 4, 22, 3, 0]
[3.55186105789830, 3.91293323687127, 11.7603818390426, 13.3753351332180, 13.5223457947002, 10.8378239573718, 19.1598021235742, 7.99635269072175, 1.31581231816128, 30.7125174707266, 11.1201268379955, 1.97443313796591, 28.2519517149384, 20.5443134817688, 10.7278625876135, 4.67186935271976, 7.62008310870979, 6.95679877319448, 3.41289743476174, 9.14702069545650, 0.210916324422715, 15.3320017945975, 2.01352699724159, 1.41024891173278, 4.43039114402281, 26.1394313460252, 13.4868034500580, 0.319273575445181, 4.45398651549891, 3.45084565490733, 13.5669413550116, 24.8333046781771, 19.0672418345034, 0.307175841544859, 8.22203926477697, 1.15747571986727, 3.75883530602798, 5.31581732456943, 7.90897413327541, 21.2569951835936, 7.52092661461753, 16.4833882446308, 1.91184560320560, 1.26322899605742, 7.32451969992288, 3.92769512476959, 22.8074715340580, 0.0359841960844426, 11.1028973769768, 4.12712551405327, 6.47426983301915, 4.25186350510631, 2.20080587623502, 3.46773029884932, 4.75605364855475, 8.58464270847100, 36.1373636567407, 3.56577213427303, 1.05648936051896, 13.5829675432194, 1.45294069791568, 8.79219751716493, 18.8532131517004, 18.2766955891001, 5.75327561417561, 2.76430151729299, 0.977194679229802, 37.8955718394206, 1.47276690795379, 13.3961376192606, 5.32998142917280, 41.5330850668633, 3.24351132430572, 0.184614144119221, 25.3605320985740, 17.8191035865503, 56.9368844043080, 3.74167171122488, 20.7186925218423, 0.711867066554327, 2.03832526633115, 2.33894195232516, 26.2140312664970, 11.5918882087366, 20.8471470581114, 1.64669778481851, 7.16034094224753, 10.8327157394571, 15.3270206446366, 11.4079132025549, 5.10944082904906, 0.416070260457845, 12.8986947277824, 7.03466271195385, 0.631670331560185, 0.900312448355977, 4.44879506711314, 22.3627871942507, 3.57265868657557, 0.612618918172660]

We could do an interactive visualisation of this by evaluating the next cell. This will be showing the number of people able to board the bus and the number of people left waiting at the bus stop by the height of lines on the plot.

@interact
def _(seed=[0,123,456], lam=[0.1,0.01], seats=[40,10,1000]):
    set_random_seed(seed)
    buses=100
    leftWaiting, boardBus, busTimes = busStopSimulation(buses, lam,seats)
    p1 = line([(0.5,0),(0.5,leftWaiting[0])])
    from pylab import cumsum
    csBusTimes=list(cumsum(busTimes))
    for i in range(1, len(leftWaiting), 1):
    
        p1+= line([(csBusTimes[i-1],0),(csBusTimes[i-1],boardBus[i-1])], rgbcolor='green')
        p1+= line([(csBusTimes[i-1]+.01,0),(csBusTimes[i-1]+.01,leftWaiting[i])], rgbcolor='red')

    t1 = text("Boarding the bus", (csBusTimes[len(busTimes)-1]/3,max(max(boardBus),max(leftWaiting))+1), \
          rgbcolor='green',fontsize=10) 
    t2 = text("Waiting", (csBusTimes[len(busTimes)-1]*(2/3),max(max(boardBus),max(leftWaiting))+1), \
          rgbcolor='red',fontsize=10)     
    xaxislabel = text("Time", (csBusTimes[len(busTimes)-1],-10),fontsize=10,color='black')
    yaxislabel = text("People", (-50,max(max(boardBus),max(leftWaiting))+1),fontsize=10,color='black')
    show(p1+t1+t2+xaxislabel+yaxislabel,figsize=[8,5])

Very briefly explain the effect of varying one of the three parameters:

seed
lam
seats

while holding the other two parameters fixed on:

the number of people waiting at the bus stop and
the number of people boarding the bus

by using the dropdown menus in the @interact above. Think if the simulation makes sense and explain why. You can write down your answers using keyboard by double-clicking this cell and writing between --- and ---.

Introduction to Data Science: A Comp-Math-Stat Approach¶

YOIYUI001, Summer 2019¶

Assignment 2 for Course YOIYUI001¶

Assignment 2, PROBLEM 0¶

Local Test for Assignment 2, PROBLEM 0¶

Assignment 2, PROBLEM 1¶

Assignment 2, PROBLEM 2¶

Local Test for Assignment 2, PROBLEM 2¶

Assignment 2, PROBLEM 3¶

Local Test for Assignment 2, PROBLEM 3¶

Assignment 2, PROBLEM 4¶

A Simple Simulation¶

Suggested steps:¶

Introduction to Data Science: A Comp-Math-Stat Approach ¶