05. Random Variables, Expectations, Data, Statistics, Arrays and Tuples, Iterators and Generators

Inference Theory 1

©2018 Raazesh Sainudiin. Attribution 4.0 International (CC BY 4.0)

Topics

  1. Continuous Random Variables
  2. Expectations
  3. Data and Statistics
  4. Sample Mean
  5. Sample Variance
  6. Order Statistics
  7. Frequencies
  8. Empirical Mass Function
  9. Empirical Distribution Function
  10. Live Data Ingestion with Try-Catch
  11. Arrays
  12. Tuples
  13. Iterators
  14. Generators

Random Variables

A random variable is a mapping from the sample space $\Omega$ to the set of real numbers $\mathbb{R}$. In other words, it is a numerical value determined by the outcome of the experiment.

We already saw discrete random variables that take values in a discrete set, of two types:

  • those with with finitely many values, eg. the two values in $\{0,1\}$ for the Bernoulli$(\theta)$ RV
  • those with countably infinitely many values, eg. values in the set of all non-negative integers: $\{0,1,2,\ldots\}$, for the 'infinite coin tossing experiment' that records the number of consecutive 'Tails' you see before the first 'Heads' occurs.

Now, we will see the other main type of real-valued random variable.

Continuous random variable

When a random variable takes on values in the continuum we call it a continuous RV.

Examples

  • Volume of water that fell on the Southern Alps yesterday (See video link below)
  • Vertical position above sea level, in micrometers, since the original release of a pollen grain at the head waters of a river
  • Resting position in degrees of a roulettet wheel after a brisk spin

Probability Density Function

A RV $X$ with DF $F$ is called continuous if there exists a piece-wise continuous function $f$, called the probability density function (PDF) $f$ of $X$, such that, for any $a$, $b \in \mathbb{R}$ with $a < b$,

$$ P(a < X \le b) = F(b)-F(a) = \int_a^b f(x) \ dx \ . $$

The following hold for a continuous RV $X$ with PDF $f$:

For any $x \in \mathbb{R}$, $P(X=x)=0$. Consequentially, for any $a,b \in \mathbb{R}$ with $a \le b$ $$P(a < X < b ) = P(a < X \le b) = P(a \leq X \le b) = P(a \le X < b)$$ By the fundamental theorem of calculus, except possibly at finitely many points (where the continuous pieces come together in the piecewise-continuous $f$): $$f(x) = \frac{d}{dx} F(x)$$ And of course $f$ must satisfy: $$\int_{-\infty}^{\infty} f(x) \ dx = P(-\infty < X < \infty) = 1$$

You try at home

Watch the Khan Academy video about probability density functions to warm-up to the meaning behind the maths above. Consider the continuous random variable $Y$ that measures the exact amount of rain tomorrow in inches. Think of the probability space $(\Omega,\mathcal{F},P)$ underpinning this random variable $Y:\Omega \to \mathbb{Y}$. Here the sample space, range or support of the random variable $Y$ denoted by $\mathbb{Y} = [0,\infty) =\{y : 0 \leq y < \infty\}$.

In [1]:
def showURL(url, ht=500):
    """Return an IFrame of the url to show in notebook with height ht"""
    from IPython.display import IFrame
    return IFrame(url, width='95%', height=ht) 
showURL('https://en.wikipedia.org/wiki/Integral',600)
Out[1]:

The Uniform$(0,1)$ RV

The Uniform$(0,1)$ RV is a continuous RV with a probability density function (PDF) that takes the value 1 if $x \in [0,1]$ and $0$ otherwise. Formally, this is written

$$ \begin{equation} f(x) = \mathbf{1}_{[0,1]}(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1 ,\\ 0 & \text{otherwise} \end{cases} \end{equation} $$

and its distribution function (DF) or cumulative distribution function (CDF) is:

$$ \begin{equation} F(x) := \int_{- \infty}^x f(y) \ dy = \begin{cases} 0 & \text{if } x < 0 , \\ x & \text{if } 0 \le x \leq 1 ,\\ 1 & \text{if } x > 1 \end{cases} \end{equation} $$

Note that the DF is the identity map in $[0,1]$.

The PDF, CDF and inverse CDF for a Uniform$(0,1)$ RV are shown below

<img src="images/Uniform01ThreeCharts.png" alt="Uniform01ThreeCharts" width=500>

The Uniform$(0,1)$ is sometimes called the Fundamental Model.

The Uniform$(0,1)$ distribution comes from the Uniform$(a,b)$ family.

$$ \begin{equation} f(x) = \mathbf{1}_{[a,b]}(x) = \begin{cases} \frac{1}{(b-a)} & \text{if } a \le x \le b,\\ 0 & \text{otherwise} \end{cases} \end{equation} $$

This is saying that if $X$ is a Uniform$(a,b)$ RV, then all values of $x$ between $a$ and $b$, i.e., $a \le x \le b$, are equally probable. The Uniform$(0,1)$ RV is the member of the family where $a=0$, $b=1$.

The PDF and CDF for a Uniform$(a,b)$ RV are shown from wikipedia below

500px-Uniform_Distribution_PDF_SVG.svg.png wikipedia image 500px-Uniform_cdf.svg.png

You can dive deeper into this family of random vaiables here.

SageMath has a function for simulating samples from a Uniform$(a,b)$ distribution. We will learn more about this later in the course. Let's go ahead and use it to simulate samples from it below.

In [4]:
uniform(-1,1)  # reevaluate the cell to see how the samples change upon each re-evaluation
Out[4]:
-0.17367706188369336
In [ ]:
 

Expectations

The expectation of $X$ is also known as the population mean, first moment, or expected value of $X$.

$$ \begin{equation} E\left(X\right) := \int x \, dF(x) = \begin{cases} \sum_x x \, f(x) & \qquad \text{if }X \text{ is discrete} \\ \int x \, f(x)\,dx & \qquad \text{if } X \text{ is continuous} \end{cases} \end{equation} $$

Sometimes, we denote $E(X)$ by $E X$ for brevity. Thus, the expectation is a single-number summary of the RV $X$ and may be thought of as the average.

In general though, we can talk about the Expectation of a function $g$ of a RV $X$.

The Expectation of a function $g$ of a RV $X$ with DF $F$ is:

$$ \begin{equation} E\left(g(X)\right) := \int g(x)\,dF(x) = \begin{cases} \sum_x g(x) f(x) & \qquad \text{if }X \text{ is discrete} \\ \int g(x) f(x)\,dx & \qquad \text{if } X \text{ is continuous} \end{cases} \end{equation} $$

provided the sum or integral is well-defined. We say the expectation exists if

$$ \begin{equation} \int \left|g(x)\right|\,dF(x) < \infty \ . \end{equation} $$

When we are looking at the Expectation of $X$ itself, we have $g(x) = x$

Thinking about the Expectations like this, can you see that the familiar Variance of X is in fact the Expection of $g(x) = (x - E(X))^2$?

The variance of $X$ (a.k.a. second moment)

Let $X$ be a RV with mean or expectation $E(X)$. The variance of $X$ denoted by $V(X)$ or $VX$ is

$$ V(X) := E\left((X-E(X))^2\right) = \int (x-E(X))^2 \,d F(x) $$

provided this expectation exists. The standard deviation denoted by $\sigma(X) := \sqrt{V(X)}$.

Thus variance is a measure of ``spread'' of a distribution.

The $k$-th moment of a RV comes from the Expectation of $g(x) = x^k$.

We call

$$ E(X^k) = \int x^k\,dF(x) $$

the $k$-th moment of the RV $X$ and say that the $k$-th moment exists when $E(|X|^k) < \infty$.

Properties of Expectations

  1. If the $k$-th moment exists and if $j
  2. If $X_1,X_2,\ldots,X_n$ are RVs and $a_1,a_2,\ldots,a_n$ are constants, then $E \left( \sum_{i=1}^n a_i X_i \right) = \sum_{i=1}^n a_i E(X_i)$
  3. Let $X_1,X_2,\ldots,X_n$ be independent RVs, then
    • $E \left( \prod_{i=1}^n X_i \right) = \prod_{i=1}^{n} E(X_i)$
  4. $V(X) = E(X^2) - (E(X))^2$
  5. If $a$ and $b$ are constants, then: $V \left(aX + b\right) = a^2V(X) $
  6. If $X_1,X_2,\ldots,X_n$ are independent and $a_1,a_2,\ldots,a_n$ are constants, then: $V \left( \sum_{i=1}^n a_i X_i \right) = \sum_{i=1}^n a_i^2 V(X_i)$

You try at home

Watch the Khan Academy videos about probability density functions and expected value if you want to get another angle on the material more slowly step-by-step:

The population mean and variance of the Bernoulli$(\theta)$ RV

We have already met the discrete Bernoulli$(\theta)$ RV. Remember, that if we have an event $A$ with $P(A) = \theta$, then a Bernoulli$(\theta)$ RV $X$ takes the value $1$ if "$A$ occurs" with probability $\theta$ and $0$ if "$A$ does not occur" with probability $1-\theta$.

In other words, the indicator function $\mathbf{1}_A$ of "$A$ occurs" with probability $\theta$ is the Bernoulli$(\theta)$ RV.

For example, flip a fair coin.
Consider the event that it turns up heads. Since the coin is fair, the probability of this event $\theta$ is $\frac{1}{2}$. If we define an RV $X$ that takes the value 1 if the coin turns up heads ("event coin turns up heads occurs") and 0 otherwise, then we have a Bernoulli$(\theta = \frac{1}{2})$ RV.

We all saw that given a parameter $\theta \in [0,1]$, the probability mass function (PMF) for the Bernoulli$(\theta)$ RV $X$ is:

$$ f(x;\theta) = \theta^x (1-\theta)^{1-x} \mathbf{1}_{\{0,1\}}(x) = \begin{cases} \theta & \text{if } x=1 \ ,\\ 1-\theta & \text{if } x=0 \ ,\\ 0 & \text{otherwise} \end{cases} $$

and its DF is:

$$ \begin{equation} F(x;\theta) = \begin{cases} 1 & \text{if } 1 \le x \ ,\\ 1-\theta & \text{if } 0 \le x < 1 \ ,\\ 0 & \text{otherwise} \end{cases} \end{equation} $$

Now let's look at some expectations: the population mean and variance of an RV $X \thicksim$ Bernoulli$(\theta)$.

Because $X$ is a discrete RV, our expectations use sums rather than integrals.

The first moment or expectation is:

$$ \begin{array}{lcl} E(X) & = & \displaystyle\sum_{x=0}^{1}xf(x;\theta) \\ &=& (0 \times (1-\theta)) + (1 \times \theta)\\ &=& 0 + \theta\\ &=& \theta\\ \end{array} $$

The second moment is: $$\begin{array}{lcl} E(X^2) &=& \displaystyle\sum_{x=0}^{1}x^2f(x;\theta) \\ &=& (0^2 \times (1-\theta)) + (1^2 \times \theta)\\ &=& 0 + \theta\\ &=& \theta \end{array} $$

The variance is: $$ \begin{array}{lcl} V(X) &=& E(X^2) - \left(E(X)\right)^2\\ &=& \theta - \theta^2\\ &=& \theta(1-\theta) \end{array} $$

We can see that $E(X)$ and $V(X)$ will vary with the parameter $\theta$. This is why we subscript $E$ and $V$ with $\theta$, to emphasise that the values depend on the parameter.

$$E_{\theta}(X) = \theta$$$$V_{\theta}(X) = \theta(1-\theta)$$

We can use Sage to do a simple plot to see how $E_{\theta}(X)$ and $V_{\theta}(X)$ vary with $\theta$.

In [5]:
def bernoulliPopMean(th):
    '''A function to find the population mean for an RV distributed Bernoulli(theta).
    
    parameter th is the distribution parameter theta.'''

    return th
    
def bernoulliPopVariance(th):
    '''A function to find the population variance for an RV distributed Bernoulli(theta).
    
    parameter th is the distribution parameter theta.'''
    
    return th*(1-th)
In [6]:
?points
In [10]:
# using a list comprehension into Geometric primitive of 'points' we can plot as follows
p = points([(x*0.01,bernoulliPopMean(x*0.01)) for x in range(0,100)], rgbcolor="blue") # assign the plot to p
p += points([(x*0.01,bernoulliPopVariance(x*0.01)) for x in range(0,100)], rgbcolor="red") # add variance plot to p
show(p, figsize=[6,3])

Note how the variance is maximized at $\theta=\frac{1}{2}$.

The population mean and variance of the Uniform$(0,1)$ RV

Now let's look at the the population mean and variance of a continuous RV $X \thicksim$ Uniform$(0,1)$.

Because $X$ is a continuous RV, our expectations use integrals.

$$ \begin{array}{lcl} E(X) &=&\int_{x=0}^1 x f(x)\, dx\\ &=& \int_{x=0}^1 x \ 1 \, dx\\ &=& \frac{1}{2} \left( x^2 \right]_{x=0}^{x=1}\\ &=& \frac{1}{2} \left( 1-0 \right)\\ &=& \frac{1}{2} \end{array} $$$$ \begin{array}{lcl} E(X^2) &=& \int_{x=0}^1 x^2 f(x)\, dx \\ &=& \int_{x=0}^1 x^2 \ 1 \, dx\\ &=& \frac{1}{3} \left( x^3 \right]_{x=0}^{x=1}\\ &=& \frac{1}{3} \left( 1-0 \right)\\ &=& \frac{1}{3}\\ \end{array} $$$$ \begin{array}{lcl} V(X) &=& E(X^2) - \left(E(X)\right)^2\\ &=&\frac{1}{3} - \left( \frac{1}{2} \right)^2\\ &=& \frac{1}{3} - \frac{1}{4}\\ &=& \frac{1}{12} \end{array} $$

Winnings on Average

Think about playing a game where we draw $x \thicksim X$ and I pay you $r(x)$ ($r$ is some reward function, a function of $x$ that says what your reward is when $x$ is drawn). Then, your average winnings from the game is the sum (or integral), over all the possible values of $x$, of $r(x) \times$ the chance that $X=x$.

Put formally, if $Y= r(X)$, then

$$E(Y) = E(r(X)) = \int r(x) \,dF(x)$$

Probability is an Expectation

Remember when we first talked about the probability of some event $A$, we talked about the idea of the probability of $A$ as the long term relative frequency of $A$?

Now consider some event $A$ and a reward function $r(x) = \mathbf{1}_A(x)$.

Recall that $\mathbf{1}_A(x) = 1$ if $ x \in A$ and $\mathbf{1}_A(x) = 0$ if $ x \notin A$: the reward is 1 if $x \in A$ and 0 otherwise.

$$ \begin{array}{lcl} \text{If } X \text{ is continuous } E(\mathbf{1}_A(X)) &=& \int \mathbf{1}_A(x)\, dF(x)\\ &=& \int_A f(x)\, dx\\ &=& P(X \in A) = P(A)\\ \text{If } X \text{ is discrete } E(\mathbf{1}_A(X)) &=& \mathbf{1}_A(x)\, dF(x)\\ &=& \sum_{x \in A} f(x)\\ &=& P(X \in A) = P(A) \\ \end{array} $$

This says that probability is a special case of expectation: the probability of $A$ is the expectation that $A$ will occur.

Take a Uniform$(0,1)$ RV $X$. What would you say the probability that an observation of this random variable is $\le 0.5$ is, ie what is $P(X \le 0.5)$?

Let's use SageMath to simulate some observations for us and look at the relative frequency of observations $\le 0.5$:

In [12]:
uniform(0,1) # remember calling this each time changes the outcome - reevaluate this cell and see
Out[12]:
0.11257123852945661
In [16]:
countObOfInterest = 0    # variable to count observations of interest
numberOfObs = 1000        # variable to control how many observations we simulate
obOfInterest = 0.5        # variable for observation of interest
for i in range(numberOfObs): # loop to simulate observations
    if uniform(0,1) <= obOfInterest:    # conditional statement to check observation
        countObOfInterest += 1    # accumulate count of observation of interest
        
print "The relative frequency of x <=", obOfInterest.n(digits=2), \
      " was", RR(countObOfInterest/numberOfObs).n(digits=3)   # just formatting out print output
The relative frequency of x <= 0.50  was 0.487

Or, we could look at a similar simulation for a discrete RV, say a Bernoulli$(\frac{1}{2})$ RV.

Another way of thinking about the Bernoulli$(\frac{1}{2})$ RV is that it has a discrete uniform distribution over $\{0,1\}$. It can take on a finite number of values (0 and 1 only) and the probabilities of observing either of these two values are are equal.

This could be modelling the event that we get a head when we throw a fair coin. For this we'll use the randint(0,1) function to simulate the observed value of our RV $X$.

In [19]:
randint(0,1) # try again and again
Out[19]:
0
In [23]:
countObOfInterest = 0    # variable to count observations of interest
numberOfObs = 1000        # variable to control how many observations we simulate
obOfInterest = 1        # variable for observation of interest
for i in range(numberOfObs): # loop to simulate observations
    if randint(0,1) == obOfInterest:    # conditional statement to check observation
        countObOfInterest += 1    # accumulate count of observation of interest
        
print "The relative frequency of x ==", obOfInterest, \
      " was", RR(countObOfInterest/numberOfObs).n(digits=3)   # just formatting out print output
The relative frequency of x == 1  was 0.500

The de Moivre$(\frac{1}{k}, \frac{1}{k}, \ldots, \frac{1}{k})$ RV

We have seen that a Bernoulli$(\theta)$ RV has two outcomes (0, and 1). What if we are interested in modelling situations where there are more than two outcomes of interest? For example, we could use a $Bernoulli(\frac{1}{2})$ RV to model whether the outcome of the flip of a fair coin is a head, but we can't use it for modelling a RV which is the number we get when we toss a six-sided die.

So, now, we will consider a natural generalization of the Bernoulli$(\theta)$ RV with more than two outcomes. This is called the de Moivre$(\frac{1}{k}, \frac{1}{k}, \ldots, \frac{1}{k})$ RV (after Abraham de Moivre, 1667-1754), one of the first great analytical probabalists).

A de Moivre$(\frac{1}{k}, \frac{1}{k}, \ldots, \frac{1}{k})$ RV $X$ has a discrete uniform distribution over $\{1, 2, ..., k\}$: there are $k$ possible equally probable ('equiprobable') values that the RV can take.

If we are rolling a die and $X$ is the number on die, then $k=6$.

Or think of the New Zealand Lotto game. There are 40 balls in the machine, numbered $1, 2, \ldots, 40$. The number on the first ball out of the machine can be modelled as a de Moivre$(\frac{1}{40}, \frac{1}{40}, \ldots, \frac{1}{40})$ RV.

We say that an RV $X$ is de Moivre$(\frac{1}{k}, \frac{1}{k}, \ldots, \frac{1}{k})$ distributed if its probability mass function PMF is:

$$ f \left(x; \left( \frac{1}{k}, \frac{1}{k}, \ldots, \frac{1}{k} \right) \right) = \begin{cases} 0 & \quad \text{if } x \notin \{1,2,\ldots,k\}\\ \frac{1}{k} & \quad \text{if } x \in \{1,2,\ldots,k\} \end{cases} $$

We can find the expectation: $$ \begin{array}{lcl} E(X) & = & \sum_{x=1}^k xP(X=x)\\ &=& (1 \times \frac{1}{k}) + (2 \times \frac{1}{k}) + \ldots + (k \times \frac{1}{k})\\ &=& (1 + 2 + \dots + k)\frac{1}{k}\\ &=& \frac{k(k+1)}{2}\frac{1}{k}\\ &=& \frac{k+1}{2} \, , \end{array} $$

the second moment: $$ \begin{array}{lcl} E(X^2) & =& \sum_{x=1}^k x^2P(X=x)\\ & =& (1^2 \times \frac{1}{k}) + (2^2 \times \frac{1}{k}) + \ldots + (k^2 \times \frac{1}{k})\\ &=& (1^2 + 2^2 + \dots + k^2)\frac{1}{k}\\ &=& \frac{k(k+1)(2k+1)}{6}\frac{1}{k}\\ &=& \frac{2k^2+3k+1}{6}\, , \end{array} $$

and finally the variance: $$ \begin{array}{lcl} V(X) &=& E(X^2) - \left(E(X)\right)^2\\ &=& \frac{2k^2+3k+1}{6} - \left( \frac{k+1}{2} \right)^2\\ &=&\frac{2k^2+3k+1}{6} - \frac{k^2 + 2k +1}{4}\\ &=& \frac{4(2k^2 + 3k + 1) - 6(k^2 + 2k + 1) }{24}\\ &=& \frac{8k^2 + 12k + 4 - 6k^2 - 12k - 6 }{24}\\ &=& \frac{2k^2-2}{24} \\ &=& \frac{k^2-1}{12} \, . \end{array} $$

We coud use the Sage randint function to simulate the number on the first ball in a Lotto draw:

In [32]:
myList=[] 
nn=10000
for i in range(nn):
    myList.append(randint(1,40))
len([x for x in myList if x==20])/RR(nn)
Out[32]:
0.0254000000000000
In [30]:
1/40.
Out[30]:
0.0250000000000000

Statistics

The official NZ Government site for statistics about New Zealand is http://www.stats.govt.nz/ and that for Sweden is http://www.scb.se/en/About-us/official-statistics-of-sweden/

Take a tour through it!

Data and statistics

In general, given some probability triple $(\Omega, \mathcal{F}, P)$, let the function $X$ measure the outcome $\omega$ from the sample space $\Omega$.

$$X(\omega): \Omega \rightarrow \mathbb{X}$$

The RV $X$ is called data, if $X$ is the finest available measurement of $\Omega$.

$\mathbb{X}$ is called the data space (sample space of the data $X$).

$X(\omega)=x$ is the outcome $\omega$ measured by $X$ and is called the observed data or the realisation of $X$.

Often the measurements made by $X$ are real numbers or structures of real numbers or structures that are a mixture of combinatorial objects and real numbers (they typically arise from more general data objects/structures, like addresses, images, videos, trajectories, networks, etc.).

Recall the definition of a real-valued or $\mathbb{R}$-valued random variable we have already seen (an $\mathbb{R}$-valued random variable $X$ is a specific function or map from the sample space to the real line $\mathbb{R}$). We have also seen that $X$ can in fact be a real-vector-valued or $\mathbb{R}^n$-valued random variable $X=(X_1,X_2,\ldots,X_n)$, i.e. a vector of $\mathbb{R}$-valued random variables or simply a random vector. A random variable can be much more general as we will see in the sequel - it can be image-valued, movie-valued, mp3-song-valued, graph-valued, etc.

When we talked about an experiment involving two IID Bernoulli$(\frac{1}{2})$ RVs above (tossing two coins), we listed the different results we might get as (0,0), (1,0), (0,1), (1,1).

Say we observe the outcome $\omega$ = (H, H) (two heads). Our $Bernoulli$ random vector $X$ measures this as (1,1). Thus, (1,1) is the observed data or the realisation of $X$.

So what is a statistic?

A statistic is any measureable function of the data: $T(x): \mathbb{X} \rightarrow \mathbb{T}$ or simply that $\mathbb{X} \overset{T}{\rightarrow} \mathbb{T}$.

Thus, a statistic $T$ is also an RV that takes values in the space $\mathbb{T}$.

When $x \in \mathbb{X}$ is the observed data, $T(x)=t$ is the observed statistic of the observed data $x$, i.e., $x \mapsto t$.

Question: Can Statistic be the Data?

<p style="color:#F8F9F9";>Consider the identity map... with $T(x)=x$ and $\mathbb{X} = \mathbb{T}$.</p>

Remember $\Omega$ is always under the hood!

$$ \Omega \overset{X}{\rightarrow} \mathbb{X} \overset{T}{\rightarrow} \mathbb{T}, \quad \text{with} \quad \omega \mapsto x \mapsto t. $$

Generally, one cannot explicity work with $\Omega$ (recall $\Omega$ for the indicator random variable of rain on Southern Alps tomorrow!) but can work with the following axiomatic cascade towards more concretely measurable events:

  • observable events in a sigma-algebra $\mathcal{F}(\Omega)$ on $\Omega$
  • that are captured by the RV $X$ as measurable sets in $\mathcal{F}(\mathbb{X})$, the sigma-algebra over the data space $\mathbb{X}$
  • and typically projected further by the statistic $T$, another RV that measurable maps the sigm-algebra $\mathcal{F}(\mathbb{X})$ to the sigma-algebra $\mathcal{F}(\mathbb{T})$.

Example 2: New Zealand Lotto Data and a Probability Model of it as IID de Moivre RVs

New Zealand lotto data

Let's look at our pre-made New Zealand lotto data. This is the winning first ball drawn from a NZ Lotto machine for several years.

In [2]:
# These lotto draws of the first ball from NZ Lotto was already downloaded and processed for you
listBallOne = [4, 3, 11, 35, 23, 12, 14, 13, 15, 19, 36, 18, 37, 39, 37, 35, 39, 1, 24, 29, 38, 18, 40, 35, \
            12, 7, 14, 23, 21, 35, 14, 32, 19, 2, 1, 34, 39, 29, 7, 20, 2, 40, 28, 4, 30, 34, 20, 37, 9, 24,\
            36, 4, 22, 1, 31, 12, 16, 29, 36, 5, 21, 23, 30, 39, 38, 22, 13, 6, 14, 30, 40, 21, 5, 12, 28, 27,\
            13, 18, 19, 23, 2, 10, 37, 31, 40, 4, 25, 4, 17, 6, 34, 26, 38, 35, 3, 38, 14, 40, 3, 30, 21, 4,\
            24, 34, 27, 14, 25, 18, 21, 1, 25, 39, 18, 40, 18, 11, 5, 37, 33, 26, 29, 26, 36, 33, 18, 32, 3, 1,\
            5, 22, 39, 25, 12, 21, 23, 12, 31, 1, 35, 8, 32, 24, 34, 14, 26, 4, 3, 31, 17, 22, 24, 10, 29, 40,\
            4, 8, 26, 11, 8, 18, 25, 22, 8, 30, 10, 14, 32, 14, 5, 35, 3, 32, 40, 17, 39, 7, 21, 4, 35, 9, 16,\
            30, 30, 11, 28, 22, 38, 5, 16, 27, 16, 23, 22, 1, 27, 32, 30, 24, 32, 29, 11, 3, 26, 19, 22, 25, 3,\
            34, 31, 17, 16, 31, 20, 29, 10, 2, 17, 36, 6, 34, 11, 7, 22, 28, 13, 15, 20, 39, 16, 10, 25, 1, 37,\
            14, 28, 35, 20, 39, 3, 39, 20, 40, 6, 20, 17, 26, 27, 4, 24, 40, 16, 24, 7, 8, 25, 16, 15, 8, 29, 13,\
            16, 39, 2, 24, 24, 23, 24, 37, 39, 40, 5, 11, 13, 6, 24, 1, 5, 7, 15, 38, 3, 35, 10, 22, 19, 3, 21,\
            39, 38, 4, 30, 17, 15, 9, 32, 28, 7, 12, 6, 37, 25, 4, 8, 30, 7, 31, 12, 21, 31, 13, 2, 20, 14, 40,\
            32, 23, 10, 1, 35, 35, 32, 16, 25, 13, 20, 33, 27, 2, 26, 12, 5, 34, 20, 7, 34, 38, 20, 8, 5, 11, 17,\
            10, 36, 34, 1, 36, 6, 7, 37, 22, 33, 7, 32, 18, 8, 1, 37, 25, 35, 29, 23, 11, 19, 7, 21, 30, 23, 12,\
            10, 26, 21, 9, 9, 25, 2, 14, 16, 14, 25, 40, 8, 28, 19, 8, 35, 22, 23, 27, 31, 36, 22, 33, 22, 15, 3,\
            37, 8, 2, 22, 39, 3, 6, 13, 33, 18, 37, 28, 3, 17, 8, 2, 36, 1, 14, 38, 5, 31, 34, 16, 37, 2, 40, 14,\
            16, 21, 40, 5, 21, 24, 24, 38, 26, 38, 33, 20, 25, 7, 33, 12, 22, 34, 34, 20, 38, 12, 20, 7, 28, 26,\
            30, 13, 40, 36, 29, 11, 31, 15, 9, 13, 17, 32, 18, 9, 24, 6, 40, 1, 1, 9, 13, 28, 19, 5, 7, 27, 12,\
            3, 34, 26, 20, 28, 28, 25, 21, 23, 6, 15, 19, 30, 10, 13, 8, 11, 38, 7, 33, 12, 16, 11, 40, 25, 32,\
            34, 1, 32, 31, 33, 15, 39, 9, 25, 39, 30, 35, 20, 34, 3, 30, 17, 24, 20, 15, 10, 25, 6, 39, 19, 20,\
            23, 16, 17, 31, 25, 8, 17, 15, 31, 20, 19, 33, 11, 37, 31, 4, 12, 37, 7, 40, 8, 22, 3, 25, 35, 8, 9,\
            14, 13, 33, 4, 2, 1, 31, 24, 8, 13, 19, 34, 10, 32, 35, 28, 11, 10, 31, 25, 8, 6, 13, 33, 19, 35, 19,\
            8, 21, 10, 40, 36, 16, 27, 31, 1, 18, 36, 40, 18, 37, 18, 24, 33, 34, 31, 6, 10, 24, 8, 7, 24, 27, 12,\
            19, 23, 5, 33, 20, 2, 32, 33, 6, 13, 5, 25, 7, 31, 40, 1, 30, 37, 19, 27, 40, 28, 3, 24, 36, 7, 22,\
            20, 21, 36, 38, 15, 11, 37, 21, 4, 13, 9, 12, 13, 34, 30, 8, 23, 40, 4, 13, 6, 4, 22, 35, 2, 35, 20,\
            9, 28, 9, 13, 33, 19, 5, 38, 24, 18, 37, 10, 25, 25, 31, 3, 13, 25, 35, 1, 36, 21, 3, 22, 23, 7, 6,\
            26, 11, 6, 1, 24, 2, 25, 38, 3, 16, 16, 20, 22, 12, 8, 27, 38, 10, 39, 9, 37, 30, 33, 12, 4, 32, 2,\
            29, 6, 34, 2, 3, 12, 9, 1, 22, 40, 38, 9, 18, 40, 17, 5, 17, 26, 17, 26, 6, 7, 18, 10, 27, 24, 39, 1,\
            3, 26, 38, 2, 12, 5, 7, 38, 2, 8, 30, 35, 18, 19, 29, 37, 5, 27, 35, 40, 14, 25, 15, 20, 32, 22, 9, 1,\
            8, 14, 38, 27, 23, 24, 15, 29, 7, 4, 19, 6, 21, 27, 23, 21, 35, 32, 13, 27, 34, 1, 11, 36, 24, 23, 13,\
            2, 33, 25, 18, 1, 10, 5, 27, 1, 36, 36, 11, 3, 31, 30, 31, 39, 7, 21, 25, 28, 38, 2, 3, 40, 10, 40,\
            12, 22, 20, 16, 14, 30, 16, 19, 33, 32, 30, 19, 36, 16, 27, 7, 18, 38, 14, 14, 33, 29, 24, 21, 22, 15,\
            25, 27, 25, 37, 35, 34, 11, 19, 35, 10, 30, 8, 11, 20, 7, 27, 19, 16, 21, 13, 6, 29, 35, 13, 31, 23,\
            26, 10, 18, 39, 38, 5, 16, 33, 21, 31, 21, 23, 32, 35, 2, 24, 11, 25, 30, 7, 18, 32, 38, 22, 27, 2, 6,\
            31, 24, 34, 33, 15, 39, 21, 9, 1, 8, 38, 37, 40, 14, 2, 25, 30, 16, 6, 36, 27, 28, 8, 17, 37, 15, 29,\
            27, 30, 30, 19, 15, 13, 34, 5, 24, 18, 40, 37, 1, 28, 17, 32, 8, 34, 5, 6, 31, 8, 9, 28, 26, 40, 40,\
            9, 23, 36, 28, 24, 33, 18, 36, 6, 22, 29, 6, 6, 25, 15, 29, 18, 38, 20, 26, 30, 17, 30, 32, 33, 19,\
            10, 29, 25, 24, 19, 28, 38, 3, 24, 12, 28, 29, 29, 20, 12, 11, 12, 21, 11, 24, 36, 3, 3, 5, 28, 2,\
            8, 30, 23, 4, 40, 28, 6, 31, 37, 25, 9, 23, 20, 20, 16, 38, 21, 35, 18, 3, 15, 40, 19, 33, 34, 20,\
            3, 11, 34, 35, 10, 32, 23, 10, 29, 13, 12, 6, 30, 7, 5, 4, 29, 22, 22, 2, 26, 24, 7, 13, 26, 27, 27,\
            15, 12, 18, 38, 33, 4, 11, 20, 33, 21, 5, 26, 10, 22, 36, 3, 4, 35, 35, 16, 32, 5, 19, 23, 24, 40,\
            25, 30, 10, 9, 23, 12, 40, 21, 29, 18, 17, 15, 32, 2, 35, 7, 30, 4, 2, 16, 6, 8, 35]
In [3]:
len(listBallOne) # there are 1114 draws from 
Out[3]:
1114

Data Space

The data space is every possible sequence of ball numbers that we could have got in these 1114 draws. $\mathbb{X} = \{1, 2, \ldots, 40\}^{1114}$. There are $40^{1114}$ possible sequences and

Probability model

We can think of this list listBallOne of sample size $n=1114$ as $x = (x_1,x_2,\ldots,x_{1114})$, the realisation of a random vector $X = (X_1, X_2,\ldots, X_{1114})$ where $X_1, X_2,\ldots, X_{1114} \overset{IID}{\thicksim} de Moivre(\frac{1}{40}, \frac{1}{40}, \ldots, \frac{1}{40})$.

$P \left( (X_1, X_2, \ldots, X_{1114}) = (x_1, x_2, \ldots, x_{1114}) \right) = \frac{1}{40} \times \frac{1}{40} \times \ldots \times \frac{1}{40} = \left(\frac{1}{40}\right)^{1114}$ if $(x_1, x_2, \ldots, x_{1114}) \in \mathbb{X} = \{1, 2, \ldots, 40\}^{1114}$

Our data is just one of the $40^{1114}$ possible points in this data space.

Some statistics

Sample mean

From a given sequence of RVs $X_1, X_2, \ldots, X_n$, or a random vector $X = (X_1, X_2, \ldots, X_n)$, we can obtain another RV called the sample mean (technically, the $n$-sample mean):

$$T_n((X_1, X_2, \ldots, X_n)) = \bar{X_n}((X_1, X_2, \ldots, X_n)) :=\frac{1}{n}\displaystyle\sum_{i=1}^nX_i$$

We write $\bar{X_n}((X_1, X_2, \ldots, X_n))$ as $\bar{X}$,

and its realisation $\bar{X_n}((x_1, x_2, \ldots, x_n))$ as $\bar{x_n}$.

By the properties of expectations that we have seen before,

$$E(\bar{X_n}) = E \left(\frac{1}{n}\sum_{i=1}^nX_i \right) = \frac{1}{n}E\left(\sum_{i=1}^nX_i\right) = \frac{1}{n}\sum_{i=1}^nE(X_i)$$

And because every $X_i$ is identically distributed with the same expectation, say $E(X_1)$, then

$$E(\bar{X_n}) = \frac{1}{n}\sum_{i=1}^nE(X_i)= \frac{1}{n} \times n \times E(X_1) = E(X_1)$$

Similarly, we can show that,

$$V(\bar{X_n}) = V\left(\frac{1}{n}\sum_{i=1}^nX_i\right) = \frac{1}{n^2}V\left(\sum_{i=1}^nX_i\right)$$

And because every $X_i$ is independently and identically distributed with the same variance, say $V(X_1)$, then

$$V(\bar{X_n}) = \frac{1}{n^2}V\left(\sum_{i=1}^nX_i \right) = \frac{1}{n^2} \times n \times V(X_1) = \frac{1}{n} V(X_1)$$

Sample variance

Sample variance is given by $$ \frac{1}{n}\sum_{i=1}^n \left( X_i - \bar{X}_n \right)^2$$ Sometimes, we divide by $n-1$ instead of $n$. It is a measure of spread from the sample.

Similarly, we can look at a sample standard deviation = $\sqrt{\text{sample variance}}$

numpy for numerical Python

Python has a nice module called numpy for numerical computing akin to MATLAB. We will import numpy to use some of its basic statistical capabilities for our Lotto data. To be able to use these capabilities, it is easiest to convert our Lotto Data into a type called a numpy.array or np.array if we import numpy as np. (You will be looking more closely at numpy arrays in the You Try sections below).

In [4]:
import numpy as np
#import array, mean, var, std # make pylab stuff we need accessible in Sage
arrayBallOne = np.array(listBallOne) # make the array out of the list
arrayBallOne     # disclose the list
Out[4]:
array([ 4,  3, 11, ...,  6,  8, 35])
In [ ]:
arrayBallOne.  # hit the Tab after the . and see the available methods on numpy arrays

Now we can get the some sample statistics for the lotto ball one data.

Sample mean of NZ Lotto data

In [5]:
arrayBallOne.mean()  # this calls the mean method and gives our first statistic, the sample mean
Out[5]:
20.649910233393179
In [6]:
arrayBallOne.mean  # if you just call the method without '()' then you are not evaluating it
Out[6]:
<built-in method mean of numpy.ndarray object at 0x7f3959944f80>

Sample vairance and sample standard deviation of NZ Lotto data

In [7]:
arrayBallOne.var()
Out[7]:
135.08030968673549
In [8]:
arrayBallOne.std()
Out[8]:
11.622405503454759
In [9]:
sqrt(arrayBallOne.var())   # just checking std is indeed sqrt(var)
Out[9]:
11.62240550345476

Order statistics

The $k$th order statistic of a sample is the $k$th smallest value in the sample. Order statistics that may be of particular interest include the smallest (minimum) and largest (maximum) values in the sample.

In [10]:
arrayBallOne.min()         # sample minimum statistic
Out[10]:
1
In [11]:
arrayBallOne.max()         # sample maximum statistic
Out[11]:
40
In [12]:
arrayBallOneSorted = np.array(listBallOne) # make the array out of the list for sorting next
arrayBallOneSorted.sort()        # sort the array to get the order statistic - this sorts arrayBallOneSorted 
In [13]:
arrayBallOneSorted             # the sorted data array - numpy automatically displays with intermediates as ...
Out[13]:
array([ 1,  1,  1, ..., 40, 40, 40])
In [14]:
arrayBallOne                   # the original data array
Out[14]:
array([ 4,  3, 11, ...,  6,  8, 35])

Frequencies

The next most interesting statistics we will visit is the frequencies of the ball numbers.

With lots of discrete data like this, we may be interested in the frequency of each value, or the number of times we get a particular value. This can be formalized as the following process of adding indicator functions of the data points $(X_1,X_2,\ldots,X_n)$: $$ \sum_{i=1}^n \mathbf{1}_{\{X_i\}}(x) $$

We can use the SageMath/Python dictionary to give us a mapping from ball numbers in the list to the count of the number of times each ball number comes up.

A Function to Count Elements in a Sequence

Although we are doing this with the Lotto data, mapping a list like this seems like a generally useful thing to be able to do, and so we write it as a function makeFreqDict(myDataSeq) and then use the function on our Lotto data. This is a good example of reusing useful code through a function. We will rely on this function in the sequel for other input data.

We can use long integers if we need to count a lot of items or the usual int type in Python to count the elements in myDataSeq. We can use the default value for the second parameter named one to be of type int by the following second argument with default parameter value: one=int(1) to makeFreqDict. This will enable us to specify one=long(1) if we want the counting to be done in long integers if necessary. See https://docs.python.org/2.0/ref/function.html for default parameters in functions.

Also Python's integer types are supposed to get as big as one needs according to this discussion:

We will still pass in the additional argument for pedagogical reasons and to avoid the more heavy-duty SageMath's type of 'sage.rings.integer.Integer'.

In [15]:
# Note that long is a Python long integer
myLongInteger = long(1)
print(myLongInteger)
type(myLongInteger)
1
Out[15]:
<type 'long'>
In [16]:
type(1)
Out[16]:
<type 'sage.rings.integer.Integer'>
In [17]:
def makeFreqDict(myDataSeq, one = int(1)):
    '''Make a frequency mapping out of a sequence of data - list, array, str.
    
    Param myDataList, a list of data.
    Return a dictionary mapping each unique data value to its frequency count.'''
       
    freqDict = {} # start with an empty dictionary
        
    for res in myDataSeq:
        
        if res in freqDict: # the data value already exists as a key
                freqDict[res] = freqDict[res] + one #int(1) # add 1 to the count
        else: # the data value does not exist as a key value
            freqDict[res] = one # add a new key-value pair for this new data value, frequency 1
        
    return freqDict # return the dictionary created
In [20]:
myExampleDataList = [1,2,2,3,3,3] # a list with one 1, two 2's and three 3's
myExampleFreqDict = makeFreqDict(myExampleDataList) # the frequencies are returned as a dictionary
myExampleFreqDict
Out[20]:
{1: 1, 2: 2, 3: 3}
In [23]:
myExampleFreqDict.items()  # this returns the (k,v) or key,value pairs in the dictionary as a list
Out[23]:
[(1, 1), (2, 2), (3, 3)]
In [24]:
myExampleFreqDict.keys() # this returns the keys
Out[24]:
[1, 2, 3]
In [25]:
myExampleFreqDict.values() # this returns the values which happens to be same in our case!
Out[25]:
[1, 2, 3]
In [26]:
myExampleFreqLongsDict = makeFreqDict(myExampleDataList,long(1)) # the long frequencies are returned as a dictionary
myExampleFreqLongsDict
Out[26]:
{1: 1L, 2: 2L, 3: 3L}
In [37]:
myExampleFreqIntDict = makeFreqDict(myExampleDataList, int(1)) # the int frequencies are returned as a dictionary
myExampleFreqIntDict
Out[37]:
{1: 1, 2: 2, 3: 3}
In [38]:
myExampleFreqDict = makeFreqDict(myExampleDataList) # the frequencies are returned as a dictionary with default one
myExampleFreqDict
Out[38]:
{1: 1, 2: 2, 3: 3}
In [36]:
myExampleFreqIntDict  == myExampleFreqDict   # th last two freq dictionaries are the same!
Out[36]:
True

Note that the function makeFreqDict also works on other sequence types like numpy.arrays, and str or sequence of bytes.

In [39]:
myExampleDataArray = np.array(myExampleDataList)
makeFreqDict(myExampleDataArray)
Out[39]:
{1: 1, 2: 2, 3: 3}
In [40]:
myExampleDataStr = '122333'
makeFreqDict(myExampleDataStr)
Out[40]:
{'1': 1, '2': 2, '3': 3}

Let us use the makeFreqDict to find the frequencies of the balls that popped out first in the 1114 NZ Lotto draws.

In [41]:
ballOneFreqs = makeFreqDict(listBallOne) # call the function
print(ballOneFreqs)                         # disclose it
{1: 29, 2: 28, 3: 31, 4: 24, 5: 27, 6: 30, 7: 31, 8: 32, 9: 22, 10: 27, 11: 25, 12: 28, 13: 29, 14: 22, 15: 22, 16: 27, 17: 20, 18: 28, 19: 28, 20: 32, 21: 30, 22: 30, 23: 27, 24: 36, 25: 37, 26: 22, 27: 26, 28: 24, 29: 24, 30: 33, 31: 28, 32: 27, 33: 28, 34: 27, 35: 34, 36: 24, 37: 26, 38: 30, 39: 22, 40: 37}

Thus, balls labelled by the number 1 come up 29 times in the first ball drawn (Ball One) out of 1,114 Lotto trials. Similarly, the number 2 comes up 28 times, ... 40 comes up 37 times. Of course, these numbers would be different if you have downloaded a more recent data file with additional trials!

So what? Well, we'd hope that the lotto is fair, i.e. that the probability of each ball coming up with any of the available numbers is the same for each number: the probability that Ball One is a 1 is the same as the probability that it is 2, is the same as the probability that it is 3,..... , is the same as the probability that it is 40. If the lotto is fair, the number that comes up on each ball should be a discrete uniform random variable. More formally, we would expect it to be the de Moivre$(\frac{1}{40}, \frac{1}{40}, \ldots, \frac{1}{40})$ RV as lectured. Over the long term, we'd expect the number of times each number comes up on a given trial to be about the same as the number of times any other number comes up on that trial.

We have data from 1987 to 2008, and a first step to assessing the fairness of the lotto (for Ball One, anyway) could be to just visualise the data. We can use the list of points we created above and the SageMath plotting function points to plot a simple graphic like this.

Here we are plotting the frequency with which each number comes up against the numbers themselves, but it is a bit hard to see what number on the ball each red point relates to. To deal with this we add dotted lines going up from the number on the horizontal axis to the corresponding (number, frequency) tuple plotted as a red point.

For these plots let us use SageMath Graphics primitives: point, points and line, and their + operations for superimpositions.

In [ ]:
?point # Returns either a 2-dimensional or 3-dimensional point or "sum" of points as Graphics object
In [42]:
?line # Returns either a 2-dimensional or 3-dimensional line depending on value of points as Graphics object
In [43]:
print(ballOneFreqs.items())
[(1, 29), (2, 28), (3, 31), (4, 24), (5, 27), (6, 30), (7, 31), (8, 32), (9, 22), (10, 27), (11, 25), (12, 28), (13, 29), (14, 22), (15, 22), (16, 27), (17, 20), (18, 28), (19, 28), (20, 32), (21, 30), (22, 30), (23, 27), (24, 36), (25, 37), (26, 22), (27, 26), (28, 24), (29, 24), (30, 33), (31, 28), (32, 27), (33, 28), (34, 27), (35, 34), (36, 24), (37, 26), (38, 30), (39, 22), (40, 37)]
In [45]:
# DATA we plot each item (k,f) in ballOneFreqs.items() as a red points at (k,f), k is ball number, f is frequency
lottoPlotCounts = point(ballOneFreqs.items(), rgbcolor="red")
# MODEL next loop through each key k and add a blue dotted line that goes from (k, 0) to (k, ballOneFreqs[k])
for k in ballOneFreqs.keys():
    lottoPlotCounts += line([(k, 0),(k, ballOneFreqs[k])], rgbcolor="blue", linestyle=":")
show(lottoPlotCounts, figsize=[8,3])

Empirical mass function - just another statistic

What about plotting the height of a point as the relative frequency rather than the frequency? The relative frequency for a number is the count (frequency) for that number divided by the sample size, i.e., the total number of trials. This can be formalized as the following process of adding normalized indicator functions of the data points $(X_1,X_2,\ldots,X_n)$: $$\frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{X_i\}}(x)$$

In the next section on List comprehensions and tuples below we will show how the guts of the makeEMF function works. For now just evaluate the next cell and use the makeEMF so that you can just concentrate on the data for now. All makeEMF does is turn the frequencies in the above plot to relative frequences that sum to $1$ instead of the sample size $n=1114$.

Now, we plot the points based on relative frequencies. What we have made is another statistic called the empirical mass function.

In [46]:
def makeEMF(myDataList):
    '''Make an empirical mass function from a data list.
    
    Param myDataList, list of data to make emf from.
    Return list of tuples comprising (data value, relative frequency) ordered by data value.''' 

    freqs = makeFreqDict(myDataList)     # make the frequency counts mapping
    totalCounts = RR(sum(freqs.values())) # make mpfr_real to make relFreqs are fractions    
    relFreqs = [fr/totalCounts for fr in freqs.values()] # use a list comprehension
    numRelFreqPairs = zip(freqs.keys(), relFreqs) # zip the keys and relative frequencies together
    
    return numRelFreqPairs
In [47]:
numRelFreqPairs = makeEMF(listBallOne) # make a list of unique data values and their relative frequencies

lottoPlotEMF = point(numRelFreqPairs, rgbcolor = "purple")
for k in numRelFreqPairs:    # for each tuple in the list
    kkey, kheight = k     # unpack tuple
    lottoPlotEMF += line([(kkey, 0),(kkey, kheight)], rgbcolor="blue", linestyle=":")
show(lottoPlotEMF, figsize=[8,3])

Let us plot the probability mass function (PMF) of the hypothesized probability model for a fair NZ Lotto, i.e., the PMF of the RV $X$ ~ de Moivre$(1/40,1/40, ... , 1/40)$ and overlay it on the empirical mass function of the observed data. We will meet list comprehension properly in a future worksheet.

In [48]:
# list comprehension by the constant 1/40 for f(x)=1/40, x=1,2,...,40
ballOneEqualProbs = [1/40 for x in range(1,41,1)]    
In [49]:
print(ballOneEqualProbs)
[1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40, 1/40]
In [50]:
# make a list of (x,f(x)) tuples for the PDF using zip
numEqualProbsPairs = zip(ballOneFreqs.keys(), ballOneEqualProbs) 
print(numEqualProbsPairs) # print the list of tuple
[(1, 1/40), (2, 1/40), (3, 1/40), (4, 1/40), (5, 1/40), (6, 1/40), (7, 1/40), (8, 1/40), (9, 1/40), (10, 1/40), (11, 1/40), (12, 1/40), (13, 1/40), (14, 1/40), (15, 1/40), (16, 1/40), (17, 1/40), (18, 1/40), (19, 1/40), (20, 1/40), (21, 1/40), (22, 1/40), (23, 1/40), (24, 1/40), (25, 1/40), (26, 1/40), (27, 1/40), (28, 1/40), (29, 1/40), (30, 1/40), (31, 1/40), (32, 1/40), (33, 1/40), (34, 1/40), (35, 1/40), (36, 1/40), (37, 1/40), (38, 1/40), (39, 1/40), (40, 1/40)]
In [51]:
numRelFreqPairs = makeEMF(listBallOne) # make a list of unique data values and their relative frequencies

# make the EMF plot
lottoPlotEMF = point(numRelFreqPairs, rgbcolor = "purple")
for k in numRelFreqPairs:    # for each tuple in the list
    kkey, kheight = k     # unpack tuple
    lottoPlotEMF += line([(kkey, 0),(kkey, kheight)], rgbcolor="blue", linestyle=":")

ballOneEqualProbs = [1/40 for x in range(1,41,1)]    # make sure we have the equal probabilities
numEqualProbsPairs = zip(ballOneFreqs.keys(), ballOneEqualProbs) # and the pairs

# make a plot of the equal probability pairs
equiProbabledeMoivre40PMF = point(numEqualProbsPairs, rgbcolor = "green")
for e in numEqualProbsPairs:    # for each tuple in list
    ekey, eheight = e     # unpack tuple
    equiProbabledeMoivre40PMF += line([(ekey, 0),(ekey, eheight)], rgbcolor="green", linestyle=":")
show(equiProbabledeMoivre40PMF + lottoPlotEMF, figsize=[8,3]) # plot the PMF and the PDF

Empirical Distribution Function

Another extremely important statistics of the observed data is called the empirical distribution function (EDF). The EDF is the empirical or data-based distribution function (DF) just like the empirical mass function (EMF) is the empirical or data-based probability mass function (PMF). This can be formalized as the following process of adding indicator functions of the half-lines beginning at the data points $[X_1,+\infty),[X_2,+\infty),\ldots,[X_n,+\infty)$: $$\widehat{F}_n (x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{[X_i,+\infty)}(x) $$

Next, let's make a function called makeEDF to return the empirical distribution function from a list of data points.

In [52]:
def makeEDF(myDataList):
    '''Make an empirical distribution function from a data list. 
    
        Param myDataList, list of data to make emf from.
        Return list of tuples comprising (data value, cumulative relative frequency) ordered by data value.
    ''' 
    freqs = makeFreqDict(myDataList)     # make the frequency counts mapping
    totalCounts = RR(sum(freqs.values())) # make mpfr_real to make relFreqs as fractions
    relFreqs = [fr/totalCounts for fr in freqs.values()] # use a list comprehension
    relFreqsArray = np.array(relFreqs)
    cumFreqs = list(relFreqsArray.cumsum())
    numCumFreqPairs = zip(freqs.keys(), cumFreqs) # zip the keys and culm relative frequencies together
    
    return numCumFreqPairs
In [53]:
numCumFreqPairs = makeEDF(listBallOne)
lottoPlotEDF = points(numCumFreqPairs, rgbcolor = "black", faceted = true)
for k in range(len(numCumFreqPairs)):
    x, kheight = numCumFreqPairs[k]     # unpack tuple
    previous_x = 0
    previous_height = 0
    if k > 0:
        previous_x, previous_height = numCumFreqPairs[k-1] # unpack previous tuple
    lottoPlotEDF += line([(previous_x, previous_height),(x, previous_height)], rgbcolor="grey")
    lottoPlotEDF += points((x, previous_height),rgbcolor = "white", faceted = true)
    lottoPlotEDF += line([(x, previous_height),(x, kheight)], rgbcolor="blue", linestyle=":")
show(lottoPlotEDF, figsize=(8,6))

Real-world Data Ingestion

Let us download the New York Power Ball winning lottery numbers since 2010 from this US government site:

using this url as a comma-separated-variable or csv file:

This is a live fetch! So we have to figure it out on the fly as new records are added to the winning New York power ball data.

In [54]:
from urllib import * 
NYpowerball = urlopen('https://data.ny.gov/api/views/d6yy-54nr/rows.csv?accessType=DOWNLOAD').read().decode('utf-8')
In [55]:
type(NYpowerball) # this is a unicode sequence
Out[55]:
<type 'unicode'>
In [56]:
NYpowerballLines = NYpowerball.split('\n') # this splits the unicode by the end of line character '\n' in a list
In [57]:
type(NYpowerballLines) # we have a list of lines now
Out[57]:
<type 'list'>
In [58]:
NYpowerballLines[0:10] #  we can see the first 10 lines now and the first 0-index row is the header line
Out[58]:
[u'Draw Date,Winning Numbers,Multiplier',
 u'02/03/2010,17 22 36 37 52 24,2',
 u'02/06/2010,14 22 52 54 59 04,3',
 u'02/10/2010,05 08 29 37 38 34,5',
 u'02/13/2010,10 14 30 40 51 01,4',
 u'02/17/2010,07 08 19 26 36 15,3',
 u'02/20/2010,13 27 37 41 54 32,2',
 u'02/24/2010,04 17 35 50 57 12,2',
 u'02/27/2010,18 47 51 53 58 30,2',
 u'03/03/2010,07 09 14 45 49 23,4']
In [59]:
len(NYpowerballLines) # looks like there are 866 items (lines) in the list 
Out[59]:
866
In [61]:
NYpowerballLines[1] # this grabs the first row of data - the winning numbers from Feb 03 2010
Out[61]:
u'02/03/2010,17 22 36 37 52 24,2'
In [62]:
NYpowerballLines[1].split(',') # this further split the row by comma character ',' into another list
Out[62]:
[u'02/03/2010', u'17 22 36 37 52 24', u'2']
In [63]:
NYpowerballLines[1].split(',')[1] # this grabs the second element of the list with index 1
Out[63]:
u'17 22 36 37 52 24'
In [64]:
NYpowerballLines[1].split(',')[1].split(' ') # this splits the 6 integers by white space ' '
Out[64]:
[u'17', u'22', u'36', u'37', u'52', u'24']
In [65]:
NYpowerballLines[1].split(',')[1].split(' ')[0] # this grabs the first element of this list with 0-index
Out[65]:
u'17'
In [66]:
int(NYpowerballLines[1].split(',')[1].split(' ')[0]) # now we can turn this string into a Python int
Out[66]:
17

That was not too hard. But what we really want is the first winning number across all the draws.

We can simply apply the expression in the last cell and loop it through each item say line in NYpowerballLines:

  • int(line[1].split(',')[1].split(' ')[0])

So let's do it.

In [67]:
firstWins = [] # initialize an empty list to append to it each winning first number
for line in NYpowerballLines:
    win1 = int(line.split(',')[1].split(' ')[0])
    firstWins.append(win1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-67-dd05c5dfbb9f> in <module>()
      1 firstWins = [] # initialize an empty list to append to it each winning first number
      2 for line in NYpowerballLines:
----> 3     win1 = int(line.split(',')[Integer(1)].split(' ')[Integer(0)])
      4     firstWins.append(win1)

ValueError: invalid literal for int() with base 10: 'Winning'

Opps, we got a ValueError: invalid literal for int() with base 10: 'Winning'.

In [68]:
NYpowerballLines[0]   # remember our header line at index 0 has 'Winning' a string ad this can't be cast to int
Out[68]:
u'Draw Date,Winning Numbers,Multiplier'

We need to handle such error carefully, as data downloaded from public sources will generally come woth such caveats and more complicated issues in general.

A proper way to handle such errors safely is called 'Exception Handling' using try and catch. Briefly from the docs:

The try statement works as follows.

  • First, the try clause (the statement(s) between the try and except keywords) is executed.
  • If no exception occurs, the except clause is skipped and execution of the try statement is finished.
  • If an exception occurs during execution of the try clause, the rest of the clause is skipped. Then if its type matches the exception named after the except keyword, the except clause is executed, and then execution continues after the try statement.
  • If an exception occurs which does not match the exception named in the except clause, it is passed on to outer try statements; if no handler is found, it is an unhandled exception and execution stops with a message as shown above.
In [69]:
# let's be cautious and process only the first 10 lines
firstWins = [] # initialize an empty list to append to it each winning first number
for line in NYpowerballLines[0:10]:
    try:
        win1 = int(line.split(',')[1].split(' ')[0])
        firstWins.append(win1)
    except ValueError:
        print "Err.. numbers only - input was: ", line
Err.. numbers only - input was:  Draw Date,Winning Numbers,Multiplier
In [70]:
firstWins # looks like we gracefully converted the first winning numbers for lines 2-10 indexed 1-9
Out[70]:
[17, 14, 5, 10, 7, 13, 4, 18, 7]
In [71]:
# now let's bravely process all the lines
firstWins = [] # initialize an empty list to append to it each winning first number
for line in NYpowerballLines:
    try:
        win1 = int(line.split(',')[1].split(' ')[0])
        firstWins.append(win1)
    except ValueError:
        print "Err.. numbers only - line was: ", line       
Err.. numbers only - line was:  Draw Date,Winning Numbers,Multiplier
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-71-50f11f62e63f> in <module>()
      3 for line in NYpowerballLines:
      4     try:
----> 5         win1 = int(line.split(',')[Integer(1)].split(' ')[Integer(0)])
      6         firstWins.append(win1)
      7     except ValueError:

IndexError: list index out of range

Now we have a new error IndexError: list index out of range. But let's see how many numbers have been appended in our list firstWins.

In [72]:
len(firstWins)
Out[72]:
864
In [73]:
NYpowerballLines[864] # this grabs the second last row of data - the winning numbers from May 12 2018
Out[73]:
u'05/16/2018,17 19 21 22 51 19,2'
In [74]:
len(NYpowerballLines) # but there is another line at index 865
Out[74]:
866
In [75]:
NYpowerballLines[865] # contents of the last line looks empty! this is our problem
Out[75]:
u''

So, we need to handle this new exception.

From the docs we note that:

A try statement may have more than one except clause, to specify handlers for different exceptions. At most one handler will be executed....

Let's have an except clause for IndexError next.

In [76]:
# now let's bravely process all the lines
firstWins = [] # initialize an empty list to append to it each winning first number
for line in NYpowerballLines:
    try:
        win1 = int(line.split(',')[1].split(' ')[0])
        firstWins.append(win1)
    except ValueError:
        print "Err.. numbers only - line was: ", line
    except IndexError:
        print "list length had IndexError - line was: ", line
Err.. numbers only - line was:  Draw Date,Winning Numbers,Multiplier
list length had IndexError - line was:  
In [77]:
len(firstWins) # now we got an exception-handled data processing done!
Out[77]:
864
In [78]:
print firstWins[0:10], "...", firstWins[840:868] # the first and last 10 balls in our list
[17, 14, 5, 10, 7, 13, 4, 18, 7, 10] ... [24, 12, 13, 6, 43, 6, 22, 3, 10, 6, 8, 8, 2, 16, 17, 9, 40, 17, 20, 5, 14, 11, 22, 17]

Congratulations! You have taken your first step towards being a data scientist!

Data scientists are expected to directly work with data and not merely load data from pre-made libraries.

More crucially, mathematical statistical data scientists are expected to be comfortable with axiomatic probability and mathematical statistical theory (we will learn more in the sequel).

Finally, computational mathematical statistical data scientists, students being trained here, can apply (and even develop as needed in cooperation with domain experts) the right model, methods and theory for a given data science problem and solve it using the most appropriate computational tools.

Arrays

YouTrys!

We have already talked about lists in SAGE. Lists are great but are also general: lists are designed to be able to cope with any sort of data but that also means they don't have some of the specific functionality we might like to be able to analyse numerical data sets. Arrays provide a way of representing and manipulating numerical data known an a matrix in maths lingo. Matrices are particularly useful for data science.

The lists we have been dealing with were one-dimensional. An array and the matrix it represents in a computer can be multidimensional. The most common forms of array that we will meet are one-dimensional and two-dimensional. You can imagine a two-dimensional array as a grid, which has rows and columns. Take for instance a 3-by-6 array (i.e., 3 rows and 6 columns).

0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17

The array type is not available in the basic SageMath package, so we import a library package called numpy which includes arrays.

In [79]:
import numpy as np

# this is just to print
print 'numpy is imported'
numpy is imported

Because the arrays are part of numpy, not the basic SageMath package, we don't just say array, but instead qualify array with the name of the module it is in. So, we say np.array since we did import numpy as np.

The cell below makes a two-dimensional 3-by-6 (3 rows by 6 columns) array by specifying the rows, and each element in each row, individually.

In [80]:
array1 = np.array([[0,1,2,3,4,5],[6,7,8,9,10,11],[12,13,14,15,16,17]]) # make an array the hard way
array1
Out[80]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17]])

We can use the array's shape method to find out its shape. The shape method for a two-dimensional array returns an ordered pair in the form (rows,columns). In this case it tells us that we have 3 rows and 6 columns.

In [81]:
array1.shape
Out[81]:
(3, 6)

Another way to make an array is to start with a one-dimensional array and resize it to give the shape we want, as we do in the two cells below.

We start by making a one-dimensional array using the range function we have met earlier.

In [82]:
array2 = np.array([int(i) for i in range(18)])     # make an one dimensional array of Python int's
array2
Out[82]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17])

The array's resize allows us to specify a new shape for the same array. We are going to make a two-dimensional array with 3 rows and 6 columns. Note that the shape you specify must be compatible with the number of elements in the array.

In this case, array2 has 18 elements and we specify a new shape of 3 rows and 6 columns, which still makes 18 = 3*6 elements.

If the new shape you specify won't work with the array, you'll get an error message.

In [83]:
array2.resize(3,6)      # use the array's resize method to change the shape of the array 18=3*6
array2
Out[83]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17]])

If you check the type for array2, you might be surprised to find that it is a numpy.ndarray. What happened to numpy.array? The array we are using is the NumPy N-dimensional array, or numpy.ndarray.

In [84]:
type(array2)
Out[84]:
<type 'numpy.ndarray'>

Tensors and general multidimensional arrays are possible via resize method.

In [85]:
array2.resize(3,2,3)      # use the array's resize method to change the shape of the array 18=3*2*3
array2
Out[85]:
array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]],

       [[12, 13, 14],
        [15, 16, 17]]])

The range function will only give us ranges of integers. The numpy module includes a more flexible function arange (that's arange with one r -- think of it as something like a-for-array-range -- not 'arrange' with two r's). The arange function takes parameters specifying the start_argument, stop_argument and step_argument values (similar to range), but is not restricted to integers and returns a one-dimensional array rather than a list.

Here we use arange to make an array of numbers from 0.0 (start_argument) going up in steps of 0.1 (step_argument) to 0.18 - 0.1 and just as with range, the last number we get in the array is stop_argument - step_argument = 0.18 - 0.01 = 0.17.

In [86]:
array3 = np.arange(0.0,0.18,0.01)   # the pylab.arange(start, stop, step)
array3
Out[86]:
array([ 0.  ,  0.01,  0.02,  0.03,  0.04,  0.05,  0.06,  0.07,  0.08,
        0.09,  0.1 ,  0.11,  0.12,  0.13,  0.14,  0.15,  0.16,  0.17])
In [87]:
type(array3) # type is actually a numpy.ndarray
Out[87]:
<type 'numpy.ndarray'>
In [88]:
array3.shape # the shape is 1 dimensional
Out[88]:
(18,)

Resize and Reshape

The resize method resizes the array it is applied to.

The reshape method will leave the original array unchanged but return a new array, based on the original one, of the required new shape.

In [89]:
array3.resize(9,2)                     # which we can resize into a 9 by 2 array
array3
Out[89]:
array([[ 0.  ,  0.01],
       [ 0.02,  0.03],
       [ 0.04,  0.05],
       [ 0.06,  0.07],
       [ 0.08,  0.09],
       [ 0.1 ,  0.11],
       [ 0.12,  0.13],
       [ 0.14,  0.15],
       [ 0.16,  0.17]])
In [90]:
array3.shape                            # try to see the shape of array3 now
Out[90]:
(9, 2)
In [91]:
array4 = array3.reshape(6,3)     # reshape makes a new array of the specified size 6 by 3
array4
Out[91]:
array([[ 0.  ,  0.01,  0.02],
       [ 0.03,  0.04,  0.05],
       [ 0.06,  0.07,  0.08],
       [ 0.09,  0.1 ,  0.11],
       [ 0.12,  0.13,  0.14],
       [ 0.15,  0.16,  0.17]])
In [92]:
array4.shape                            # try to see the shape of array4 now
Out[92]:
(6, 3)
In [93]:
array3.shape                            # try to see the shape of array3 now, i.e. after it was reshaped into array4
Out[93]:
(9, 2)

So the reshape does leave the original array unchanged.

Arrays: indexing, slicing and copying

Remember indexing into lists to find the elements at particular positions? We can do this with arrays, but we need to specify the index we want for each dimension of the array. For our two-dimensional arrays, we need to specify the row and column index, i.e. use a format like [row_index,column_index]. For exampe, [0,0] gives us the element in the first column of the first row (as with lists, array indices start from 0 for the first element).

In [94]:
array4 = np.arange(0.0, 0.18, 0.01)   # make sure we have array4
array4.resize(6,3)     

array4
Out[94]:
array([[ 0.  ,  0.01,  0.02],
       [ 0.03,  0.04,  0.05],
       [ 0.06,  0.07,  0.08],
       [ 0.09,  0.1 ,  0.11],
       [ 0.12,  0.13,  0.14],
       [ 0.15,  0.16,  0.17]])

array4[0,0] gives the element in the first row and first column while array4[5,2] gives us the element in the third column (index 2) of the sixth row (index 5).

In [95]:
array4[0,0]
Out[95]:
0.0
In [98]:
array4[5,2]
Out[98]:
0.17000000000000001

We can use the colon to specify a range of columns or rows. For example, [0:4,0] gives us the elements in the first column (index 0) of rows with indices from 0 through 3. Note that the row index range 0:4 gives rows with indices starting from index 0 and ending at index 3=4-1, i.e., indices 0 through 3.

In [99]:
array4[0:4,0]
Out[99]:
array([ 0.  ,  0.03,  0.06,  0.09])

Similarly we could get all the elements in the second column (column index 1) of rows with indices 2 through 4.

In [100]:
array4[2:5,1]
Out[100]:
array([ 0.07,  0.1 ,  0.13])

The colon on its own gives everything, so a column index of : gives all the columns. Thus we can get, for example, all elements of a particular row -- in this case, the third row.

In [101]:
array4[2,:]
Out[101]:
array([ 0.06,  0.07,  0.08])

Or all the elements of a specified column, in this case the first column.

In [102]:
array4[:,0]
Out[102]:
array([ 0.  ,  0.03,  0.06,  0.09,  0.12,  0.15])

Or all the elements of a range of rows (think of this as like slicing the array horizontally to obtain row indices 1 through 4=5-1).

In [103]:
array4[1:5,:]
Out[103]:
array([[ 0.03,  0.04,  0.05],
       [ 0.06,  0.07,  0.08],
       [ 0.09,  0.1 ,  0.11],
       [ 0.12,  0.13,  0.14]])

Or all the elements of a range of columns (think of this as slicing the array vertically to obtain column indices 1 through 2).

In [104]:
array4[:,1:3]
Out[104]:
array([[ 0.01,  0.02],
       [ 0.04,  0.05],
       [ 0.07,  0.08],
       [ 0.1 ,  0.11],
       [ 0.13,  0.14],
       [ 0.16,  0.17]])

Naturally, we can slice both horizontally and vertically to obtain row indices 2 through 4 and column indices 0 through 1, as follows:

In [105]:
array4[2:5,0:2]     
Out[105]:
array([[ 0.06,  0.07],
       [ 0.09,  0.1 ],
       [ 0.12,  0.13]])

Finally, [:] gives a copy of the whole array. This copy of the original array can be assigned to a new name for furter manipulation without affecting the original array.

In [111]:
CopyOfArray4 = array4[:]         # assign a copy of array4 to the new array named CopyOfArray4
CopyOfArray4                     # disclose CopyOfArray4
Out[111]:
array([[ 0.  ,  0.01,  0.02],
       [ 0.03,  0.04,  0.05],
       [ 0.06,  0.07,  0.08],
       [ 0.09,  0.1 ,  0.11],
       [ 0.12,  0.13,  0.14],
       [ 0.15,  0.16,  0.17]])
In [107]:
CopyOfArray4.resize(9,2) # resize CopyOfArray4 from a 6-by-3 array to a 9-by-2 array
CopyOfArray4             # disclose CopyOfArray4 as it is now
Out[107]:
array([[ 0.  ,  0.01],
       [ 0.02,  0.03],
       [ 0.04,  0.05],
       [ 0.06,  0.07],
       [ 0.08,  0.09],
       [ 0.1 ,  0.11],
       [ 0.12,  0.13],
       [ 0.14,  0.15],
       [ 0.16,  0.17]])
In [108]:
array4             # note that our original array4 has remained unchanged with size 6-by-3
Out[108]:
array([[ 0.  ,  0.01,  0.02],
       [ 0.03,  0.04,  0.05],
       [ 0.06,  0.07,  0.08],
       [ 0.09,  0.1 ,  0.11],
       [ 0.12,  0.13,  0.14],
       [ 0.15,  0.16,  0.17]])

Useful arrays

NumPy provides quick ways of making some useful kinds of arrays. Some of these are shown in the cells below.

An array of zeros or ones of a particular shape. Here we ask for shape (2,3), i.e., 2 rows and 3 columns.

In [109]:
arrayOfZeros = np.zeros((2,3)) # get a 2-by-3 array of zeros
arrayOfZeros
Out[109]:
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
In [110]:
arrayOfOnes = np.ones((2,3)) # get a 2-by-3 array of ones
arrayOfOnes
Out[110]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

An array for the identity matrix, i.e., square (same number of elements on each dimension) matrix with 1's along the diagonal and 0's everywhere else.

In [112]:
iden = np.identity(3) # get a 3-by-3 identity matrix with 1's along the diagonal and 0's elsewhere
iden
Out[112]:
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

Useful functions for sequences

Now we are going to demonstrate some useful methods of sequences. A list is a sequence, and so is a tuple.

There are differences between them (recall that tuples are immutable) but in many cases we can use the same methods on them because they are both sequences.

We will demonstrate with a list and a tuple containing the same data values. The data could be the results of three IID Bernoulli trials..

In [113]:
obsDataList = [0, 1, 1]
obsDataList
Out[113]:
[0, 1, 1]
In [114]:
obsDataTuple = tuple(obsDataList)
obsDataTuple
Out[114]:
(0, 1, 1)

A useful operation we can perform on a tuple is to count the number of times a particular element, say 0 or 1 in our ObsDataTuple, occurs. This is a statistic of our data called the sample frequency of the element of interest.

In [115]:
obsDataTuple.count(0)        # frequency of 0 using the tuple
Out[115]:
1
In [116]:
obsDataList.count(1)        # frequency of 1 using the list
Out[116]:
2

Another useful operation we can perform on a tuple is to sum all elements in our obsDataTuple or further scale the sum by the sample size. These are also statistics of our data called the sample sum and the sample mean, respectively. Try the same things with obsDataList.

In [117]:
sum(obsDataTuple)            # sample sum
Out[117]:
2
In [118]:
sum(obsDataTuple)/3          # sample mean using sage.rings.Rational
Out[118]:
2/3
In [119]:
obsDataTuple.count(1) / 3    # alternative expression for sample mean in IID Bernoulli trials
Out[119]:
2/3

We used a lot of the techniques we have seen so far when we wanted to get the relative frequency associated with each ball in the lotto data.

Let's revist the code that made relative frequencies and fully understand each part of it now.

In [120]:
ballOneFreqs = makeFreqDict(listBallOne)   # call the function to make the dictionary
totalCounts = sum(ballOneFreqs.values())
relFreqs = []
for k in ballOneFreqs.keys():
    relFreqs.append(k/totalCounts)
numRelFreqPairs = zip(ballOneFreqs.keys(), relFreqs) # zip the keys and relative frequencies together
print(numRelFreqPairs)
[(1, 1/1114), (2, 1/557), (3, 3/1114), (4, 2/557), (5, 5/1114), (6, 3/557), (7, 7/1114), (8, 4/557), (9, 9/1114), (10, 5/557), (11, 11/1114), (12, 6/557), (13, 13/1114), (14, 7/557), (15, 15/1114), (16, 8/557), (17, 17/1114), (18, 9/557), (19, 19/1114), (20, 10/557), (21, 21/1114), (22, 11/557), (23, 23/1114), (24, 12/557), (25, 25/1114), (26, 13/557), (27, 27/1114), (28, 14/557), (29, 29/1114), (30, 15/557), (31, 31/1114), (32, 16/557), (33, 33/1114), (34, 17/557), (35, 35/1114), (36, 18/557), (37, 37/1114), (38, 19/557), (39, 39/1114), (40, 20/557)]

We could also do it with a list comprehension as follows:

In [121]:
ballOneFreqs = makeFreqDict(listBallOne)   # call the function to make the dictionary
totalCounts = sum(ballOneFreqs.values())
relFreqs = [k/totalCounts for k in ballOneFreqs.keys()]
numRelFreqPairs = zip(ballOneFreqs.keys(), relFreqs) # zip the keys and relative frequencies together
print(numRelFreqPairs)
[(1, 1/1114), (2, 1/557), (3, 3/1114), (4, 2/557), (5, 5/1114), (6, 3/557), (7, 7/1114), (8, 4/557), (9, 9/1114), (10, 5/557), (11, 11/1114), (12, 6/557), (13, 13/1114), (14, 7/557), (15, 15/1114), (16, 8/557), (17, 17/1114), (18, 9/557), (19, 19/1114), (20, 10/557), (21, 21/1114), (22, 11/557), (23, 23/1114), (24, 12/557), (25, 25/1114), (26, 13/557), (27, 27/1114), (28, 14/557), (29, 29/1114), (30, 15/557), (31, 31/1114), (32, 16/557), (33, 33/1114), (34, 17/557), (35, 35/1114), (36, 18/557), (37, 37/1114), (38, 19/557), (39, 39/1114), (40, 20/557)]

More on tuples and sequences in general

First, we can make an empty tuple like this:

In [122]:
empty = ()
len(empty)
Out[122]:
0

Secondly, if we want a tuple with only one element we have to use a trailing comma. If you think about it from the computer's point of view, if it can't see the trailing comma, how is it to tell that you want a tuple not just a single number?

In [123]:
aFineSingletonTuple = (1, )
type(aFineSingletonTuple)
Out[123]:
<type 'tuple'>
In [124]:
unFineSingletonTuple = (1)
type(unFineSingletonTuple)
Out[124]:
<type 'sage.rings.integer.Integer'>
In [125]:
unFineSingletonTuple = ((1))
type(unFineSingletonTuple)
Out[125]:
<type 'sage.rings.integer.Integer'>
In [126]:
aFineSingletonTuple = ('lonelyString', )
type(aFineSingletonTuple)
Out[126]:
<type 'tuple'>
In [127]:
unFineSingletonTuple = ('lonelyString')
type(unFineSingletonTuple)
Out[127]:
<type 'str'>

You can make tuples out almost any object. It's as easy as putting the things you want in ( ) parentheses and separating them by commas.

In [128]:
myTupleOfStuff = (1, ZZ(1), QQ(1), RR(1), int(1), long(1), float(1))
myTupleOfStuff
Out[128]:
(1, 1, 1, 1.00000000000000, 1, 1L, 1.0)
In [129]:
1==ZZ(1) # remember? will this be True/False?
Out[129]:
True

Actually, you don't even need the ( ). If you present SageMath (Python) with a sequence of object separated by commas, the default result is a tuple.

In [130]:
myTuple2 = 60, 400    # assign a tuple to the variable named myTuple2
type(myTuple2)
Out[130]:
<type 'tuple'>

A statement like the one in the cell above is known as tuple packing (more generally, sequence packing).

When we work with tuples we also often use the opposite of packing - Python's very useful sequence unpacking capability. This is, literally, taking a sequence and unpacking or extracting its elements.

In [131]:
x, y = myTuple2     # tuple unpacking
print x
print y
x * y
60
400
Out[131]:
24000

The statement below is an example of multiple assignment, which you can see now is really just a combination of tuple packing followed by unpacking.

In [132]:
x, y = 600.0, 4000.0
print x
print y
x/y
600.000000000000
4000.00000000000
Out[132]:
0.150000000000000

Let's try to implement a for loop with a tuple -- it will work just like the for loop with a list.

In [133]:
for x in (1,2,3):
    print x^2
1
4
9

Next let's try a list comprehension on the tuple. This creates a list as expected.

In [134]:
[x^2 for x in (1,2,3)]
Out[134]:
[1, 4, 9]

But if we try the same comprehension with tuples, i.e., with () insteqad of [] we get a generator object. This will be covered next by first easing into functional programming in Python.

In [135]:
(x^2 for x in (1,2,3))
Out[135]:
<generator object <genexpr> at 0x7f394e863500>

Functional Programming in Python

Let us take at the basics of functional programming and what Python has to offer on this front at:

In [2]:
showURL("https://docs.python.org/2/howto/functional.html#functional-programming-howto",600)
Out[2]:

Iterators

From https://docs.python.org/2/howto/functional.html#iterators:

An iterator is an object representing a stream of data; this object returns the data one element at a time. A Python iterator must support a method called next() that takes no arguments and always returns the next element of the stream. If there are no more elements in the stream, next() must raise the StopIteration exception. Iterators don’t have to be finite, though; it’s perfectly reasonable to write an iterator that produces an infinite stream of data.

The built-in iter() function takes an arbitrary object and tries to return an iterator that will return the object’s contents or elements, raising TypeError if the object doesn’t support iteration. Several of Python’s built-in data types support iteration, the most common being lists and dictionaries. An object is called an iterable object if you can get an iterator for it.

You can experiment with the iteration interface manually:

In [137]:
L = [1,2,3]
it = iter(L)
it
Out[137]:
<listiterator object at 0x7f394e89f4d0>
In [138]:
type(it)
Out[138]:
<type 'listiterator'>
In [139]:
it.next()
Out[139]:
1
In [140]:
it.next()
Out[140]:
2
In [141]:
it.next()
Out[141]:
3
In [142]:
it.next()
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-142-54f0920595b2> in <module>()
----> 1 it.next()

StopIteration: 

Iterators can be materialized as lists or tuples by using the list() or tuple() constructor functions:

In [143]:
L = [1,2,3]
it = iter(L)
t = tuple(it)
t
Out[143]:
(1, 2, 3)
In [144]:
L = [1,2,3]
it = iter(L)
l = list(it)
l
Out[144]:
[1, 2, 3]
In [147]:
it = iter(L)
max(it)  # you can call functions on iterators
Out[147]:
3

Generators

"Naive Tuple Comprehension" creates a generator in SageMath/Python. From https://docs.python.org/2/glossary.html#term-generator:

generator A function which returns an iterator. It looks like a normal function except that it contains yield statements for producing a series of values usable in a for-loop or that can be retrieved one at a time with the next() function. Each yield temporarily suspends processing, remembering the location execution state (including local variables and pending try-statements). When the generator resumes, it picks-up where it left-off (in contrast to functions which start fresh on every invocation).

generator expression An expression that returns an iterator. It looks like a normal expression followed by a for expression defining a loop variable, range, and an optional if expression. The combined expression generates values for an enclosing function:

List comprehensions (listcomps) and Generator Expressions (genexps) from

With a list comprehension, you get back a Python list; stripped_list is a list containing the resulting lines, not an iterator. Generator expressions return an iterator that computes the values as necessary, not needing to materialize all the values at once. This means that list comprehensions aren’t useful if you’re working with iterators that return an infinite stream or a very large amount of data. Generator expressions are preferable in these situations. Generator expressions are surrounded by parentheses (“()”) and list comprehensions are surrounded by square brackets (“[]”). Generator expressions have the form:

( expression for expr in sequence1
             if condition1
             for expr2 in sequence2
             if condition2
             for expr3 in sequence3 ...
             if condition3
             for exprN in sequenceN
             if conditionN )

Two common operations on an iterator’s output are 1) performing some operation for every element, 2) selecting a subset of elements that meet some condition. For example, given a list of strings, you might want to strip off trailing whitespace from each line or extract all the strings containing a given substring.

List comprehensions and generator expressions (short form: “listcomps” and “genexps”) are a concise notation for such operations, borrowed from the functional programming language Haskell (https://www.haskell.org/). You can strip all the whitespace from a stream of strings with the following code:

In [148]:
line_list = ['  line 1\n', 'line 2  \n', 'line 3   \n']

# List comprehension -- returns list
stripped_list = [line.strip() for line in line_list]

# Generator expression -- returns a generator
stripped_iter = (line.strip() for line in line_list)
In [149]:
stripped_list # returns a list
Out[149]:
['line 1', 'line 2', 'line 3']
In [150]:
type(stripped_iter) # returns a generator
Out[150]:
<type 'generator'>
In [151]:
stripped_iter # returns only an genexpr generator object
Out[151]:
<generator object <genexpr> at 0x7f394e863140>
In [152]:
list(stripped_iter) # the generator can be materialized into a list 
Out[152]:
['line 1', 'line 2', 'line 3']
In [153]:
tuple(stripped_iter) # the generator can be materialized into a tuple - it's already been emptied! 
Out[153]:
()
In [154]:
# Generator expression -- returns a generator
stripped_iter = (line.strip() for line in line_list) # let's create a new generator
tuple(stripped_iter) # the generator can be materialized into a tuple now
Out[154]:
('line 1', 'line 2', 'line 3')

Once again we have emptied the stripped_iter so we only get an empty list when we materilize it into a list.

In [155]:
list(stripped_iter) 
Out[155]:
[]
In [3]:
showURL("https://docs.python.org/2/howto/functional.html#generator-expressions-and-list-comprehensions",500)
Out[3]:
In [157]:
myGenerator = (x^2 for x in (1,2,3)) # this is actually a generator
In [158]:
type(myGenerator)
Out[158]:
<type 'generator'>
In [159]:
myGenerator.next()
Out[159]:
1
In [160]:
myGenerator.next()
Out[160]:
4
In [161]:
myGenerator.next()
Out[161]:
9
In [162]:
myGenerator.next()
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-162-14b7fa77bd58> in <module>()
----> 1 myGenerator.next()

StopIteration: 

We can pass a generator to a function like sum to evaluate it:

In [163]:
myGenerator = (i*i for i in range(10)) 
In [164]:
sum(myGenerator)
Out[164]:
285

See https://docs.python.org/2/howto/functional.html#generators for more details:

Generators are a special class of functions that simplify the task of writing iterators. Regular functions compute a value and return it, but generators return an iterator that returns a stream of values.

You’re doubtless familiar with how regular function calls work in Python or C. When you call a function, it gets a private namespace where its local variables are created. When the function reaches a return statement, the local variables are destroyed and the value is returned to the caller. A later call to the same function creates a new private namespace and a fresh set of local variables. But, what if the local variables weren’t thrown away on exiting a function? What if you could later resume the function where it left off? This is what generators provide; they can be thought of as resumable functions.

Here’s the simplest example of a generator function:

In [165]:
def generate_ints(N):
    for i in range(N):
        yield i

Any function containing a yield keyword is a generator function; this is detected by Python’s bytecode compiler which compiles the function specially as a result.

When you call a generator function, it doesn’t return a single value; instead it returns a generator object that supports the iterator protocol. On executing the yield expression, the generator outputs the value of i, similar to a return statement. The big difference between yield and a return statement is that on reaching a yield the generator’s state of execution is suspended and local variables are preserved. On the next call to the generator’s .next() method, the function will resume executing.

Here’s a sample usage of the generate_ints() generator:

In [166]:
gen = generate_ints(3)
gen
Out[166]:
<generator object generate_ints at 0x7f394e8c0d20>
In [167]:
gen.next()
Out[167]:
0
In [168]:
gen.next()
Out[168]:
1
In [169]:
gen.next()
Out[169]:
2
In [170]:
gen.next()
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-170-b2c61ce5e131> in <module>()
----> 1 gen.next()

StopIteration: 
In [212]:
gen. # see the methods available on gen by hitting TAB after the 'gen.' try to `close()` the generator

Using list comprehensions and generator expressions one can do functional programming in Python. But this is not the same as doing pure functional programming in a language such as haskell (see https://www.haskell.org/) that is designed specifically for pure funcitonal programming.

Keep learning

We have taken several exercises above from http://docs.python.org/tutorial/datastructures.html.

This is a very useful site to look for user-friendly information about Python in general (it can't help you with anything that Sage overlays on top of Python though, or with many of the specialised modules that you may want to import into your Sage notebook).

A Chalenging YouTry!

Recall how we downloaded Pride and Prejudice and processed it as a String and break it by Chapters. This is at our disposal - all we need to do is copy-paste the right set of cells from earlier here to have the string from that Book by fetching live from the Guthenberg Project again.

Think about what algorithmic constructs and methods one will need to split each sentence by words it contains and count the number of each distinct word.

Now that you have understood for loops, list comprehensions and anonymous functions, and can learn about the needed methods on strings for splitting (which you can search by adding a . after a srt and hitting the Tab button to look through exixting methods), the dictionary data structure, and already seen how to count the number of ball labels, you are ready for the challenge.

Process the English words in the book Pride and Prejudice by Jane Austin and obtain the empirical mass function of the 50 most frequent words. You may report it as a dictionary. If you have more time them try to plot the empirical mass function (at least for the top 20 words) as a picture/image as we did above.

In [ ]:
 
In [ ]:
 

A YouTryIfYouWantTo on a Really Interesting Dataset

Swedish Election Outcomes 2018

See: http://www.lamastex.org/datasets/public/elections/2018/sv/README!

It should say something like the following:

This was obtained by processing using the scripts at:

Download and extract using tar:

$ wget http://www.lamastex.org/datasets/public/elections/2018/sv/final.tgz
$ tar zxvf final.tgz

Now you have all of the data in the csv file:

$ head final.csv
region,municipality,district,party,votes
Blekinge län,Karlshamn,0 - Centrala Asarum,S,519
Blekinge län,Karlshamn,0 - Centrala Asarum,SD,311
Blekinge län,Karlshamn,0 - Centrala Asarum,M,162
Blekinge län,Karlshamn,0 - Centrala Asarum,V,82
Blekinge län,Karlshamn,0 - Centrala Asarum,KD,53
Blekinge län,Karlshamn,0 - Centrala Asarum,C,37
Blekinge län,Karlshamn,0 - Centrala Asarum,L,37
Blekinge län,Karlshamn,0 - Centrala Asarum,MP,32
Blekinge län,Karlshamn,0 - Centrala Asarum,BLANK,13

Flex your Pythonic Muscles and create any interesting statistics of the dataset!

This is just for fun and is a YouTryIfYouWantTo.

In [ ]: