06. Statistics and List Comprehensions with New Zealand Earthquakes¶

Inference Theory 1 ¶

Live Data-fetch of NZ EQ Data
More on Statistics
Sample Mean
Sample Variance
Order Statistics
Frequencies
Empirical Mass Function
Empirical Distribution Function
List Comprehensions
New Zealand Earthquakes
Functions Revision

Live Data-fetching Exercise Now¶

Go to https://quakesearch.geonet.org.nz/ and download data on NZ earthquakes.

In my attempt above to zoom out to include both islands of New Zealand (NZ) and get one year of data using the Last Year button choice from this site:

https://quakesearch.geonet.org.nz/ and hitting Search box gave the following URLs for downloading data. I used the DOWNLOAD button to get my own data in Outpur Format CSV as chosen earlier.

https://quakesearch.geonet.org.nz/csv?bbox=163.52051,-49.23912,182.19727,-32.36140&startdate=2017-06-01&enddate=2018-05-17T14:00:00 https://quakesearch.geonet.org.nz/csv?bbox=163.52051,-49.23912,182.19727,-32.36140&startdate=2017-5-17T13:00:00&enddate=2017-06-01

What should you do now?¶

Try to DOWNLOAD your own CSV data and store it in a file named my_earthquakes.csv (NOTE: rename the file when you download so you don't replace the file earthquakes.csv!) inside the folder named data that is inside the same directory that this notebook is in.

%%sh
# print working directory
pwd

/home/raazesh/all/git/scalable-data-science/_infty/2018/01/jp

%%sh
ls # list contents of working directory

00.html
00.ipynb
00.md
01.html
01.ipynb
01.md
02.html
02.ipynb
02.md
03.html
03.ipynb
03.md
04.html
04.ipynb
04.md
05.ipynb
05.md
06.ipynb
06.md
07.ipynb
07.md
08.ipynb
08.md
09.ipynb
09.md
10.ipynb
10.md
11.ipynb
11.md
12.ipynb
12.md
data
images
mscFounds
mydir

%%sh
# after download you should have the following file in directory named data
ls data

earthquakes.csv
my_earthquakes.csv
rainfallInChristchurch.csv

%%sh  
# first three lines
head -3 data/earthquakes.csv

publicid,eventtype,origintime,modificationtime,longitude, latitude, magnitude, depth,magnitudetype,depthtype,evaluationmethod,evaluationstatus,evaluationmode,earthmodel,usedphasecount,usedstationcount,magnitudestationcount,minimumdistance,azimuthalgap,originerror,magnitudeuncertainty
2018p368955,,2018-05-17T12:19:35.516Z,2018-05-17T12:21:54.953Z,178.4653957,-37.51944533,2.209351541,20.9375,M,,NonLinLoc,,automatic,nz3drx,12,12,6,0.1363924727,261.0977462,0.8209633086,0
2018p368878,,2018-05-17T11:38:24.646Z,2018-05-17T11:40:26.254Z,177.8775115,-37.46115663,2.155154561,58.4375,M,,NonLinLoc,,automatic,nz3drx,11,11,7,0.3083220739,232.7487132,0.842884174,0

%%sh 
# last three lines
tail -3 data/earthquakes.csv

2017p408155,earthquake,2017-06-01T00:25:06.491Z,2017-06-19T03:05:19.468Z,172.4586182,-42.75725555,2.265096095,11.19184685,M,,LOCSAT,confirmed,manual,iasp91,26,24,13,0.1402618885,61.97485352,0.6915929023,0
2017p408137,earthquake,2017-06-01T00:15:55.074Z,2017-06-19T03:02:21.311Z,176.7870483,-37.73429108,2.276142644,3.937263489,M,,LOCSAT,confirmed,manual,iasp91,28,23,15,0.155876264,96.37628555,0.5852246521,0
2017p408120,earthquake,2017-06-01T00:07:04.890Z,2017-06-01T07:20:23.994Z,175.4930025,-39.31558765,1.298107247,13.5546875,M,,NonLinLoc,confirmed,manual,nz3drx,28,19,13,0.04550182409,86.69529793,0.2189521352,0

%%sh  
# number of lines in the file; menmonic from `man wc` is wc = word-count option=-l is for lines
wc -l  data/earthquakes.csv

21017 data/earthquakes.csv

%%sh
man wc

WC(1)                            User Commands                           WC(1)

NAME
       wc - print newline, word, and byte counts for each file

SYNOPSIS
       wc [OPTION]... [FILE]...
       wc [OPTION]... --files0-from=F

DESCRIPTION
       Print newline, word, and byte counts for each FILE, and a total line if
       more than one FILE is specified.  A word is a non-zero-length  sequence
       of characters delimited by white space.

       With no FILE, or when FILE is -, read standard input.

       The  options  below  may  be  used  to select which counts are printed,
       always in the following order: newline, word, character, byte,  maximum
       line length.

       -c, --bytes
              print the byte counts

       -m, --chars
              print the character counts

       -l, --lines
              print the newline counts

       --files0-from=F
              read  input  from the files specified by NUL-terminated names in
              file F; If F is - then read names from standard input

       -L, --max-line-length
              print the maximum display width

       -w, --words
              print the word counts

       --help display this help and exit

       --version
              output version information and exit

AUTHOR
       Written by Paul Rubin and David MacKenzie.

REPORTING BUGS
       GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
       Report wc translation bugs to <http://translationproject.org/team/>

COPYRIGHT
       Copyright © 2016 Free Software Foundation, Inc.   License  GPLv3+:  GNU
       GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
       This  is  free  software:  you  are free to change and redistribute it.
       There is NO WARRANTY, to the extent permitted by law.

SEE ALSO
       Full documentation at: <http://www.gnu.org/software/coreutils/wc>
       or available locally via: info '(coreutils) wc invocation'

GNU coreutils 8.25               February 2017                           WC(1)

Let's analyse the measured earth quakes in `data/earthquakes.csv`¶

This will ensure we are all looking at the same file!

But feel free to play with your own data/my_earthquakes.csv on the side.

Exercise:¶

Grab origin-time, lat, lon, magnitude, depth

with open("data/earthquakes.csv") as f:
    reader = f.read()
    
dataList = reader.split('\n')

len(dataList)

21018

dataList[0]

'publicid,eventtype,origintime,modificationtime,longitude, latitude, magnitude, depth,magnitudetype,depthtype,evaluationmethod,evaluationstatus,evaluationmode,earthmodel,usedphasecount,usedstationcount,magnitudestationcount,minimumdistance,azimuthalgap,originerror,magnitudeuncertainty'

myDataAccumulatorList =[]
for data in dataList[1:-2]:
    dataRow = data.split(',')
    myData = [dataRow[4],dataRow[5],dataRow[6]]#,dataRow[7]]
    myFloatData = tuple([float(x) for x in myData])
    myDataAccumulatorList.append(myFloatData)

points(myDataAccumulatorList)

More on Statistics¶

Recall that a statistic is any measureable function of the data: $T(x): \mathbb{X} \rightarrow \mathbb{T}$.

Thus, a statistic $T$ is also an RV that takes values in the space $\mathbb{T}$.

When $x \in \mathbb{X}$ is the observed data, $T(x)=t$ is the observed statistic of the observed data $x$.

Example 1: New Zealand Lotto and 1114 IID $de Moirve$ RVs¶

Let's go back to our New Zealand lotto data.

We showed that for New Zealand lotto (40 balls in the machine, numbered $1, 2, \ldots, 40$), the number on the first ball out of the machine can be modelled as a de Moivre$(\frac{1}{40}, \frac{1}{40}, \ldots, \frac{1}{40})$ RV.

We have the New Zealand Lotto results for 1114 draws, from 1 August 1987 to 10 November 2008 (retrieved from the NZ lotto web site: http://lotto.nzpages.co.nz/previousresults.html ).

We can think of this data as $x$, the realisation of a random vector $X = (X_1, X_2,\ldots, X_{1114})$ where $X_1, X_2,\ldots, X_{1114} \overset{IID}{\thicksim} \text{de Moivre}(\frac{1}{40}, \frac{1}{40}, \ldots, \frac{1}{40})$

The data space is every possible sequence of ball numbers that we could have got in these 1114 draws. $\mathbb{X} = \{1, 2, \ldots, 40\}^{1114}$. There are $40^{1114}$ possible sequences and our data is just one of these $40^{1114}$ possible points in the data space.

We will use our hidden function that enables us to get the ball one data in a list. Evaluate the cell below to get the data and confirm that we have data for 1114 draws.