%md
# Wiki Clickstream Analysis
**Dataset: 3.2 billion requests collected during the month of February 2015, grouped by (src, dest)**

**Source: https://datahub.io/dataset/wikipedia-clickstream/**

*This notebook requires Spark 1.6+.*
%md
This notebook was originally a data analysis workflow developed with [Databricks Community Edition](https://databricks.com/blog/2016/02/17/introducing-databricks-community-edition-apache-spark-for-all.html), a free version of Databricks designed for learning [Apache Spark](https://spark.apache.org/).
Here we elucidate the original python notebook ([also linked here](/#workspace/scalable-data-science/xtraResources/sparkSummitEast2016/Wikipedia Clickstream Data)) used in the talk by Michael Armbrust at Spark Summit East February 2016
shared from [https://twitter.com/michaelarmbrust/status/699969850475737088](https://twitter.com/michaelarmbrust/status/699969850475737088)
Watch the talk later: [https://www.youtube.com/v/35Y-rqSMCCA](https://www.youtube.com/v/35Y-rqSMCCA)
%md
### Data set
The data we are exploring in this lab is the February 2015 English Wikipedia Clickstream data, and it is available here: http://datahub.io/dataset/wikipedia-clickstream/resource/be85cc68-d1e6-4134-804a-fd36b94dbb82.
According to Wikimedia:
>"The data contains counts of (referer, resource) pairs extracted from the request logs of English Wikipedia. When a client requests a resource by following a link or performing a search, the URI of the webpage that linked to the resource is included with the request in an HTTP header called the "referer". This data captures 22 million (referer, resource) pairs from a total of 3.2 billion requests collected during the month of February 2015."
The data is approximately 1.2GB and it is hosted in the following Databricks file: `/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed`
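Before reading the file, it can help to list the hosting directory to confirm the file is present and check its size. This is a sketch that assumes a Databricks notebook environment, where `dbutils.fs.ls` and `display` are available:

```scala
// List the clickstream directory on DBFS to confirm the raw file is present
// and inspect its size. Requires a Databricks runtime (dbutils is not part
// of plain Apache Spark).
display(dbutils.fs.ls("/databricks-datasets/wikipedia-datasets/data-001/clickstream/"))
```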
%md
### Let us first understand this Wikimedia data set a bit more
Let's read the datahub-hosted link [https://datahub.io/dataset/wikipedia-clickstream](https://datahub.io/dataset/wikipedia-clickstream) in the embedding below. Also click the [blog](http://ewulczyn.github.io/Wikipedia_Clickstream_Getting_Started/) by Ellery Wulczyn, Data Scientist at The Wikimedia Foundation, to better understand how the data was generated (remember to Right-Click and use -> and <- if navigating within the embedded html frame below).
//This allows easy embedding of publicly available information into any other notebook
//when viewing in git-book just ignore this block - you may have to manually chase the URL in frameIt("URL").
//Example usage:
// displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Topics_in_LDA",250))
def frameIt( u:String, h:Int ) : String = {
"""<iframe
src=""""+ u+""""
width="95%" height="""" + h + """"
sandbox>
<p>
<a href="http://spark.apache.org/docs/latest/index.html">
Fallback link for browsers that don't support frames
</a>
</p>
</iframe>"""
}
displayHTML(frameIt("https://datahub.io/dataset/wikipedia-clickstream",500))
val data = sc.textFile("dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed")
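To see what the rows actually look like, peek at the first few lines of the RDD. This sketch assumes the `data` RDD defined above and a live Spark cluster attached to the notebook:

```scala
// Print the header line and the first few data rows of the raw clickstream file.
// take(5) pulls only five records to the driver, so this is cheap even on a
// ~1.2GB file.
data.take(5).foreach(println)
```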
%md
* The first line looks like a header
* The second line (separated from the first by `"\n"`) contains data organized according to the header, i.e., `prev_id` = 3632887, `curr_id` = 121, and so on.

Here is the meaning of each column:

- `prev_id`: if the referer does not correspond to an article in the main namespace of English Wikipedia, this value will be empty. Otherwise, it contains the unique MediaWiki page ID of the article corresponding to the referer, i.e., the previous article the client was on
- `curr_id`: the MediaWiki unique page ID of the article the client requested
- `prev_title`: the result of mapping the referer URL to the fixed set of values described below
- `curr_title`: the title of the article the client requested
- `n`: the number of occurrences of the (referer, resource) pair
- `type`
  - "link" if the referer and request are both articles and the referer links to the request
  - "redlink" if the referer is an article and links to the request, but the request is not in the production enwiki.page table
  - "other" if the *referer* and request are both articles but the referer does not link to the request. This can happen when clients search or spoof their referer
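Since the file carries its own header, each row's fields can be recovered by splitting on the delimiter and zipping with the header, so the column order never has to be hard-coded. A minimal sketch in plain Scala; the sample header and row below are illustrative assumptions (the real file is assumed to be tab-delimited), not lines copied from the dataset:

```scala
// Sketch: name the fields of one clickstream row by zipping it with the header.
object ClickstreamParse {
  // Split both lines on the separator and pair header names with row values.
  def parseRow(header: String, row: String, sep: Char = '\t'): Map[String, String] =
    header.split(sep).map(_.trim).zip(row.split(sep).map(_.trim)).toMap

  def main(args: Array[String]): Unit = {
    // Hypothetical header and row, shaped like the schema described above.
    val header = "prev_id\tcurr_id\tn\tprev_title\tcurr_title\ttype"
    val row    = "3632887\t121\t10\tsome_article\tother_article\tlink"
    val parsed = parseRow(header, row)
    println(parsed("curr_id"))  // prints 121
  }
}
```

Keying fields by header name this way keeps downstream code readable and robust if the column order in the raw file ever differs from the order listed above.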
ScaDaMaLe Course site and book