010_wikipediaClickStream_01ETLEDA

Data set

The data we are exploring in this lab is the February 2015 English Wikipedia Clickstream data, and it is available here: http://datahub.io/dataset/wikipedia-clickstream/resource/be85cc68-d1e6-4134-804a-fd36b94dbb82.

According to Wikimedia:

"The data contains counts of (referer, resource) pairs extracted from the request logs of English Wikipedia. When a client requests a resource by following a link or performing a search, the URI of the webpage that linked to the resource is included with the request in an HTTP header called the "referer". This data captures 22 million (referer, resource) pairs from a total of 3.2 billion requests collected during the month of February 2015."

The data is approximately 1.2 GB and is hosted at the following Databricks path: /databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed

display(dbutils.fs.ls("/databricks-datasets/wikipedia-datasets/"))


dbfs:/databricks-datasets/wikipedia-datasets/data-001/	data-001/	0

if (org.apache.spark.BuildInfo.sparkBranch < "1.6") sys.error("Attach this notebook to a cluster running Spark 1.6+")

val data = sc.textFile("dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed")

data: org.apache.spark.rdd.RDD[String] = dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed MapPartitionsRDD[582] at textFile at command-112937334110573:1

data.take(5).foreach(println)

prev_id curr_id n prev_title curr_title type
 3632887 121 other-google !! other
 3632887 93 other-wikipedia !! other
 3632887 46 other-empty !! other
 3632887 10 other-other !! other

data.take(2)

res4: Array[String] = Array(prev_id curr_id n prev_title curr_title type, " 3632887 121 other-google !! other")

The first line looks like a header.
The second line (separated from the first by "," in the array output) contains data organized according to the header. It starts with the field delimiter, so prev_id is empty, curr_id = 3632887, n = 121, and so on.

Actually, here is the meaning of each column:

prev_id: if the referer does not correspond to an article in the main namespace of English Wikipedia, this value will be empty. Otherwise, it contains the unique MediaWiki page ID of the article corresponding to the referer i.e. the previous article the client was on
curr_id: the MediaWiki unique page ID of the article the client requested
prev_title: the result of mapping the referer URL to the fixed set of values described below
curr_title: the title of the article the client requested
n: the number of occurrences of the (referer, resource) pair
type:
- "link" if the referer and request are both articles and the referer links to the request
- "redlink" if the referer is an article and links to the request, but the request is not in the production enwiki.page table
- "other" if the referer and request are both articles but the referer does not link to the request. This can happen when clients search or spoof their referer
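As a rough illustration of the column layout described above, a single raw line could be parsed into a typed record as follows. This is a minimal sketch in plain Scala and not part of the original notebook: the `Click` case class and the `parseClick` helper are hypothetical names, and the sketch assumes the fields are tab-separated, in the order given by the header line, with an empty string standing for a missing ID.

```scala
import scala.util.Try

// Hypothetical record type mirroring the six columns of the header line.
case class Click(prevId: Option[Int], currId: Option[Int], n: Int,
                 prevTitle: String, currTitle: String, kind: String)

// Parse one raw line, assuming tab-separated fields.
// Empty ID fields become None; malformed lines (including the header) yield None.
def parseClick(line: String): Option[Click] = {
  // A limit of -1 keeps empty fields instead of dropping them.
  val f = line.split("\t", -1)
  if (f.length != 6) None
  else Try {
    def id(s: String): Option[Int] = if (s.isEmpty) None else Some(s.toInt)
    Click(id(f(0)), id(f(1)), f(2).toInt, f(3), f(4), f(5))
  }.toOption
}

// The first data line shown earlier: empty prev_id, curr_id = 3632887, n = 121.
val sample = parseClick("\t3632887\t121\tother-google\t!!\tother")
```

The header line yields `None` because its fields are not numeric, which is one simple way to skip it when reading the file as plain text.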

Referers were mapped to a fixed set of values corresponding to internal traffic or external traffic from one of the top 5 global traffic sources to English Wikipedia, based on this scheme:

an article in the main namespace of English Wikipedia -> the article title

any Wikipedia page that is not in the main namespace of English Wikipedia -> other-wikipedia

an empty referer -> other-empty

a page from any other Wikimedia project -> other-internal

Google -> other-google

Yahoo -> other-yahoo

Bing -> other-bing

Facebook -> other-facebook

Twitter -> other-twitter

anything else -> other-other
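The mapping scheme above can be pictured as a chain of checks on the referer URL. The sketch below is only an illustration under stated assumptions, not Wikimedia's actual implementation: `mapReferer`, `nonMainNamespaces`, and `otherWikimediaHosts` are hypothetical names, the namespace list and the host patterns are simplified guesses, and real request logs need far more careful handling.

```scala
import java.net.URI
import scala.util.Try

// Assumption: a simplified, non-exhaustive list of non-article namespaces.
val nonMainNamespaces =
  Set("Talk", "User", "Wikipedia", "File", "Template", "Help", "Category", "Portal", "Special")

// Assumption: a non-exhaustive list of host suffixes for other Wikimedia projects.
val otherWikimediaHosts =
  Seq("wikimedia.org", "wiktionary.org", "wikidata.org", "wikibooks.org", "wikiquote.org", "wikisource.org")

// Map a referer URL to an article title or to one of the fixed "other-*" values.
def mapReferer(referer: String): String = {
  val uri  = Try(new URI(referer.trim)).toOption
  val host = uri.flatMap(u => Option(u.getHost)).map(_.toLowerCase).getOrElse("")
  val path = uri.flatMap(u => Option(u.getPath)).getOrElse("")

  if (referer.trim.isEmpty) "other-empty"
  else if (host == "en.wikipedia.org" && path.startsWith("/wiki/")) {
    val title  = path.stripPrefix("/wiki/")
    val prefix = title.takeWhile(_ != ':')
    // A known namespace prefix means the page is outside the main namespace.
    if (title.contains(":") && nonMainNamespaces.contains(prefix)) "other-wikipedia" else title
  } else if (host.endsWith("wikipedia.org")) "other-wikipedia"
  else if (otherWikimediaHosts.exists(h => host.endsWith(h))) "other-internal"
  else if (host.contains("google.")) "other-google"
  else if (host.contains("yahoo.")) "other-yahoo"
  else if (host.endsWith("bing.com")) "other-bing"
  else if (host.endsWith("facebook.com")) "other-facebook"
  else if (host.endsWith("twitter.com") || host == "t.co") "other-twitter"
  else "other-other"
}
```

The order of the checks matters: the English Wikipedia test has to come before the generic Wikipedia test, and the search engines and social networks are only consulted once every Wikimedia host has been ruled out.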

// Load the raw dataset, stored as a tab-separated file, using the spark-csv data source
val clickstream = sqlContext
    .read
    .format("com.databricks.spark.csv")
    .options(Map("header" -> "true", "delimiter" -> "\t", "mode" -> "PERMISSIVE", "inferSchema" -> "true"))
    .load("dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed")

clickstream: org.apache.spark.sql.DataFrame = [prev_id: int, curr_id: int ... 4 more fields]

clickstream.printSchema

display(clickstream)


null	3632887	121	other-google	!!	other
null	3632887	93	other-wikipedia	!!	other
null	3632887	46	other-empty	!!	other
null	3632887	10	other-other	!!	other
64486	3632887	11	!_(disambiguation)	!!	other
2061699	2556962	19	Louden_Up_Now	!!!_(album)	link
null	2556962	25	other-empty	!!!_(album)	other
null	2556962	16	other-google	!!!_(album)	other
null	2556962	44	other-wikipedia	!!!_(album)	other
64486	2556962	15	!_(disambiguation)	!!!_(album)	link
600744	2556962	297	!!!	!!!_(album)	link
null	6893310	11	other-empty	!Hero_(album)	other
1921683	6893310	26	!Hero	!Hero_(album)	link
null	6893310	16	other-wikipedia	!Hero_(album)	other
null	6893310	23	other-google	!Hero_(album)	other
8127304	22602473	16	Jericho_Rosales	!Oka_Tokat	link
35978874	22602473	20	List_of_telenovelas_of_ABS-CBN	!Oka_Tokat	link
null	22602473	57	other-google	!Oka_Tokat	other
null	22602473	12	other-wikipedia	!Oka_Tokat	other
null	22602473	23	other-empty	!Oka_Tokat	other
7360687	22602473	10	Rica_Peralejo	!Oka_Tokat	link
37104582	22602473	11	Jeepney_TV	!Oka_Tokat	link
34376590	22602473	22	Oka_Tokat_(2012_TV_series)	!Oka_Tokat	link
null	6810768	20	other-wikipedia	!T.O.O.H.!	other
null	6810768	81	other-google	!T.O.O.H.!	other
31976181	6810768	51	List_of_death_metal_bands,_!–K	!T.O.O.H.!	link
null	6810768	35	other-empty	!T.O.O.H.!	other
null	3243047	21	other-empty	!_(album)	other
1337475	3243047	208	The_Dismemberment_Plan	!_(album)	link
3284285	3243047	78	The_Dismemberment_Plan_Is_Terrified	!_(album)	link
null	3243047	28	other-wikipedia	!_(album)	other
2098292	899480	58	United_States_military_award_devices	"A"_Device	link
194844	899480	15	USS_Yorktown_(CV-5)	"A"_Device	link
null	899480	17	other-google	"A"_Device	other
null	899480	13	other-empty	"A"_Device	other
null	899480	29	other-wikipedia	"A"_Device	other
878246	899480	11	American_Defense_Service_Medal	"A"_Device	link
855901	899480	24	Overseas_Service_Ribbon	"A"_Device	other
206427	899480	33	USS_Ranger_(CV-4)	"A"_Device	link
773691	899480	47	Antarctica_Service_Medal	"A"_Device	link
2301720	1282996	43	Kinsey_Millhone	"A"_Is_for_Alibi	link
null	1282996	45	other-empty	"A"_Is_for_Alibi	other
null	1282996	10	other-yahoo	"A"_Is_for_Alibi	other
470006	1282996	207	Sue_Grafton	"A"_Is_for_Alibi	link
null	1282996	18	other-other	"A"_Is_for_Alibi	other
null	1282996	31	other-wikipedia	"A"_Is_for_Alibi	other
null	1282996	272	other-google	"A"_Is_for_Alibi	other
39606873	1282996	10	"W"_Is_for_Wasted	"A"_Is_for_Alibi	link
26181056	9003666	17	And	"And"_theory_of_conservatism	link
null	9003666	109	other-wikipedia	"And"_theory_of_conservatism	other
null	9003666	18	other-google	"And"_theory_of_conservatism	other
null	39072529	49	other-google	"Bassy"_Bob_Brockmann	other
null	39072529	10	other-other	"Bassy"_Bob_Brockmann	other
11273993	null	15	Colt_1851_Navy_Revolver	"Bigfoot"_Wallace	redlink
12571133	25033979	12	"V"_Is_for_Vagina	"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)	link
113468	25033979	24	The_Mission	"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)	link
14096078	25033979	15	Trent_Reznor_discography	"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)	other
null	25033979	42	other-empty	"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)	other
1375614	25033979	15	Tapeworm_(band)	"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)	other
159547	25033979	25	Milla_Jovovich	"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)	other
28639397	25033979	73	Sound_into_Blood_into_Wine	"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)	link
1893465	25033979	30	Carina_Round	"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)	other
33622887	25033979	10	Conditions_of_My_Parole	"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)	link
147692	25033979	25	Tim_Alexander	"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)	other
4619790	25033979	593	Puscifer	"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)	link
null	25033979	36	other-wikipedia	"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)	other
null	25033979	93	other-google	"C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE)	other
69161	null	51	Tết	"Chúc_Mừng_Năm_Mới"_or_best_wishes_for_the_new_year.	redlink
1438509	null	14	List_of_Old_West_gunfighters	"Cool_Hand_Conor"_O'Neill	redlink
null	331586	6820	other-google	"Crocodile"_Dundee	other
null	331586	20	other-twitter	"Crocodile"_Dundee	other
null	331586	781	other-wikipedia	"Crocodile"_Dundee	other
489033	331586	59	List_of_Academy_Awards_ceremonies	"Crocodile"_Dundee	link
10040606	331586	38	List_of_Australian_films	"Crocodile"_Dundee	other
2564144	331586	154	Crocodile_Dundee_in_Los_Angeles	"Crocodile"_Dundee	link
6127928	331586	14	Bobby_Alto	"Crocodile"_Dundee	other
152171	331586	13	Baz_Luhrmann	"Crocodile"_Dundee	link
8078282	331586	348	Australia_(2008_film)	"Crocodile"_Dundee	link
37386608	331586	66	2015_in_film	"Crocodile"_Dundee	link
34557	331586	12	1980s	"Crocodile"_Dundee	other
1118809	331586	297	"Crocodile"_Dundee_II	"Crocodile"_Dundee	link
7033	331586	52	Caitlin_Clarke	"Crocodile"_Dundee	other
72766	331586	31	Dundee_(disambiguation)	"Crocodile"_Dundee	other
171612	331586	221	1986_in_film	"Crocodile"_Dundee	link
2376452	331586	34	Australian_New_Wave	"Crocodile"_Dundee	other
1248074	331586	60	David_Gulpilil	"Crocodile"_Dundee	link
865241	331586	10	Crocodile_Hunter	"Crocodile"_Dundee	other
196020	331586	12	Crocodilia	"Crocodile"_Dundee	link
643649	331586	85	List_of_most_watched_television_broadcasts	"Crocodile"_Dundee	link
8306521	331586	13	Anne_Carlisle	"Crocodile"_Dundee	other
1448969	331586	18	Bart_vs._Australia	"Crocodile"_Dundee	other
70209	331586	153	Cinema_of_Australia	"Crocodile"_Dundee	link
4008173	331586	18	59th_Academy_Awards	"Crocodile"_Dundee	link
331460	331586	17	Bowie_knife	"Crocodile"_Dundee	link
37882	331586	21	Crocodile	"Crocodile"_Dundee	other
44789934	331586	1283	Deaths_in_2015	"Crocodile"_Dundee	link
22344579	331586	30	Academy_Award_for_Best_Original_Screenplay	"Crocodile"_Dundee	link
1872502	331586	10	Boy-Scoutz_'n_the_Hood	"Crocodile"_Dundee	other
5644	331586	13	Comedy_film	"Crocodile"_Dundee	link
458340	331586	10	List_of_films_set_in_New_York_City	"Crocodile"_Dundee	other

Showing the first 1000 rows.

clickstream.show(5)

+-------+-------+---+------------------+----------+-----+
|prev_id|curr_id|  n|        prev_title|curr_title| type|
+-------+-------+---+------------------+----------+-----+
|   null|3632887|121|      other-google|        !!|other|
|   null|3632887| 93|   other-wikipedia|        !!|other|
|   null|3632887| 46|       other-empty|        !!|other|
|   null|3632887| 10|       other-other|        !!|other|
|  64486|3632887| 11|!_(disambiguation)|        !!|other|
+-------+-------+---+------------------+----------+-----+
only showing top 5 rows

clickstream.cache().count()

res9: Long = 22509897

clickstream.count()

res10: Long = 22509897

//Type in your answer here...
display(clickstream
  .select(clickstream("curr_title"), clickstream("n"))
  .groupBy("curr_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(10))


Main_Page	127500620
87th_Academy_Awards	2559794
Fifty_Shades_of_Grey	2326175
Alive	2244781
Chris_Kyle	1709341
Fifty_Shades_of_Grey_(film)	1683892
Deaths_in_2015	1614577
Birdman_(film)	1545842
Islamic_State_of_Iraq_and_the_Levant	1406530
Stephen_Hawking	1384193
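The aggregation used for this ranking (group by the requested title, sum `n`, sort in descending order, keep the top entries) can be pictured on a handful of rows with ordinary Scala collections. This is a minimal sketch that does not use Spark; `sampleRows` and `topRequested` are hypothetical names, and the rows are copied from the table displayed earlier.

```scala
// A few (prev_title, curr_title, n) rows copied from the table displayed earlier.
val sampleRows = Seq(
  ("other-google", "!!", 121),
  ("other-wikipedia", "!!", 93),
  ("other-empty", "!!", 46),
  ("other-other", "!!", 10),
  ("!_(disambiguation)", "!!", 11),
  ("Louden_Up_Now", "!!!_(album)", 19),
  ("other-empty", "!!!_(album)", 25)
)

// Group by the requested title, sum the counts, sort descending, keep the top 10.
val topRequested: Seq[(String, Int)] = sampleRows
  .groupBy { case (_, currTitle, _) => currTitle }
  .map { case (currTitle, rows) => (currTitle, rows.map(_._3).sum) }
  .toSeq
  .sortBy { case (_, total) => -total }
  .take(10)
```

On this sample, "!!" comes first with a total of 281 and "!!!_(album)" second with 44. Spark performs the same steps, but distributed across the partitions of the DataFrame.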

display(clickstream
  .select(clickstream("prev_title"), clickstream("n"))
  .groupBy("prev_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(10))


other-google	1496209976
other-empty	347693595
other-wikipedia	129772279
other-other	77569671
other-bing	65962792
other-yahoo	48501171
Main_Page	29923502
other-twitter	19241298
other-facebook	2314026
87th_Academy_Awards	1680675

//Type in your answer here...
display(clickstream
  .select(clickstream("curr_title"), clickstream("prev_title"), clickstream("n"))
  .filter("prev_title = 'other-twitter'")
  .groupBy("curr_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(5))


Johnny_Knoxville	198908
Peter_Woodcock	126259
2002_Tampa_plane_crash	119906
Sơn_Đoòng_Cave	116012
The_boy_Jones	114401

val allClicks = clickstream.selectExpr("sum(n)").first.getLong(0)
val referals = clickstream.
                filter(clickstream("prev_id").isNotNull).
                selectExpr("sum(n)").first.getLong(0)
(referals * 100.0) / allClicks

allClicks: Long = 3283067885
referals: Long = 1095462001
res18: Double = 33.36702253416853
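The percentage above is the ratio of two Long totals, so the order of operations matters. The following sketch repeats the arithmetic with the totals copied from the cell output; `allClicksTotal`, `referralTotal`, `referralPct`, and `truncatedPct` are hypothetical names chosen so as not to clash with the notebook's own values.

```scala
// Totals copied from the cell output above.
val allClicksTotal = 3283067885L
val referralTotal  = 1095462001L

// Multiplying by 100.0 first promotes the expression to Double,
// so the division keeps its fractional part.
val referralPct = (referralTotal * 100.0) / allClicksTotal

// Dividing the two Longs first performs integer division and truncates to 0.
val truncatedPct = (referralTotal / allClicksTotal) * 100
```

Roughly one third of all counted requests therefore had an English Wikipedia article as referer.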

clickstream.createOrReplaceTempView("clicks")

%sql   
SELECT *
FROM clicks
WHERE 
  curr_title = 'Donald_Trump' AND
  prev_id IS NOT NULL AND prev_title != 'Main_Page'
ORDER BY n DESC
LIMIT 20


1861441	4848272	4658	Ivanka_Trump	Donald_Trump	link
4848272	4848272	2212	Donald_Trump	Donald_Trump	link
1209075	4848272	1855	Melania_Trump	Donald_Trump	link
1057887	4848272	1760	Ivana_Trump	Donald_Trump	link
5679119	4848272	1074	Donald_Trump_Jr.	Donald_Trump	link
21377251	4848272	918	United_States_presidential_election,_2016	Donald_Trump	link
8095589	4848272	728	Eric_Trump	Donald_Trump	link
473806	4848272	652	Marla_Maples	Donald_Trump	link
2565136	4848272	651	The_Trump_Organization	Donald_Trump	link
9917693	4848272	599	The_Celebrity_Apprentice	Donald_Trump	link
9289480	4848272	597	The_Apprentice_(U.S._TV_series)	Donald_Trump	link
290327	4848272	596	German_American	Donald_Trump	link
12643497	4848272	585	Comedy_Central_Roast	Donald_Trump	link
37643999	4848272	549	Republican_Party_presidential_candidates,_2016	Donald_Trump	link
417559	4848272	543	Alan_Sugar	Donald_Trump	link
1203316	4848272	489	Fred_Trump	Donald_Trump	link
303951	4848272	426	Vince_McMahon	Donald_Trump	link
6191053	4848272	413	Jared_Kushner	Donald_Trump	link
1295216	4848272	412	Trump_Tower_(New_York_City)	Donald_Trump	link
6509278	4848272	402	Trump	Donald_Trump	link

%sql   
-- YouTry 
---
-- fill in the right sql query here

package d3ivan
// We use a package object so that we can define top level classes like Edge that need to be used in other cells

import org.apache.spark.sql._
import com.databricks.backend.daemon.driver.EnhancedRDDFunctions.displayHTML

case class Edge(src: String, dest: String, count: Long)

case class Node(name: String)
case class Link(source: Int, target: Int, value: Long)
case class Graph(nodes: Seq[Node], links: Seq[Link])

object graphs {
// val sqlContext = SQLContext.getOrCreate(org.apache.spark.SparkContext.getOrCreate())  /// fix
val sqlContext = SparkSession.builder().getOrCreate().sqlContext
import sqlContext.implicits._
  
def force(clicks: Dataset[Edge], height: Int = 100, width: Int = 960): Unit = {
  val data = clicks.collect()
  val nodes = (data.map(_.src) ++ data.map(_.dest)).map(_.replaceAll("_", " ")).toSet.toSeq.map(Node)
  val links = data.map { t =>
    Link(nodes.indexWhere(_.name == t.src.replaceAll("_", " ")), nodes.indexWhere(_.name == t.dest.replaceAll("_", " ")), t.count / 20 + 1)
  }
  showGraph(height, width, Seq(Graph(nodes, links)).toDF().toJSON.first())
}

/**
 * Displays a force directed graph using d3
 * input: {"nodes": [{"name": "..."}], "links": [{"source": 1, "target": 2, "value": 0}]}
 */
def showGraph(height: Int, width: Int, graph: String): Unit = {

displayHTML(s"""
<style>

.node_circle {
  stroke: #777;
  stroke-width: 1.3px;
}

.node_label {
  pointer-events: none;
}

.link {
  stroke: #777;
  stroke-opacity: .2;
}

.node_count {
  stroke: #777;
  stroke-width: 1.0px;
  fill: #999;
}

text.legend {
  font-family: Verdana;
  font-size: 13px;
  fill: #000;
}

.node text {
  font-family: "Helvetica Neue","Helvetica","Arial",sans-serif;
  font-size: 17px;
  font-weight: 200;
}

</style>

<div id="clicks-graph">
<script src="//d3js.org/d3.v3.min.js"></script>
<script>

var graph = $graph;

var width = $width,
    height = $height;

var color = d3.scale.category20();

var force = d3.layout.force()
    .charge(-700)
    .linkDistance(180)
    .size([width, height]);

var svg = d3.select("#clicks-graph").append("svg")
    .attr("width", width)
    .attr("height", height);
    
force
    .nodes(graph.nodes)
    .links(graph.links)
    .start();

var link = svg.selectAll(".link")
    .data(graph.links)
    .enter().append("line")
    .attr("class", "link")
    .style("stroke-width", function(d) { return Math.sqrt(d.value); });

var node = svg.selectAll(".node")
    .data(graph.nodes)
    .enter().append("g")
    .attr("class", "node")
    .call(force.drag);

node.append("circle")
    .attr("r", 10)
    .style("fill", function (d) {
    if (d.name.startsWith("other")) { return color(1); } else { return color(2); };
})

node.append("text")
      .attr("dx", 10)
      .attr("dy", ".35em")
      .text(function(d) { return d.name });
      
//Now we are giving the SVGs co-ordinates - the force layout is generating the co-ordinates which this code is using to update the attributes of the SVG elements
force.on("tick", function () {
    link.attr("x1", function (d) {
        return d.source.x;
    })
        .attr("y1", function (d) {
        return d.source.y;
    })
        .attr("x2", function (d) {
        return d.target.x;
    })
        .attr("y2", function (d) {
        return d.target.y;
    });
    d3.selectAll("circle").attr("cx", function (d) {
        return d.x;
    })
        .attr("cy", function (d) {
        return d.y;
    });
    d3.selectAll("text").attr("x", function (d) {
        return d.x;
    })
        .attr("y", function (d) {
        return d.y;
    });
});
</script>
</div>
""")
}
  
  def help() = {
displayHTML("""
<p>
Produces a force-directed graph given a collection of edges of the following form:<br/>
<tt><font color="#a71d5d">case class</font> <font color="#795da3">Edge</font>(<font color="#ed6a43">src</font>: <font color="#a71d5d">String</font>, <font color="#ed6a43">dest</font>: <font color="#a71d5d">String</font>, <font color="#ed6a43">count</font>: <font color="#a71d5d">Long</font>)</tt>
</p>
<p>Usage:<br/>
<tt><font color="#a71d5d">import</font> <font color="#ed6a43">d3ivan._</font></tt><br/>
<tt><font color="#795da3">graphs.force</font>(<br/>
&nbsp;&nbsp;<font color="#ed6a43">height</font> = <font color="#795da3">500</font>,<br/>
&nbsp;&nbsp;<font color="#ed6a43">width</font> = <font color="#795da3">500</font>,<br/>
&nbsp;&nbsp;<font color="#ed6a43">clicks</font>: <font color="#795da3">Dataset</font>[<font color="#795da3">Edge</font>])</tt>
</p>""")
  }
}

Warning: classes defined within packages cannot be redefined without a cluster restart. Compilation successful.

d3ivan.graphs.help()

d3ivan.graphs.force(
  height = 800,
  width = 800,
  clicks = sql("""
    SELECT 
      prev_title AS src,
      curr_title AS dest,
      n AS count FROM clicks
    WHERE 
      curr_title IN ('Donald_Trump', 'Bernie_Sanders', 'Hillary_Rodham_Clinton', 'Ted_Cruz') AND
      prev_id IS NOT NULL AND prev_title != 'Main_Page'
    ORDER BY n DESC
    LIMIT 20""").as[d3ivan.Edge])

Convert raw data to parquet

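Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. Because the values of each column are stored together, queries that read only a few columns scan less data, and columns of similar values compress well, so it is usually a more efficient way to store DataFrames than raw text.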

To understand the ideas, read "Dremel: Interactive Analysis of Web-Scale Datasets" by Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton and Theo Vassilakis, Proc. of the 36th Int'l Conf. on Very Large Data Bases (2010), pp. 330-339, whose abstract is as follows:
- Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layouts it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.
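To make the idea of a columnar layout concrete, the toy sketch below stores the same three records once row by row and once column by column. It is only an illustration in plain Scala, not how Parquet actually encodes data (Parquet adds row groups, encodings, and compression on top of this idea); `ClickRow`, `ColumnStore`, and the value names are hypothetical, and the rows are taken from the Donald_Trump query results shown earlier.

```scala
// Row-oriented layout: each record keeps all of its fields together.
case class ClickRow(prevTitle: String, currTitle: String, n: Int)

val rowStore = Seq(
  ClickRow("Ivanka_Trump", "Donald_Trump", 4658),
  ClickRow("Melania_Trump", "Donald_Trump", 1855),
  ClickRow("Ivana_Trump", "Donald_Trump", 1760)
)

// Column-oriented layout: one contiguous array per column.
case class ColumnStore(prevTitle: Array[String], currTitle: Array[String], n: Array[Int])

val columnStore = ColumnStore(
  rowStore.map(_.prevTitle).toArray,
  rowStore.map(_.currTitle).toArray,
  rowStore.map(_.n).toArray
)

// An aggregation over n reads only the n array and never touches the title columns.
val totalClicks = columnStore.n.sum
```

In the row layout, the same sum has to walk over every record, titles included. That difference is the main reason aggregation queries tend to run faster on columnar data.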

import org.apache.spark.sql.SaveMode

// Convert the DataFrame to a more efficient format to speed up our analysis
clickstream.
  write.
  mode(SaveMode.Overwrite).
  parquet("/datasets/wiki-clickstream")

val clicks = sqlContext.read.parquet("/datasets/wiki-clickstream")

clicks: org.apache.spark.sql.DataFrame = [prev_id: int, curr_id: int ... 4 more fields]

clicks.printSchema

display(clicks)  // let's display this DataFrame


154849	7851	104	Strategic_Arms_Limitation_Talks	Comprehensive_Nuclear-Test-Ban_Treaty	link
null	7851	100	other-bing	Comprehensive_Nuclear-Test-Ban_Treaty	other
null	7851	82	other-other	Comprehensive_Nuclear-Test-Ban_Treaty	other
15580374	7851	31	Main_Page	Comprehensive_Nuclear-Test-Ban_Treaty	other
1455516	7851	21	Mercury,_Nevada	Comprehensive_Nuclear-Test-Ban_Treaty	link
30592	7851	158	Partial_Nuclear_Test_Ban_Treaty	Comprehensive_Nuclear-Test-Ban_Treaty	link
21785	7851	52	Nuclear_weapon	Comprehensive_Nuclear-Test-Ban_Treaty	link
1786856	7851	25	Onyx_River	Comprehensive_Nuclear-Test-Ban_Treaty	other
337775	7851	92	Nuclear_weapons_testing	Comprehensive_Nuclear-Test-Ban_Treaty	link
22158	7851	12	Nuclear_proliferation	Comprehensive_Nuclear-Test-Ban_Treaty	link
594209	7851	13	Moruroa	Comprehensive_Nuclear-Test-Ban_Treaty	other
14604	7851	12	Foreign_relations_of_India	Comprehensive_Nuclear-Test-Ban_Treaty	other
499076	7851	35	Force_de_dissuasion	Comprehensive_Nuclear-Test-Ban_Treaty	link
1838300	7851	10	Fissile_Material_Cut-off_Treaty	Comprehensive_Nuclear-Test-Ban_Treaty	other
6174686	7851	23	India–United_States_Civil_Nuclear_Agreement	Comprehensive_Nuclear-Test-Ban_Treaty	other
3973438	7851	47	High-altitude_nuclear_explosion	Comprehensive_Nuclear-Test-Ban_Treaty	link
2962287	7851	48	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	Comprehensive_Nuclear-Test-Ban_Treaty	link
9486	7851	58	List_of_international_environmental_agreements	Comprehensive_Nuclear-Test-Ban_Treaty	other
17671170	7851	14	List_of_weapons_of_mass_destruction_treaties	Comprehensive_Nuclear-Test-Ban_Treaty	link
41535782	7851	20	List_of_nuclear_weapons_tests_of_the_Soviet_Union	Comprehensive_Nuclear-Test-Ban_Treaty	other
null	7851	4143	other-google	Comprehensive_Nuclear-Test-Ban_Treaty	other
null	7851	218	other-wikipedia	Comprehensive_Nuclear-Test-Ban_Treaty	other
49750	7851	22	Arms_control	Comprehensive_Nuclear-Test-Ban_Treaty	other
34636	7851	12	1996	Comprehensive_Nuclear-Test-Ban_Treaty	link
589108	7851	43	China_and_weapons_of_mass_destruction	Comprehensive_Nuclear-Test-Ban_Treaty	other
585476	7851	13	Atomic_Weapons_Establishment	Comprehensive_Nuclear-Test-Ban_Treaty	link
400821	7851	524	List_of_states_with_nuclear_weapons	Comprehensive_Nuclear-Test-Ban_Treaty	link
704801	7851	13	List_of_United_States_treaties	Comprehensive_Nuclear-Test-Ban_Treaty	other
2189647	7851	40	List_of_nuclear_weapons_tests	Comprehensive_Nuclear-Test-Ban_Treaty	other
14533	7851	47	India	Comprehensive_Nuclear-Test-Ban_Treaty	link
10737	7851	11	French_Polynesia	Comprehensive_Nuclear-Test-Ban_Treaty	other
38404161	7851	13	Historical_nuclear_weapons_stockpiles_and_nuclear_tests_by_country	Comprehensive_Nuclear-Test-Ban_Treaty	other
215176	7851	10	Infrasound	Comprehensive_Nuclear-Test-Ban_Treaty	other
589091	7851	18	France_and_weapons_of_mass_destruction	Comprehensive_Nuclear-Test-Ban_Treaty	other
740008	7851	46	India_and_weapons_of_mass_destruction	Comprehensive_Nuclear-Test-Ban_Treaty	other
31446831	7851	17	Preparatory_Commission_for_the_Comprehensive_Nuclear-Test-Ban_Treaty_Organization	Comprehensive_Nuclear-Test-Ban_Treaty	link
162759	7851	11	Nevada_Test_Site	Comprehensive_Nuclear-Test-Ban_Treaty	other
53366	7851	11	Reconnaissance_satellite	Comprehensive_Nuclear-Test-Ban_Treaty	other
22165	7851	20	Nuclear_disarmament	Comprehensive_Nuclear-Test-Ban_Treaty	link
2824536	7851	40	Peaceful_nuclear_explosion	Comprehensive_Nuclear-Test-Ban_Treaty	link
null	7851	504	other-empty	Comprehensive_Nuclear-Test-Ban_Treaty	other
3003272	7851	10	Threshold_Test_Ban_Treaty	Comprehensive_Nuclear-Test-Ban_Treaty	link
53136	7851	20	Weapon_of_mass_destruction	Comprehensive_Nuclear-Test-Ban_Treaty	other
589015	7851	15	United_States_and_weapons_of_mass_destruction	Comprehensive_Nuclear-Test-Ban_Treaty	other
null	2962287	69	other-empty	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	other
1454371	2962287	13	Robinson_Crusoe_Island	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	link
31446831	2962287	20	Preparatory_Commission_for_the_Comprehensive_Nuclear-Test-Ban_Treaty_Organization	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	link
30669	2962287	10	Tunguska_event	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	link
null	2962287	19	other-other	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	other
18507109	2962287	25	List_of_intergovernmental_organizations	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	link
24644042	2962287	15	List_of_United_Nations_Organizations	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	link
41529235	2962287	12	2014_AA	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	link
7851	2962287	44	Comprehensive_Nuclear-Test-Ban_Treaty	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	link
null	2962287	45	other-wikipedia	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	other
null	2962287	364	other-google	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	other
null	24042157	134	other-google	Comprehensive_Peace_Accord	other
369826	24042157	145	Nepalese_Civil_War	Comprehensive_Peace_Accord	link
null	24042157	28	other-empty	Comprehensive_Peace_Accord	other
null	14712754	135	other-empty	Comprehensive_Peace_Agreement	other
32104474	14712754	12	Sudanese_conflict_in_South_Kordofan_and_Blue_Nile	Comprehensive_Peace_Agreement	link
13265409	14712754	47	South_Sudanese_independence_referendum,_2011	Comprehensive_Peace_Agreement	link
1131537	14712754	90	Second_Sudanese_Civil_War	Comprehensive_Peace_Agreement	link
3758554	14712754	17	Lost_Boys_of_Sudan	Comprehensive_Peace_Agreement	link
null	14712754	565	other-google	Comprehensive_Peace_Agreement	other
null	14712754	27	other-wikipedia	Comprehensive_Peace_Agreement	other
null	14712754	12	other-yahoo	Comprehensive_Peace_Agreement	other
13885196	14712754	15	Abyei	Comprehensive_Peace_Agreement	link
null	14712754	49	other-other	Comprehensive_Peace_Agreement	other
null	14712754	14	other-bing	Comprehensive_Peace_Agreement	other
32350676	14712754	56	South_Sudan	Comprehensive_Peace_Agreement	link
1805468	14712754	12	United_Nations_Mission_in_Sudan	Comprehensive_Peace_Agreement	link
27421	14712754	55	Sudan	Comprehensive_Peace_Agreement	link
null	6516998	39	other-google	Comprehensive_Performance_Assessment	other
null	40004458	17	other-google	Comprehensive_Physiology	other
null	426741	22	other-google	Comprehensive_Program_for_Socialist_Economic_Integration	other
384307	426741	31	Comecon	Comprehensive_Program_for_Socialist_Economic_Integration	link
2733145	null	20	Kano_model	Comprehensive_QFD	redlink
8053588	8935636	13	Food_safety	Comprehensive_Reviews_in_Food_Science_and_Food_Safety	link
264062	8935636	23	Kombucha	Comprehensive_Reviews_in_Food_Science_and_Food_Safety	link
null	8935636	13	other-empty	Comprehensive_Reviews_in_Food_Science_and_Food_Safety	other
null	8935636	23	other-google	Comprehensive_Reviews_in_Food_Science_and_Food_Safety	other
null	29639511	168	other-google	Comprehensive_Rural_Health_Project	other
null	29639511	13	other-empty	Comprehensive_Rural_Health_Project	other
null	41389730	84	other-google	Comprehensive_Social_Security_Assistance	other
null	41389730	33	other-wikipedia	Comprehensive_Social_Security_Assistance	other
null	41389730	16	other-empty	Comprehensive_Social_Security_Assistance	other
4590784	null	16	Educational_Records_Bureau	Comprehensive_Testing_Program	redlink
41176535	42015113	69	Geneva_interim_agreement_on_the_Iranian_nuclear_program	Comprehensive_agreement_on_Iranian_nuclear_program	link
310477	42015113	13	Iran–United_States_relations	Comprehensive_agreement_on_Iranian_nuclear_program	link
24192202	42015113	144	P5+1	Comprehensive_agreement_on_Iranian_nuclear_program	link
null	42015113	37	other-other	Comprehensive_agreement_on_Iranian_nuclear_program	other
null	42015113	325	other-empty	Comprehensive_agreement_on_Iranian_nuclear_program	other
10340960	42015113	10	Timeline_of_the_nuclear_program_of_Iran	Comprehensive_agreement_on_Iranian_nuclear_program	link
721807	42015113	94	Nuclear_program_of_Iran	Comprehensive_agreement_on_Iranian_nuclear_program	link
14653	42015113	96	Iran	Comprehensive_agreement_on_Iranian_nuclear_program	link
null	42015113	108	other-wikipedia	Comprehensive_agreement_on_Iranian_nuclear_program	other
null	42015113	2299	other-google	Comprehensive_agreement_on_Iranian_nuclear_program	other
null	4832436	121	other-google	Comprehensive_emergency_management	other
null	4832436	11	other-bing	Comprehensive_emergency_management	other
4831178	4832436	24	United_States_civil_defense	Comprehensive_emergency_management	other

Showing the first 1000 rows.

%py
clicksPy = sqlContext.read.parquet("/datasets/wiki-clickstream")

%py
# in Python you need to put the object on its own line like this to get the type information
clicksPy

Out[3]: DataFrame[prev_id: int, curr_id: int, n: int, prev_title: string, curr_title: string, type: string]

%py
clicksPy.show()

+--------+-------+---+--------------------+--------------------+-----+
| prev_id|curr_id|  n|          prev_title|          curr_title| type|
+--------+-------+---+--------------------+--------------------+-----+
|  154849|   7851|104|Strategic_Arms_Li...|Comprehensive_Nuc...| link|
|    null|   7851|100|          other-bing|Comprehensive_Nuc...|other|
|    null|   7851| 82|         other-other|Comprehensive_Nuc...|other|
|15580374|   7851| 31|           Main_Page|Comprehensive_Nuc...|other|
| 1455516|   7851| 21|     Mercury,_Nevada|Comprehensive_Nuc...| link|
|   30592|   7851|158|Partial_Nuclear_T...|Comprehensive_Nuc...| link|
|   21785|   7851| 52|      Nuclear_weapon|Comprehensive_Nuc...| link|
| 1786856|   7851| 25|          Onyx_River|Comprehensive_Nuc...|other|
|  337775|   7851| 92|Nuclear_weapons_t...|Comprehensive_Nuc...| link|
|   22158|   7851| 12|Nuclear_prolifera...|Comprehensive_Nuc...| link|
|  594209|   7851| 13|             Moruroa|Comprehensive_Nuc...|other|
|   14604|   7851| 12|Foreign_relations...|Comprehensive_Nuc...|other|
|  499076|   7851| 35| Force_de_dissuasion|Comprehensive_Nuc...| link|
| 1838300|   7851| 10|Fissile_Material_...|Comprehensive_Nuc...|other|
| 6174686|   7851| 23|India–United_Stat...|Comprehensive_Nuc...|other|
| 3973438|   7851| 47|High-altitude_nuc...|Comprehensive_Nuc...| link|
| 2962287|   7851| 48|Comprehensive_Nuc...|Comprehensive_Nuc...| link|
|    9486|   7851| 58|List_of_internati...|Comprehensive_Nuc...|other|
|17671170|   7851| 14|List_of_weapons_o...|Comprehensive_Nuc...| link|
|41535782|   7851| 20|List_of_nuclear_w...|Comprehensive_Nuc...|other|
+--------+-------+---+--------------------+--------------------+-----+
only showing top 20 rows

%r
library(SparkR)

# just a quick test
df <- createDataFrame(faithful)
head(df)

Attaching package: ‘SparkR’

The following object is masked _by_ ‘.GlobalEnv’:

    setLocalProperty

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

    as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
    rank, rbind, sample, startsWith, subset, summary, transform, union

  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

%r
# Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
# The result of loading a parquet file is also a DataFrame.
clicksR <- read.df("/datasets/wiki-clickstream", source = "parquet")
clicksR # in R you need to put the object on its own line like this to get the type information

SparkDataFrame[prev_id:int, curr_id:int, n:int, prev_title:string, curr_title:string, type:string]

%r
head(clicksR)

   prev_id curr_id  n               prev_title curr_title  type
1   334751   19271 24                 Cambodia   Mongolia  link
2      737   19271 24              Afghanistan   Mongolia other
3 18603746   19271 13                  Beijing   Mongolia  link
4  7770444   19271 10  Agriculture_in_Mongolia   Mongolia  link
5  7712057   19271 12 Christianity_in_Mongolia   Mongolia  link
6 16489766   19271 11      Cities_of_East_Asia   Mongolia  link
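
`head` only returns the first few rows in whatever order Spark happens to produce them. To rank the referrers of a single page, the SparkDataFrame can be filtered and sorted before the rows are collected. This is a sketch only, assuming `clicksR` from the cell above; the page title `"Mongolia"` is taken from the sample output, and the names `mongolia`, `ranked` and `top` are illustrative:

```r
library(SparkR)

# Keep only the rows whose destination page is "Mongolia".
mongolia <- filter(clicksR, clicksR$curr_title == "Mongolia")

# Sort by the number of requests, largest first.
ranked <- arrange(mongolia, desc(mongolia$n))

# head() brings at most 10 rows back to the driver as a local R data.frame.
top <- head(select(ranked, "prev_title", "n", "type"), 10)
top
```

The filtering and sorting run on the cluster; only the ten rows requested by `head` are brought back into the local R session.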

%r
display(clicksR)


prev_id	curr_id	n	prev_title	curr_title	type
154849	7851	104	Strategic_Arms_Limitation_Talks	Comprehensive_Nuclear-Test-Ban_Treaty	link
null	7851	100	other-bing	Comprehensive_Nuclear-Test-Ban_Treaty	other
null	7851	82	other-other	Comprehensive_Nuclear-Test-Ban_Treaty	other
15580374	7851	31	Main_Page	Comprehensive_Nuclear-Test-Ban_Treaty	other
1455516	7851	21	Mercury,_Nevada	Comprehensive_Nuclear-Test-Ban_Treaty	link
30592	7851	158	Partial_Nuclear_Test_Ban_Treaty	Comprehensive_Nuclear-Test-Ban_Treaty	link
21785	7851	52	Nuclear_weapon	Comprehensive_Nuclear-Test-Ban_Treaty	link
1786856	7851	25	Onyx_River	Comprehensive_Nuclear-Test-Ban_Treaty	other
337775	7851	92	Nuclear_weapons_testing	Comprehensive_Nuclear-Test-Ban_Treaty	link
22158	7851	12	Nuclear_proliferation	Comprehensive_Nuclear-Test-Ban_Treaty	link
594209	7851	13	Moruroa	Comprehensive_Nuclear-Test-Ban_Treaty	other
14604	7851	12	Foreign_relations_of_India	Comprehensive_Nuclear-Test-Ban_Treaty	other
499076	7851	35	Force_de_dissuasion	Comprehensive_Nuclear-Test-Ban_Treaty	link
1838300	7851	10	Fissile_Material_Cut-off_Treaty	Comprehensive_Nuclear-Test-Ban_Treaty	other
6174686	7851	23	India–United_States_Civil_Nuclear_Agreement	Comprehensive_Nuclear-Test-Ban_Treaty	other
3973438	7851	47	High-altitude_nuclear_explosion	Comprehensive_Nuclear-Test-Ban_Treaty	link
2962287	7851	48	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	Comprehensive_Nuclear-Test-Ban_Treaty	link
9486	7851	58	List_of_international_environmental_agreements	Comprehensive_Nuclear-Test-Ban_Treaty	other
17671170	7851	14	List_of_weapons_of_mass_destruction_treaties	Comprehensive_Nuclear-Test-Ban_Treaty	link
41535782	7851	20	List_of_nuclear_weapons_tests_of_the_Soviet_Union	Comprehensive_Nuclear-Test-Ban_Treaty	other
null	7851	4143	other-google	Comprehensive_Nuclear-Test-Ban_Treaty	other
null	7851	218	other-wikipedia	Comprehensive_Nuclear-Test-Ban_Treaty	other
49750	7851	22	Arms_control	Comprehensive_Nuclear-Test-Ban_Treaty	other
34636	7851	12	1996	Comprehensive_Nuclear-Test-Ban_Treaty	link
589108	7851	43	China_and_weapons_of_mass_destruction	Comprehensive_Nuclear-Test-Ban_Treaty	other
585476	7851	13	Atomic_Weapons_Establishment	Comprehensive_Nuclear-Test-Ban_Treaty	link
400821	7851	524	List_of_states_with_nuclear_weapons	Comprehensive_Nuclear-Test-Ban_Treaty	link
704801	7851	13	List_of_United_States_treaties	Comprehensive_Nuclear-Test-Ban_Treaty	other
2189647	7851	40	List_of_nuclear_weapons_tests	Comprehensive_Nuclear-Test-Ban_Treaty	other
14533	7851	47	India	Comprehensive_Nuclear-Test-Ban_Treaty	link
10737	7851	11	French_Polynesia	Comprehensive_Nuclear-Test-Ban_Treaty	other
38404161	7851	13	Historical_nuclear_weapons_stockpiles_and_nuclear_tests_by_country	Comprehensive_Nuclear-Test-Ban_Treaty	other
215176	7851	10	Infrasound	Comprehensive_Nuclear-Test-Ban_Treaty	other
589091	7851	18	France_and_weapons_of_mass_destruction	Comprehensive_Nuclear-Test-Ban_Treaty	other
740008	7851	46	India_and_weapons_of_mass_destruction	Comprehensive_Nuclear-Test-Ban_Treaty	other
31446831	7851	17	Preparatory_Commission_for_the_Comprehensive_Nuclear-Test-Ban_Treaty_Organization	Comprehensive_Nuclear-Test-Ban_Treaty	link
162759	7851	11	Nevada_Test_Site	Comprehensive_Nuclear-Test-Ban_Treaty	other
53366	7851	11	Reconnaissance_satellite	Comprehensive_Nuclear-Test-Ban_Treaty	other
22165	7851	20	Nuclear_disarmament	Comprehensive_Nuclear-Test-Ban_Treaty	link
2824536	7851	40	Peaceful_nuclear_explosion	Comprehensive_Nuclear-Test-Ban_Treaty	link
null	7851	504	other-empty	Comprehensive_Nuclear-Test-Ban_Treaty	other
3003272	7851	10	Threshold_Test_Ban_Treaty	Comprehensive_Nuclear-Test-Ban_Treaty	link
53136	7851	20	Weapon_of_mass_destruction	Comprehensive_Nuclear-Test-Ban_Treaty	other
589015	7851	15	United_States_and_weapons_of_mass_destruction	Comprehensive_Nuclear-Test-Ban_Treaty	other
null	2962287	69	other-empty	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	other
1454371	2962287	13	Robinson_Crusoe_Island	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	link
31446831	2962287	20	Preparatory_Commission_for_the_Comprehensive_Nuclear-Test-Ban_Treaty_Organization	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	link
30669	2962287	10	Tunguska_event	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	link
null	2962287	19	other-other	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	other
18507109	2962287	25	List_of_intergovernmental_organizations	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	link
24644042	2962287	15	List_of_United_Nations_Organizations	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	link
41529235	2962287	12	2014_AA	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	link
7851	2962287	44	Comprehensive_Nuclear-Test-Ban_Treaty	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	link
null	2962287	45	other-wikipedia	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	other
null	2962287	364	other-google	Comprehensive_Nuclear-Test-Ban_Treaty_Organization	other
null	24042157	134	other-google	Comprehensive_Peace_Accord	other
369826	24042157	145	Nepalese_Civil_War	Comprehensive_Peace_Accord	link
null	24042157	28	other-empty	Comprehensive_Peace_Accord	other
null	14712754	135	other-empty	Comprehensive_Peace_Agreement	other
32104474	14712754	12	Sudanese_conflict_in_South_Kordofan_and_Blue_Nile	Comprehensive_Peace_Agreement	link
13265409	14712754	47	South_Sudanese_independence_referendum,_2011	Comprehensive_Peace_Agreement	link
1131537	14712754	90	Second_Sudanese_Civil_War	Comprehensive_Peace_Agreement	link
3758554	14712754	17	Lost_Boys_of_Sudan	Comprehensive_Peace_Agreement	link
null	14712754	565	other-google	Comprehensive_Peace_Agreement	other
null	14712754	27	other-wikipedia	Comprehensive_Peace_Agreement	other
null	14712754	12	other-yahoo	Comprehensive_Peace_Agreement	other
13885196	14712754	15	Abyei	Comprehensive_Peace_Agreement	link
null	14712754	49	other-other	Comprehensive_Peace_Agreement	other
null	14712754	14	other-bing	Comprehensive_Peace_Agreement	other
32350676	14712754	56	South_Sudan	Comprehensive_Peace_Agreement	link
1805468	14712754	12	United_Nations_Mission_in_Sudan	Comprehensive_Peace_Agreement	link
27421	14712754	55	Sudan	Comprehensive_Peace_Agreement	link
null	6516998	39	other-google	Comprehensive_Performance_Assessment	other
null	40004458	17	other-google	Comprehensive_Physiology	other
null	426741	22	other-google	Comprehensive_Program_for_Socialist_Economic_Integration	other
384307	426741	31	Comecon	Comprehensive_Program_for_Socialist_Economic_Integration	link
2733145	null	20	Kano_model	Comprehensive_QFD	redlink
8053588	8935636	13	Food_safety	Comprehensive_Reviews_in_Food_Science_and_Food_Safety	link
264062	8935636	23	Kombucha	Comprehensive_Reviews_in_Food_Science_and_Food_Safety	link
null	8935636	13	other-empty	Comprehensive_Reviews_in_Food_Science_and_Food_Safety	other
null	8935636	23	other-google	Comprehensive_Reviews_in_Food_Science_and_Food_Safety	other
null	29639511	168	other-google	Comprehensive_Rural_Health_Project	other
null	29639511	13	other-empty	Comprehensive_Rural_Health_Project	other
null	41389730	84	other-google	Comprehensive_Social_Security_Assistance	other
null	41389730	33	other-wikipedia	Comprehensive_Social_Security_Assistance	other
null	41389730	16	other-empty	Comprehensive_Social_Security_Assistance	other
4590784	null	16	Educational_Records_Bureau	Comprehensive_Testing_Program	redlink
41176535	42015113	69	Geneva_interim_agreement_on_the_Iranian_nuclear_program	Comprehensive_agreement_on_Iranian_nuclear_program	link
310477	42015113	13	Iran–United_States_relations	Comprehensive_agreement_on_Iranian_nuclear_program	link
24192202	42015113	144	P5+1	Comprehensive_agreement_on_Iranian_nuclear_program	link
null	42015113	37	other-other	Comprehensive_agreement_on_Iranian_nuclear_program	other
null	42015113	325	other-empty	Comprehensive_agreement_on_Iranian_nuclear_program	other
10340960	42015113	10	Timeline_of_the_nuclear_program_of_Iran	Comprehensive_agreement_on_Iranian_nuclear_program	link
721807	42015113	94	Nuclear_program_of_Iran	Comprehensive_agreement_on_Iranian_nuclear_program	link
14653	42015113	96	Iran	Comprehensive_agreement_on_Iranian_nuclear_program	link
null	42015113	108	other-wikipedia	Comprehensive_agreement_on_Iranian_nuclear_program	other
null	42015113	2299	other-google	Comprehensive_agreement_on_Iranian_nuclear_program	other
null	4832436	121	other-google	Comprehensive_emergency_management	other
null	4832436	11	other-bing	Comprehensive_emergency_management	other
4831178	4832436	24	United_States_civil_defense	Comprehensive_emergency_management	other

Showing the first 1000 rows.
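
Because `display` shows only the first 1000 rows, an aggregate is often a more useful overview than scrolling through the table. A minimal sketch, again assuming `clicksR` from the earlier cell, sums the request counts per `type`; the result names `byType` and `total` are made up for this example:

```r
library(SparkR)

# Sum the request counts n within each value of the type column.
byType <- agg(groupBy(clicksR, "type"), total = sum(clicksR$n))

# The result is small, so it can be sorted and collected safely.
head(arrange(byType, desc(byType$total)))
```

In a Databricks notebook, `display(byType)` should also work here and would give access to the built-in plotting options.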

SDS-2.x, Scalable Data Engineering Science
