%md
# [SDS-2.2-360-in-525-01: Intro to Apache Spark for Data Scientists](https://lamastex.github.io/scalable-data-science/360-in-525/2018/01/)

### [SDS-2.2, Scalable Data Science](https://lamastex.github.io/scalable-data-science/sds/2/2/)
%md
# Wiki Clickstream Analysis

**Dataset: 3.2 billion requests collected during the month of February 2015 grouped by (src, dest)**

**Source: https://datahub.io/dataset/wikipedia-clickstream/**

*This notebook requires Spark 1.6+.*
%md
This notebook was originally a data analysis workflow developed with [Databricks Community Edition](https://databricks.com/blog/2016/02/17/introducing-databricks-community-edition-apache-spark-for-all.html), a free version of Databricks designed for learning [Apache Spark](https://spark.apache.org/).

Here we elucidate the original Python notebook ([also linked here](/#workspace/scalable-data-science/xtraResources/sparkSummitEast2016/Wikipedia Clickstream Data)) used in the talk by Michael Armbrust at Spark Summit East February 2016, shared from [https://twitter.com/michaelarmbrust/status/699969850475737088](https://twitter.com/michaelarmbrust/status/699969850475737088) (watch later):

[](https://www.youtube.com/v/35Y-rqSMCCA)
%md
### Data set

The data we are exploring in this lab is the February 2015 English Wikipedia Clickstream data, and it is available here: http://datahub.io/dataset/wikipedia-clickstream/resource/be85cc68-d1e6-4134-804a-fd36b94dbb82.

According to Wikimedia:

> "The data contains counts of (referer, resource) pairs extracted from the request logs of English Wikipedia. When a client requests a resource by following a link or performing a search, the URI of the webpage that linked to the resource is included with the request in an HTTP header called the "referer". This data captures 22 million (referer, resource) pairs from a total of 3.2 billion requests collected during the month of February 2015."

The data is approximately 1.2GB and it is hosted in the following Databricks file: `/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed`
%md
### Let us first understand this Wikimedia data set a bit more

Let's read the datahub-hosted link [https://datahub.io/dataset/wikipedia-clickstream](https://datahub.io/dataset/wikipedia-clickstream) in the embedding below. Also click the [blog](http://ewulczyn.github.io/Wikipedia_Clickstream_Getting_Started/) by Ellery Wulczyn, Data Scientist at The Wikimedia Foundation, to better understand how the data was generated (remember to Right-Click and use -> and <- if navigating within the embedded html frame below).
// This allows easy embedding of publicly available information into any other notebook
// when viewing in git-book just ignore this block - you may have to manually chase the URL in frameIt("URL").
// Example usage:
// displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Topics_in_LDA",250))
def frameIt(u: String, h: Int): String = {
  """<iframe src="""" + u + """" width="95%" height="""" + h + """" sandbox>
       <p>
         <a href="http://spark.apache.org/docs/latest/index.html">
           Fallback link for browsers that, unlikely, don't support frames
         </a>
       </p>
     </iframe>"""
}

displayHTML(frameIt("https://datahub.io/dataset/wikipedia-clickstream", 500))
%md Run the next two cells for some housekeeping.
if (org.apache.spark.BuildInfo.sparkBranch < "1.6") sys.error("Attach this notebook to a cluster running Spark 1.6+")
val data = sc.textFile("dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed")
%md ##### Looking at the first few lines of the data
data.take(5).foreach(println)
data.take(2)
%md
* The first line looks like a header
* The second line contains data organized according to the header, i.e., `prev_id` = *(empty)*, `curr_id` = 3632887, `n` = 121, and so on.

Here is the meaning of each column:

- `prev_id`: if the referer does not correspond to an article in the main namespace of English Wikipedia, this value will be empty. Otherwise, it contains the unique MediaWiki page ID of the article corresponding to the referer, i.e., the previous article the client was on
- `curr_id`: the MediaWiki unique page ID of the article the client requested
- `prev_title`: the result of mapping the referer URL to the fixed set of values described below
- `curr_title`: the title of the article the client requested
- `n`: the number of occurrences of the (referer, resource) pair
- `type`:
  - "link" if the referer and request are both articles and the referer links to the request
  - "redlink" if the referer is an article and links to the request, but the request is not in the production enwiki.page table
  - "other" if the *referer* and request are both articles but the referer does not link to the request. This can happen when clients search or spoof their referer
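The column split on one such tab-separated line can be sketched in plain Scala (outside Spark); the header and the sample values below come from the dataset description above:

```scala
// Split one raw clickstream line on tabs; split with limit -1 keeps the
// empty prev_id field at the start of the line.
val header = "prev_id\tcurr_id\tn\tprev_title\tcurr_title\ttype"
val line   = "\t3632887\t121\tother-google\t!!\tother"
val record = header.split("\t").zip(line.split("\t", -1)).toMap
// record("prev_id") == "", record("curr_id") == "3632887", record("n") == "121"
```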
%md
Referers were mapped to a fixed set of values corresponding to internal traffic or external traffic from one of the top 5 global traffic sources to English Wikipedia, based on this scheme:

> - an article in the main namespace of English Wikipedia -> the article title
> - any Wikipedia page that is not in the main namespace of English Wikipedia -> `other-wikipedia`
> - an empty referer -> `other-empty`
> - a page from any other Wikimedia project -> `other-internal`
> - Google -> `other-google`
> - Yahoo -> `other-yahoo`
> - Bing -> `other-bing`
> - Facebook -> `other-facebook`
> - Twitter -> `other-twitter`
> - anything else -> `other-other`
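A hypothetical plain-Scala sketch of this mapping: the function name `mapReferer` and the URL-matching heuristics are illustrative assumptions, not Wikimedia's actual implementation (which, among other things, maps main-namespace article referers to the article title):

```scala
// Illustrative referer -> fixed-value mapping (assumed heuristics, not
// Wikimedia's real pipeline). Falls through to "other-other".
def mapReferer(referer: String): String = referer match {
  case ""                                                 => "other-empty"
  case r if r.contains("google.")                         => "other-google"
  case r if r.contains("yahoo.")                          => "other-yahoo"
  case r if r.contains("bing.")                           => "other-bing"
  case r if r.contains("facebook.")                       => "other-facebook"
  case r if r.contains("twitter.") || r.contains("t.co/") => "other-twitter"
  case _                                                  => "other-other"
}
```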
%md
In the second line of the file above, we can see there were 121 clicks from Google to the Wikipedia page on "!!" (double exclamation marks). People search for everything!

* prev_id = *(nothing)*
* curr_id = 3632887 *(Wikipedia page ID)*
* n = 121 *(People clicked from Google to this page 121 times in this month.)*
* prev_title = other-google *(This data record is for referrals from Google.)*
* curr_title = !! *(This Wikipedia page is about a double exclamation mark.)*
* type = other
%md
### Create a DataFrame from this CSV

* From the next Spark release, 2.0, CSV as a data source will be part of Spark's standard release. But we are using Spark 1.6 here.
// Load the raw dataset stored as a CSV file
val clickstream = sqlContext.
  read.
  format("com.databricks.spark.csv").
  options(Map("header" -> "true", "delimiter" -> "\t", "mode" -> "PERMISSIVE", "inferSchema" -> "true")).
  load("dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed")
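For comparison, in Spark 2.0+ the CSV reader is built in, so the same load can be written without the `spark-csv` package (a sketch, assuming a `SparkSession` named `spark` on the cluster):

```scala
// Built-in CSV reader in Spark 2.0+ (no com.databricks.spark.csv needed)
val clickstream2 = spark.read
  .option("header", "true")
  .option("delimiter", "\t")
  .option("mode", "PERMISSIVE")
  .option("inferSchema", "true")
  .csv("dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed")
```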
clickstream.printSchema
display(clickstream)
| prev_id | curr_id | n | prev_title | curr_title | type |
|---|---|---|---|---|---|
| null | 3632887 | 121 | other-google | !! | other |
| null | 3632887 | 93 | other-wikipedia | !! | other |
| null | 3632887 | 46 | other-empty | !! | other |
| null | 3632887 | 10 | other-other | !! | other |
| 64486 | 3632887 | 11 | !_(disambiguation) | !! | other |
| 2061699 | 2556962 | 19 | Louden_Up_Now | !!!_(album) | link |
| null | 2556962 | 25 | other-empty | !!!_(album) | other |
| null | 2556962 | 16 | other-google | !!!_(album) | other |
| null | 2556962 | 44 | other-wikipedia | !!!_(album) | other |
| 64486 | 2556962 | 15 | !_(disambiguation) | !!!_(album) | link |
| 600744 | 2556962 | 297 | !!! | !!!_(album) | link |
| null | 6893310 | 11 | other-empty | !Hero_(album) | other |
| 1921683 | 6893310 | 26 | !Hero | !Hero_(album) | link |
| null | 6893310 | 16 | other-wikipedia | !Hero_(album) | other |
| null | 6893310 | 23 | other-google | !Hero_(album) | other |
| 8127304 | 22602473 | 16 | Jericho_Rosales | !Oka_Tokat | link |
| 35978874 | 22602473 | 20 | List_of_telenovelas_of_ABS-CBN | !Oka_Tokat | link |
| null | 22602473 | 57 | other-google | !Oka_Tokat | other |
| null | 22602473 | 12 | other-wikipedia | !Oka_Tokat | other |
| null | 22602473 | 23 | other-empty | !Oka_Tokat | other |
| 7360687 | 22602473 | 10 | Rica_Peralejo | !Oka_Tokat | link |
| 37104582 | 22602473 | 11 | Jeepney_TV | !Oka_Tokat | link |
| 34376590 | 22602473 | 22 | Oka_Tokat_(2012_TV_series) | !Oka_Tokat | link |
| null | 6810768 | 20 | other-wikipedia | !T.O.O.H.! | other |
| null | 6810768 | 81 | other-google | !T.O.O.H.! | other |
| 31976181 | 6810768 | 51 | List_of_death_metal_bands,_!–K | !T.O.O.H.! | link |
| null | 6810768 | 35 | other-empty | !T.O.O.H.! | other |
| null | 3243047 | 21 | other-empty | !_(album) | other |
| 1337475 | 3243047 | 208 | The_Dismemberment_Plan | !_(album) | link |
| 3284285 | 3243047 | 78 | The_Dismemberment_Plan_Is_Terrified | !_(album) | link |
| null | 3243047 | 28 | other-wikipedia | !_(album) | other |
| 2098292 | 899480 | 58 | United_States_military_award_devices | "A"_Device | link |
| 194844 | 899480 | 15 | USS_Yorktown_(CV-5) | "A"_Device | link |
| null | 899480 | 17 | other-google | "A"_Device | other |
| null | 899480 | 13 | other-empty | "A"_Device | other |
| null | 899480 | 29 | other-wikipedia | "A"_Device | other |
| 878246 | 899480 | 11 | American_Defense_Service_Medal | "A"_Device | link |
| 855901 | 899480 | 24 | Overseas_Service_Ribbon | "A"_Device | other |
| 206427 | 899480 | 33 | USS_Ranger_(CV-4) | "A"_Device | link |
| 773691 | 899480 | 47 | Antarctica_Service_Medal | "A"_Device | link |
| 2301720 | 1282996 | 43 | Kinsey_Millhone | "A"_Is_for_Alibi | link |
| null | 1282996 | 45 | other-empty | "A"_Is_for_Alibi | other |
| null | 1282996 | 10 | other-yahoo | "A"_Is_for_Alibi | other |
| 470006 | 1282996 | 207 | Sue_Grafton | "A"_Is_for_Alibi | link |
| null | 1282996 | 18 | other-other | "A"_Is_for_Alibi | other |
| null | 1282996 | 31 | other-wikipedia | "A"_Is_for_Alibi | other |
| null | 1282996 | 272 | other-google | "A"_Is_for_Alibi | other |
| 39606873 | 1282996 | 10 | "W"_Is_for_Wasted | "A"_Is_for_Alibi | link |
| 26181056 | 9003666 | 17 | And | "And"_theory_of_conservatism | link |
| null | 9003666 | 109 | other-wikipedia | "And"_theory_of_conservatism | other |
| null | 9003666 | 18 | other-google | "And"_theory_of_conservatism | other |
| null | 39072529 | 49 | other-google | "Bassy"_Bob_Brockmann | other |
| null | 39072529 | 10 | other-other | "Bassy"_Bob_Brockmann | other |
| 11273993 | null | 15 | Colt_1851_Navy_Revolver | "Bigfoot"_Wallace | redlink |
| 12571133 | 25033979 | 12 | "V"_Is_for_Vagina | "C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE) | link |
| 113468 | 25033979 | 24 | The_Mission | "C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE) | link |
| 14096078 | 25033979 | 15 | Trent_Reznor_discography | "C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE) | other |
| null | 25033979 | 42 | other-empty | "C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE) | other |
| 1375614 | 25033979 | 15 | Tapeworm_(band) | "C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE) | other |
| 159547 | 25033979 | 25 | Milla_Jovovich | "C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE) | other |
| 28639397 | 25033979 | 73 | Sound_into_Blood_into_Wine | "C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE) | link |
| 1893465 | 25033979 | 30 | Carina_Round | "C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE) | other |
| 33622887 | 25033979 | 10 | Conditions_of_My_Parole | "C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE) | link |
| 147692 | 25033979 | 25 | Tim_Alexander | "C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE) | other |
| 4619790 | 25033979 | 593 | Puscifer | "C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE) | link |
| null | 25033979 | 36 | other-wikipedia | "C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE) | other |
| null | 25033979 | 93 | other-google | "C"_is_for_(Please_Insert_Sophomoric_Genitalia_Reference_HERE) | other |
| 69161 | null | 51 | Tết | "Chúc_Mừng_Năm_Mới"_or_best_wishes_for_the_new_year. | redlink |
| 1438509 | null | 14 | List_of_Old_West_gunfighters | "Cool_Hand_Conor"_O'Neill | redlink |
| null | 331586 | 6820 | other-google | "Crocodile"_Dundee | other |
| null | 331586 | 20 | other-twitter | "Crocodile"_Dundee | other |
| null | 331586 | 781 | other-wikipedia | "Crocodile"_Dundee | other |
| 489033 | 331586 | 59 | List_of_Academy_Awards_ceremonies | "Crocodile"_Dundee | link |
| 10040606 | 331586 | 38 | List_of_Australian_films | "Crocodile"_Dundee | other |
| 2564144 | 331586 | 154 | Crocodile_Dundee_in_Los_Angeles | "Crocodile"_Dundee | link |
| 6127928 | 331586 | 14 | Bobby_Alto | "Crocodile"_Dundee | other |
| 152171 | 331586 | 13 | Baz_Luhrmann | "Crocodile"_Dundee | link |
| 8078282 | 331586 | 348 | Australia_(2008_film) | "Crocodile"_Dundee | link |
| 37386608 | 331586 | 66 | 2015_in_film | "Crocodile"_Dundee | link |
| 34557 | 331586 | 12 | 1980s | "Crocodile"_Dundee | other |
| 1118809 | 331586 | 297 | "Crocodile"_Dundee_II | "Crocodile"_Dundee | link |
| 7033 | 331586 | 52 | Caitlin_Clarke | "Crocodile"_Dundee | other |
| 72766 | 331586 | 31 | Dundee_(disambiguation) | "Crocodile"_Dundee | other |
| 171612 | 331586 | 221 | 1986_in_film | "Crocodile"_Dundee | link |
| 2376452 | 331586 | 34 | Australian_New_Wave | "Crocodile"_Dundee | other |
| 1248074 | 331586 | 60 | David_Gulpilil | "Crocodile"_Dundee | link |
| 865241 | 331586 | 10 | Crocodile_Hunter | "Crocodile"_Dundee | other |
| 196020 | 331586 | 12 | Crocodilia | "Crocodile"_Dundee | link |
| 643649 | 331586 | 85 | List_of_most_watched_television_broadcasts | "Crocodile"_Dundee | link |
| 8306521 | 331586 | 13 | Anne_Carlisle | "Crocodile"_Dundee | other |
| 1448969 | 331586 | 18 | Bart_vs._Australia | "Crocodile"_Dundee | other |
| 70209 | 331586 | 153 | Cinema_of_Australia | "Crocodile"_Dundee | link |
| 4008173 | 331586 | 18 | 59th_Academy_Awards | "Crocodile"_Dundee | link |
| 331460 | 331586 | 17 | Bowie_knife | "Crocodile"_Dundee | link |
| 37882 | 331586 | 21 | Crocodile | "Crocodile"_Dundee | other |
| 44789934 | 331586 | 1283 | Deaths_in_2015 | "Crocodile"_Dundee | link |
| 22344579 | 331586 | 30 | Academy_Award_for_Best_Original_Screenplay | "Crocodile"_Dundee | link |
| 1872502 | 331586 | 10 | Boy-Scoutz_'n_the_Hood | "Crocodile"_Dundee | other |
| 5644 | 331586 | 13 | Comedy_film | "Crocodile"_Dundee | link |
| 458340 | 331586 | 10 | List_of_films_set_in_New_York_City | "Crocodile"_Dundee | other |

Showing the first 1000 rows.
%md `display` is a utility provided by Databricks. If you are programming directly in Spark, use the `show(numRows: Int)` method of DataFrame instead.
clickstream.show(5)
%md
### Reading from disk vs memory

The 1.2 GB Clickstream file is currently on S3, which means each time you scan through it, your Spark cluster has to read the 1.2 GB of data remotely over the network.
%md Call the `count()` action to check how many rows are in the DataFrame and to see how long it takes to read the DataFrame from S3.
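The corresponding code cell is not included in this export; given the cache discussion that follows, it was presumably along these lines (a sketch, not the original cell):

```scala
// Mark the DataFrame for caching, then materialize it with a first action
clickstream.cache().count()
```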
%md
* It took a few minutes to read the 1.2 GB file into your Spark cluster. The file has 22.5 million rows/lines.
* Although we have called cache, remember that it is evaluated (cached) only when an action (count) is called.
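The cache-then-materialize behaviour mirrors lazy evaluation in plain Scala; a minimal sketch using `lazy val` as a stand-in for `cache()` plus a first action:

```scala
// A lazy val, like a cached RDD/DataFrame, does no work when declared;
// the first access computes (and memoizes) the value, later accesses reuse it.
var evaluations = 0
lazy val cachedSum = { evaluations += 1; (1L to 1000000L).sum }
val first  = cachedSum  // computed here
val second = cachedSum  // served from the memoized value; evaluations stays 1
```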
%md Now call `count` again to see how much faster it is to read from memory.
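Again, the original cell is not in this export; the repeat action would simply be:

```scala
// Second action: now served from the in-memory cache
clickstream.count()
```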
%md
* Orders of magnitude faster!
* If you are going to be using the same data source multiple times, it is better to cache it in memory.
%md
### What are the top 10 articles requested?

To do this we also need to order by the sum of column `n`, in descending order.
// Type in your answer here...
display(clickstream
  .select(clickstream("curr_title"), clickstream("n"))
  .groupBy("curr_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(10))
%md
### Who sent the most traffic to Wikipedia in Feb 2015?

In other words, who were the top referers to Wikipedia?
display(clickstream
  .select(clickstream("prev_title"), clickstream("n"))
  .groupBy("prev_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(10))
%md
As expected, the top referer by a large margin is Google. Next comes refererless traffic (usually clients using HTTPS). The third largest sender of traffic to English Wikipedia is Wikipedia pages that are not in the main namespace (ns = 0) of English Wikipedia. Learn about the Wikipedia namespaces here: https://en.wikipedia.org/wiki/Wikipedia:Project_namespace

Also, note that Twitter sends 10x more requests to Wikipedia than Facebook.
%md ### What were the top 5 trending articles people from Twitter were looking up in Wikipedia?
// Type in your answer here...
display(clickstream
  .select(clickstream("curr_title"), clickstream("prev_title"), clickstream("n"))
  .filter("prev_title = 'other-twitter'")
  .groupBy("curr_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(5))
%md #### What percentage of page visits in Wikipedia are from other pages in Wikipedia itself?
val allClicks = clickstream.selectExpr("sum(n)").first.getLong(0)
val referrals = clickstream.
  filter(clickstream("prev_id").isNotNull).  // i.e., the referer is an article in the main namespace
  selectExpr("sum(n)").first.getLong(0)
(referrals * 100.0) / allClicks
%md #### Register the DataFrame to perform more complex queries
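The registration cell itself is not shown in this export; the SQL below queries a table named `clicks`, so a cell along these lines must have run first (Spark 1.6 API; the 2.x equivalent is noted in the comment):

```scala
// Register the DataFrame as a temporary table so %sql cells can query it as `clicks`
clickstream.registerTempTable("clicks")
// Spark 2.x+: clickstream.createOrReplaceTempView("clicks")
```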
%md #### Which Wikipedia pages have the most referrals to the Donald Trump page?
%sql
SELECT * FROM clicks
WHERE
  curr_title = 'Donald_Trump' AND
  prev_id IS NOT NULL AND prev_title != 'Main_Page'
ORDER BY n DESC
LIMIT 20
| prev_id | curr_id | n | prev_title | curr_title | type |
|---|---|---|---|---|---|
| 1861441 | 4848272 | 4658 | Ivanka_Trump | Donald_Trump | link |
| 4848272 | 4848272 | 2212 | Donald_Trump | Donald_Trump | link |
| 1209075 | 4848272 | 1855 | Melania_Trump | Donald_Trump | link |
| 1057887 | 4848272 | 1760 | Ivana_Trump | Donald_Trump | link |
| 5679119 | 4848272 | 1074 | Donald_Trump_Jr. | Donald_Trump | link |
| 21377251 | 4848272 | 918 | United_States_presidential_election,_2016 | Donald_Trump | link |
| 8095589 | 4848272 | 728 | Eric_Trump | Donald_Trump | link |
| 473806 | 4848272 | 652 | Marla_Maples | Donald_Trump | link |
| 2565136 | 4848272 | 651 | The_Trump_Organization | Donald_Trump | link |
| 9917693 | 4848272 | 599 | The_Celebrity_Apprentice | Donald_Trump | link |
| 9289480 | 4848272 | 597 | The_Apprentice_(U.S._TV_series) | Donald_Trump | link |
| 290327 | 4848272 | 596 | German_American | Donald_Trump | link |
| 12643497 | 4848272 | 585 | Comedy_Central_Roast | Donald_Trump | link |
| 37643999 | 4848272 | 549 | Republican_Party_presidential_candidates,_2016 | Donald_Trump | link |
| 417559 | 4848272 | 543 | Alan_Sugar | Donald_Trump | link |
| 1203316 | 4848272 | 489 | Fred_Trump | Donald_Trump | link |
| 303951 | 4848272 | 426 | Vince_McMahon | Donald_Trump | link |
| 6191053 | 4848272 | 413 | Jared_Kushner | Donald_Trump | link |
| 1295216 | 4848272 | 412 | Trump_Tower_(New_York_City) | Donald_Trump | link |
| 6509278 | 4848272 | 402 | Trump | Donald_Trump | link |
%md #### Top referrers to all presidential candidate pages
%sql
-- FIXME (broke query, will get back to it later)
SELECT * FROM clicks
WHERE prev_id IS NOT NULL
ORDER BY n DESC
LIMIT 20
| prev_id | curr_id | n | prev_title | curr_title | type |
|---|---|---|---|---|---|
| 15580374 | 44789934 | 769616 | Main_Page | Deaths_in_2015 | link |
| 35166850 | 40218034 | 368694 | Fifty_Shades_of_Grey | Fifty_Shades_of_Grey_(film) | link |
| 40218034 | 7000810 | 284352 | Fifty_Shades_of_Grey_(film) | Dakota_Johnson | link |
| 35793706 | 37371793 | 253460 | Arrow_(TV_series) | List_of_Arrow_episodes | link |
| 35166850 | 43180929 | 249155 | Fifty_Shades_of_Grey | Fifty_Shades_Darker | link |
| 40218034 | 6138391 | 228742 | Fifty_Shades_of_Grey_(film) | Jamie_Dornan | link |
| 43180929 | 35910161 | 220788 | Fifty_Shades_Darker | Fifty_Shades_Freed | link |
| 27676616 | 40265175 | 192321 | The_Walking_Dead_(TV_series) | The_Walking_Dead_(season_5) | link |
| 6138391 | 1076962 | 185700 | Jamie_Dornan | Amelia_Warner | link |
| 19376148 | 44375105 | 185449 | Stephen_Hawking | Jane_Wilde_Hawking | link |
| 27676616 | 28074027 | 161407 | The_Walking_Dead_(TV_series) | List_of_The_Walking_Dead_episodes | link |
| 34149123 | 41844524 | 161081 | List_of_The_Flash_episodes | The_Flash_(2014_TV_series) | other |
| 11269605 | 13542396 | 156313 | The_Big_Bang_Theory | List_of_The_Big_Bang_Theory_episodes | link |
| 39462431 | 34271398 | 152892 | American_Sniper_(film) | Chris_Kyle | link |
| 15580374 | 1738148 | 148820 | Main_Page | Limpet | other |
| 15580374 | 45298077 | 140335 | Main_Page | TransAsia_Airways_Flight_235 | other |
| 7000810 | 484101 | 139682 | Dakota_Johnson | Melanie_Griffith | link |
| 45119310 | 42567340 | 138179 | Take_Me_to_Church | Take_Me_to_Church_(Hozier_song) | link |
| 38962787 | 41126542 | 136236 | The_Blacklist_(TV_series) | List_of_The_Blacklist_episodes | link |
| 32262767 | 45305174 | 135900 | Better_Call_Saul | Uno_(Better_Call_Saul) | link |
%md
#### Load a visualization library

This code is copied after doing a live Google search (by Michael Armbrust at Spark Summit East February 2016, shared from [https://twitter.com/michaelarmbrust/status/699969850475737088](https://twitter.com/michaelarmbrust/status/699969850475737088)). The `d3ivan` package is an updated version of the original package used by Michael Armbrust, as it needed some TLC for Spark 2.2 on newer Databricks notebooks. These changes were kindly made by Ivan Sadikov from Middle Earth.
package d3ivan
// We use a package object so that we can define top level classes like Edge that need to be used in other cells

import org.apache.spark.sql._
import com.databricks.backend.daemon.driver.EnhancedRDDFunctions.displayHTML

case class Edge(src: String, dest: String, count: Long)

case class Node(name: String)
case class Link(source: Int, target: Int, value: Long)
case class Graph(nodes: Seq[Node], links: Seq[Link])

object graphs {
  // val sqlContext = SQLContext.getOrCreate(org.apache.spark.SparkContext.getOrCreate())  /// fix
  val sqlContext = SparkSession.builder().getOrCreate().sqlContext
  import sqlContext.implicits._

  def force(clicks: Dataset[Edge], height: Int = 100, width: Int = 960): Unit = {
    val data = clicks.collect()
    val nodes = (data.map(_.src) ++ data.map(_.dest)).map(_.replaceAll("_", " ")).toSet.toSeq.map(Node)
    val links = data.map { t =>
      Link(nodes.indexWhere(_.name == t.src.replaceAll("_", " ")),
           nodes.indexWhere(_.name == t.dest.replaceAll("_", " ")),
           t.count / 20 + 1)
    }
    showGraph(height, width, Seq(Graph(nodes, links)).toDF().toJSON.first())
  }

  /**
   * Displays a force directed graph using d3
   * input: {"nodes": [{"name": "..."}], "links": [{"source": 1, "target": 2, "value": 0}]}
   */
  def showGraph(height: Int, width: Int, graph: String): Unit = {
    displayHTML(s"""
<style>
.node_circle { stroke: #777; stroke-width: 1.3px; }
.node_label { pointer-events: none; }
.link { stroke: #777; stroke-opacity: .2; }
.node_count { stroke: #777; stroke-width: 1.0px; fill: #999; }
text.legend { font-family: Verdana; font-size: 13px; fill: #000; }
.node text { font-family: "Helvetica Neue","Helvetica","Arial",sans-serif; font-size: 17px; font-weight: 200; }
</style>

<div id="clicks-graph">
<script src="//d3js.org/d3.v3.min.js"></script>
<script>
var graph = $graph;
var width = $width, height = $height;

var color = d3.scale.category20();

var force = d3.layout.force()
  .charge(-700)
  .linkDistance(180)
  .size([width, height]);

var svg = d3.select("#clicks-graph").append("svg")
  .attr("width", width)
  .attr("height", height);

force
  .nodes(graph.nodes)
  .links(graph.links)
  .start();

var link = svg.selectAll(".link")
  .data(graph.links)
  .enter().append("line")
  .attr("class", "link")
  .style("stroke-width", function(d) { return Math.sqrt(d.value); });

var node = svg.selectAll(".node")
  .data(graph.nodes)
  .enter().append("g")
  .attr("class", "node")
  .call(force.drag);

node.append("circle")
  .attr("r", 10)
  .style("fill", function (d) {
    if (d.name.startsWith("other")) { return color(1); } else { return color(2); };
  })

node.append("text")
  .attr("dx", 10)
  .attr("dy", ".35em")
  .text(function(d) { return d.name });

// Now we are giving the SVGs co-ordinates - the force layout is generating the co-ordinates which this code is using to update the attributes of the SVG elements
force.on("tick", function () {
  link.attr("x1", function (d) { return d.source.x; })
    .attr("y1", function (d) { return d.source.y; })
    .attr("x2", function (d) { return d.target.x; })
    .attr("y2", function (d) { return d.target.y; });
  d3.selectAll("circle").attr("cx", function (d) { return d.x; })
    .attr("cy", function (d) { return d.y; });
  d3.selectAll("text").attr("x", function (d) { return d.x; })
    .attr("y", function (d) { return d.y; });
});
</script>
</div>
""")
  }

  def help() = {
    displayHTML("""
<p>
Produces a force-directed graph given a collection of edges of the following form:</br>
<tt><font color="#a71d5d">case class</font> <font color="#795da3">Edge</font>(<font color="#ed6a43">src</font>: <font color="#a71d5d">String</font>, <font color="#ed6a43">dest</font>: <font color="#a71d5d">String</font>, <font color="#ed6a43">count</font>: <font color="#a71d5d">Long</font>)</tt>
</p>
<p>Usage:<br/>
<tt><font color="#a71d5d">import</font> <font color="#ed6a43">d3._</font></tt><br/>
<tt><font color="#795da3">graphs.force</font>(</br>
  <font color="#ed6a43">height</font> = <font color="#795da3">500</font>,<br/>
  <font color="#ed6a43">width</font> = <font color="#795da3">500</font>,<br/>
  <font color="#ed6a43">clicks</font>: <font color="#795da3">Dataset</font>[<font color="#795da3">Edge</font>])</tt>
</p>""")
  }
}
d3ivan.graphs.force(
  height = 800,
  width = 1000,
  clicks = sql("""
    SELECT prev_title AS src, curr_title AS dest, n AS count
    FROM clicks
    WHERE
      curr_title IN ('Donald_Trump', 'Bernie_Sanders', 'Hillary_Rodham_Clinton', 'Ted_Cruz') AND
      prev_id IS NOT NULL AND prev_title != 'Main_Page'
    ORDER BY n DESC
    LIMIT 20""").as[d3ivan.Edge])