ScaDaMaLe Course site and book

Interactive Exploration in Twitter as Graph, Network and Timeline

by Olof Björck, Joakim Johansson, Rania Sahioun, Raazesh Sainudiin and Ivan Sadikov

This is part of Project MEP: Meme Evolution Programme and is supported by Databricks, AWS and a Swedish VR grant.

Copyright 2016-2020 Olof Björck, Joakim Johansson, Rania Sahioun, Ivan Sadikov and Raazesh Sainudiin

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Run notebooks

The following notebook defines the variables and functions needed below.

./025_b_TTTDFfunctions

Read twitter data as TTTDF

We cache and count the data as a TTTDF, or Tweet-Transmission-Tree DataFrame.

Make sure the count is not too large, as the driver may otherwise crash. One can make a time-varying version that truncates the visualized data to an appropriate finite number of rows.
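The truncation idea above can be sketched in plain Scala as follows; `vizFraction` and the cutoff of 5000 rows are illustrative assumptions, not part of the notebook's API.

```scala
// Hypothetical guard (an assumption, not the notebook's API): compute a
// sampling fraction that caps the number of rows handed to the visualizer.
def vizFraction(totalRows: Long, maxRows: Long): Double =
  if (totalRows <= maxRows) 1.0 else maxRows.toDouble / totalRows

// With Spark one might then sample before collecting, e.g.:
//   val small = TTTDF.sample(withReplacement = false, vizFraction(TTTDF.count, 5000))
```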

USAGE: val df = tweetsDF2TTTDF(tweetsJsonStringDF2TweetsDF(fromParquetFile2DF("parquetFileName")))
       val df = tweetsDF2TTTDF(tweetsIDLong_JsonStringPairDF2TweetsDF(fromParquetFile2DF("parquetFileName")))
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.ColumnName
import org.apache.spark.sql.DataFrame
fromParquetFile2DF: (InputDFAsParquetFilePatternString: String)org.apache.spark.sql.DataFrame
tweetsJsonStringDF2TweetsDF: (tweetsAsJsonStringInputDF: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
tweetsIDLong_JsonStringPairDF2TweetsDF: (tweetsAsIDLong_JsonStringInputDF: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
tweetsDF2TTTDF: (tweetsInputDF: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
tweetsDF2TTTDFWithURLsAndHashtags: (tweetsInputDF: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
tweetsDF2TTTDFLightWeight: (tweetsInputDF: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
tweetsDF2TTTDFWithURLsAndHashtagsLightWeight: (tweetsInputDF: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
// Get the TTT  
val rawTweetsDir = "PATH_To_TWEETS_From_StreamingJOB"
val TTTDF = tweetsDF2TTTDFWithURLsAndHashtagsLightWeight(tweetsJsonStringDF2TweetsDF(fromParquetFile2DF(rawTweetsDir))).cache()
TTTDF.count // Check how much we got (we need to check the size and auto-truncate so as not to crash the driver program... all rows are kept for now)
TTTDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [CurrentTweetDate: timestamp, CurrentTwID: bigint ... 35 more fields]
res2: Long = 14451
TTTDF.rdd.getNumPartitions
res14: Int = 8

Next, we write the TTTDF to a Parquet file and then read it back into the visualizer. In production this would usually be done via a fast database.

TTTDF.write.mode("overwrite")
     .parquet("PATH_TO_PARQUET_TTTDF.parquet")
val TTTDFpq = spark.read.parquet("PATH_TO_PARQUET_TTTDF.parquet")
TTTDFpq: org.apache.spark.sql.DataFrame = [CurrentTweetDate: timestamp, CurrentTwID: bigint ... 35 more fields]
TTTDFpq.rdd.getNumPartitions
res16: Int = 2

Visualize a Twitter Graph

The graph shows the Twitter users in the collection, organized by their popularity in terms of various types of status updates.

See the function visualizeGraph for details. More sophisticated D3 interactive graphs are possible with the TTTDF, especially when filtered by time intervals or interactions of interest.
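As a plain-Scala illustration (no Spark) of the kind of per-user popularity tally such a graph encodes; the user names and status-update labels below are made up for the example:

```scala
// Toy data: (user, status-update type); the labels are illustrative assumptions.
val updates = Seq(("alice", "retweet"), ("alice", "reply"), ("bob", "retweet"), ("alice", "retweet"))

// Count each type of status update per user.
val byUser: Map[String, Map[String, Int]] =
  updates.groupBy(_._1).map { case (user, xs) =>
    user -> xs.groupBy(_._2).map { case (kind, v) => kind -> v.size }
  }
```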

./029_Viz_x_VizGraphFunction
visualizeGraph(TTTDFpq, 0, 700, 700)
*** WARNING: ***

THERE IS NO SIZE CHECKING! Driver might crash as DataFrame.collect is used.
Also, the display can only handle so many items at once. With too large a
dataset, the visualization may become very laggy or crash.

***  USAGE:  ***

visualizeGraph(TTTDF)

****************

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
visualizeGraph: (TTTDF: org.apache.spark.sql.DataFrame, tupleWeightCutOff: Int, width: Int, height: Int)Unit

Visualizing Retweet Network

Let us interactively explore the retweet network to identify tweets by their number of retweets versus the number of their unique retweeters.
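The two quantities this view compares can be illustrated in plain Scala (no Spark); the `Retweet` case class and its fields are illustrative assumptions, not the TTTDF schema:

```scala
// Toy retweet records: (original tweet ID, retweeter ID).
case class Retweet(originalTwID: Long, retweeterID: Long)

val rts = Seq(Retweet(1, 10), Retweet(1, 10), Retweet(1, 11), Retweet(2, 12))

// For each original tweet: (tweet ID, total retweets, distinct retweeters).
// Tweet 1 has 3 retweets but only 2 unique retweeters.
val stats: Seq[(Long, Int, Int)] =
  rts.groupBy(_.originalTwID).toSeq.map { case (id, rs) =>
    (id, rs.size, rs.map(_.retweeterID).distinct.size)
  }.sortBy(_._1)
```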

./029_Viz_x_VizNetworkFunction
*** WARNING: ***

THERE IS NO SIZE CHECKING! Driver might crash as DataFrame.collect is used.
Also, the display can only handle so many items at once. With too large a
dataset, the visualization may become very laggy or crash.

***  USAGE:  ***

visualizeNetwork(TTTDF)

****************

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
visualizeNetwork: (TTTDF: org.apache.spark.sql.DataFrame, width: Int, height: Int)Unit
visualizeNetwork(TTTDFpq)

Interactively Visualize a Twitter Timeline

Search through the Tweets over a timeline.
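A minimal plain-Scala sketch of the time-window filtering a timeline search performs; `Tweet` and `inWindow` are hypothetical names, not part of the notebook's API:

```scala
import java.time.Instant

// Hypothetical tweet record; the fields are illustrative assumptions.
case class Tweet(time: Instant, text: String)

// Keep tweets with time in the half-open interval [from, to),
// the kind of window a timeline slider might select.
def inWindow(ts: Seq[Tweet], from: Instant, to: Instant): Seq[Tweet] =
  ts.filter(t => !t.time.isBefore(from) && t.time.isBefore(to))
```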

./029_Viz_x_VizTimelineFunction
*** WARNING: ***

BUG: The first Tweet in the dataset doesn't display for some reason.

THERE IS NO SIZE CHECKING! Driver might crash as DataFrame.collect is used.
Also, the display can only handle so many items at once. With too large a
dataset, the visualization may become very laggy or crash.

***  USAGE:  ***

visualizeTimeline(TTTDF)
 
****************

import org.apache.spark.sql.DataFrame
visualizeTimeline: (TTTDF: org.apache.spark.sql.DataFrame, width: Int, height: Int)Unit
visualizeTimeline(TTTDFpq) 

What next?

Ideally, the TTTDF and its precursor raw tweets would live in bronze and silver delta.io tables, and one could then use an interactive dashboard such as Apache Superset:

Superset is fast, lightweight, intuitive, and loaded with options that make it easy for users of all skill sets to explore and visualize their data, from simple line charts to highly detailed geospatial charts.