ScaDaMaLe Course site and book

By Guangyi Zhang (guaz@kth.se)

Please click HERE to watch the accompanying video.

This project aims to verify the friend-foe motifs in a large-scale signed social network.

A signed network is a graph that contains both positive and negative links. The sign of a link carries rich semantics in different applications. For example, in a social network, positive links can indicate friendly relationships, while negative ones indicate antagonistic interactions.
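
As a minimal illustration (the case class and field names here are ours, chosen to mirror the src/dst/rela columns used later in this notebook), a signed edge can be represented as a (source, destination, sign) triple:

// Sketch: a signed edge list as triples with sign in {+1, -1}.
case class SignedEdge(src: Long, dst: Long, sign: Int)
val toy = Seq(
  SignedEdge(1, 2, 1),  // 1 trusts 2
  SignedEdge(2, 3, -1)  // 2 distrusts 3
)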

In online discussion sites such as Slashdot, users can tag other users as “friends” and “foes”. These tags provide exemplary datasets for studying online signed networks. In this notebook we explore a dataset from Epinions, which contains 131,828 nodes, 841,372 edges, and millions of motifs. Epinions is the trust network of the Epinions product review website, where users can indicate their trust or distrust of the reviews of others. We analyze the network data in an undirected representation.

References:

Leskovec, Jure, Daniel Huttenlocher, and Jon Kleinberg. "Signed networks in social media." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2010.

Regarding the motifs, we investigate several interesting triads that are related to structural balance theory in an online signed social network. Structural balance theory originates in mid-20th-century social psychology, and considers the possible ways in which a triangle on three individuals can be signed.

Let us explain the different types of triads, which are shown in the figure below:

  • T3: “the friend of my friend is my friend”
  • T1: “the friend of my enemy is my enemy,” “the enemy of my friend is my enemy,” and “the enemy of my enemy is my friend”
  • T2 and T0: these do not quite make sense in a social network; for example, two friends of mine are unlikely to be enemies of each other. Equivalently, a triad is balanced (T3 or T1) exactly when the product of its three edge signs is positive; see the sketch after this list.
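
A minimal sketch (not from the original notebook; isBalanced is an illustrative helper) of this sign-product rule:

// Sketch: a triad is balanced exactly when the product of its three
// edge signs (each +1 or -1) is positive.
def isBalanced(s1: Int, s2: Int, s3: Int): Boolean = s1 * s2 * s3 > 0

isBalanced(1, 1, 1)    // T3: true  (balanced)
isBalanced(1, -1, -1)  // T1: true  (balanced)
isBalanced(1, 1, -1)   // T2: false (imbalanced)
isBalanced(-1, -1, -1) // T0: false (imbalanced)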

Our goal is to compare the counts of the different triad types in our chosen dataset.

[Figure: the four signed triad types T0–T3]

pwd
/databricks/driver
wget http://snap.stanford.edu/data/soc-sign-epinions.txt.gz
ls -l
total 2924
drwxr-xr-x  2 root root    4096 Jan  1  1970 conf
-rw-r--r--  1 root root     733 Nov 24 15:24 derby.log
drwxr-xr-x 10 root root    4096 Nov 24 15:24 eventlogs
drwxr-xr-x  2 root root    4096 Nov 24 16:15 ganglia
drwxr-xr-x  2 root root    4096 Nov 24 16:04 logs
-rw-r--r--  1 root root 2972840 Dec  3  2009 soc-sign-epinions.txt.gz
gunzip soc-sign-epinions.txt.gz
ls -l
total 11000
drwxr-xr-x  2 root root     4096 Jan  1  1970 conf
-rw-r--r--  1 root root      733 Nov 24 15:24 derby.log
drwxr-xr-x 10 root root     4096 Nov 24 15:24 eventlogs
drwxr-xr-x  2 root root     4096 Nov 24 16:15 ganglia
drwxr-xr-x  2 root root     4096 Nov 24 16:04 logs
-rw-r--r--  1 root root 11243141 Dec  3  2009 soc-sign-epinions.txt
head soc-sign-epinions.txt
# Directed graph: soc-sign-epinions
# Epinions signed social network
# Nodes: 131828 Edges: 841372
# FromNodeId	ToNodeId	Sign
0	1	-1
1	128552	-1
2	3	1
4	5	-1
4	155	-1
4	558	1
mkdir -p epinions
mv soc-sign-epinions.txt epinions/
ls -l /dbfs/FileStore
mv epinions /dbfs/FileStore/
total 33
drwxrwxrwx 2 root root   24 May  1  2018 datasets_magellan
drwxrwxrwx 2 root root 4096 Nov 24 11:14 DIGSUM-files
drwxrwxrwx 2 root root 4096 Nov 24 11:14 import-stage
drwxrwxrwx 2 root root 4096 Nov 24 11:14 jars
drwxrwxrwx 2 root root 4096 Nov 24 11:14 plots
drwxrwxrwx 2 root root 4096 Nov 24 11:14 shared_uploads
drwxrwxrwx 2 root root 4096 Nov 24 11:14 simon_temp_files_feel_free_to_delete_any_time
drwxrwxrwx 2 root root 4096 Nov 24 11:14 tables
drwxrwxrwx 2 root root 4096 Nov 24 11:14 timelinesOfInterest
mv: preserving permissions for ‘/dbfs/FileStore/epinions/soc-sign-epinions.txt’: Operation not permitted
mv: preserving permissions for ‘/dbfs/FileStore/epinions’: Operation not permitted
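
The “preserving permissions” warnings above are harmless: the files are moved, and only their permission bits cannot be carried over onto DBFS. As an alternative, the move could also be done from a Scala cell with the Databricks dbutils API; a sketch:

// Sketch: move the local directory into DBFS via dbutils instead of mv.
dbutils.fs.mv("file:/databricks/driver/epinions", "dbfs:/FileStore/epinions", recurse = true)
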
ls /
path name size
dbfs:/FileStore/ FileStore/ 0.0
dbfs:/_checkpoint/ _checkpoint/ 0.0
dbfs:/databricks/ databricks/ 0.0
dbfs:/databricks-datasets/ databricks-datasets/ 0.0
dbfs:/databricks-results/ databricks-results/ 0.0
dbfs:/datasets/ datasets/ 0.0
dbfs:/digsum-dataframe.csv/ digsum-dataframe.csv/ 0.0
dbfs:/local_disk0/ local_disk0/ 0.0
dbfs:/ml/ ml/ 0.0
dbfs:/mnt/ mnt/ 0.0
dbfs:/mytmpdir-forUserTimeLine/ mytmpdir-forUserTimeLine/ 0.0
dbfs:/results/ results/ 0.0
dbfs:/test/ test/ 0.0
dbfs:/tmp/ tmp/ 0.0
dbfs:/tmpdir/ tmpdir/ 0.0
dbfs:/user/ user/ 0.0
ls /FileStore
path name size
dbfs:/FileStore/DIGSUM-files/ DIGSUM-files/ 0.0
dbfs:/FileStore/datasets_magellan/ datasets_magellan/ 0.0
dbfs:/FileStore/epinions/ epinions/ 0.0
dbfs:/FileStore/import-stage/ import-stage/ 0.0
dbfs:/FileStore/jars/ jars/ 0.0
dbfs:/FileStore/plots/ plots/ 0.0
dbfs:/FileStore/shared_uploads/ shared_uploads/ 0.0
dbfs:/FileStore/simon_temp_files_feel_free_to_delete_any_time/ simon_temp_files_feel_free_to_delete_any_time/ 0.0
dbfs:/FileStore/tables/ tables/ 0.0
dbfs:/FileStore/timelinesOfInterest/ timelinesOfInterest/ 0.0
ls file:/databricks/driver
path name size
file:/databricks/driver/conf/ conf/ 4096.0
file:/databricks/driver/logs/ logs/ 4096.0
file:/databricks/driver/derby.log derby.log 733.0
file:/databricks/driver/ganglia/ ganglia/ 4096.0
file:/databricks/driver/eventlogs/ eventlogs/ 4096.0
//%fs mv file:///databricks/driver/epinions /FileStore/
%fs ls /FileStore/epinions/
path name size
dbfs:/FileStore/epinions/soc-sign-epinions.txt soc-sign-epinions.txt 1.1243141e7
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._
// This import is needed to use the $-notation
import spark.implicits._

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._
import spark.implicits._
var df = spark.read.format("csv")
  // .option("header", "true")
  .option("inferSchema", "true")
  .option("comment", "#")
  .option("sep", "\t")
  .load("/FileStore/epinions")
df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: int ... 1 more field]
df.count()
res1: Long = 841372
df.rdd.getNumPartitions
res36: Int = 3
df.head(3)
res37: Array[org.apache.spark.sql.Row] = Array([0,1,-1], [1,128552,-1], [2,3,1])
df.printSchema()
root
 |-- _c0: integer (nullable = true)
 |-- _c1: integer (nullable = true)
 |-- _c2: integer (nullable = true)
val newNames = Seq("src", "dst", "rela")
val e = df.toDF(newNames: _*)
newNames: Seq[String] = List(src, dst, rela)
e: org.apache.spark.sql.DataFrame = [src: int, dst: int ... 1 more field]
e.printSchema()
root
 |-- src: integer (nullable = true)
 |-- dst: integer (nullable = true)
 |-- rela: integer (nullable = true)
// Vertex DataFrame: node ids in the file run from 0 to 131827 (131,828 nodes),
// so the range must start at 0 and end (exclusively) at 131828; otherwise
// edges touching the missing ids are silently dropped in motif finding.
val v = spark.range(0, 131828).toDF("id")
v: org.apache.spark.sql.DataFrame = [id: bigint]
val g = GraphFrame(v, e)
g: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint], e:[src: int, dst: int ... 1 more field])
g.edges.take(3)
res15: Array[org.apache.spark.sql.Row] = Array([0,1,-1], [1,128552,-1], [2,3,1])
// val results = g.triangleCount.run()

We cannot make use of the convenient triangleCount() API because it does not take the sign of the edges into consideration. We need to write our own code to find the triads.

First, a triad should be undirected, but our graph consists of only directed edges.

One strategy is to keep only bidirectional edges of the same sign. But first we need to examine how large a proportion of edges we would lose.

// Search for pairs of vertices with edges in both directions between them,
// i.e., find undirected or bidirected edges.
val pair = g.find("(a)-[e1]->(b); (b)-[e2]->(a)")
println(pair.count())
val filtered = pair.filter("e1.rela == e2.rela")
println(filtered.count())
259751
254345
pair: org.apache.spark.sql.DataFrame = [a: struct<id: bigint>, e1: struct<src: int, dst: int ... 1 more field> ... 2 more fields]
filtered: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: struct<id: bigint>, e1: struct<src: int, dst: int ... 1 more field> ... 2 more fields]

Fortunately, we would lose only a very small fraction of edges: 254,345 of the 259,751 reciprocated matches agree in sign (about 98%).

This also makes sense for this dataset: if A trusts B, then it is quite unlikely that B distrusts A.
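
For reference, a sign-consistent undirected edge list could also be materialized explicitly, as in the following sketch (our own addition; undirectedEdges is an illustrative name — the motif query below instead matches both directions directly):

// Sketch: keep one canonical copy (src < dst) of every reciprocated,
// sign-consistent pair found in `filtered` above.
val undirectedEdges = filtered
  .select($"e1.src".as("src"), $"e1.dst".as("dst"), $"e1.rela".as("rela"))
  .where($"src" < $"dst")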

In order to count different triads, first we have to find all triads.

val triad = g.find("(a)-[eab]->(b); (b)-[eba]->(a); (b)-[ebc]->(c); (c)-[ecb]->(b); (c)-[eca]->(a); (a)-[eac]->(c)")
println(triad.count())
3314925
triad: org.apache.spark.sql.DataFrame = [a: struct<id: bigint>, eab: struct<src: int, dst: int ... 1 more field> ... 7 more fields]

After finding all triads, we count each type by filtering on the edge signs.

val t111 = triad.filter("eab.rela = 1 AND eab.rela = ebc.rela AND ebc.rela = eca.rela")
println(t111.count())
3232357
t111: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: struct<id: bigint>, eab: struct<src: int, dst: int ... 1 more field> ... 7 more fields]
val t000 = triad.filter("eab.rela = -1 AND eab.rela = ebc.rela AND ebc.rela = eca.rela")
println(t000.count())
1610
t000: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: struct<id: bigint>, eab: struct<src: int, dst: int ... 1 more field> ... 7 more fields]
val t110 = triad.filter("eab.rela + ebc.rela + eca.rela = 1")
println(t110.count())
62634
t110: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: struct<id: bigint>, eab: struct<src: int, dst: int ... 1 more field> ... 7 more fields]
val t001 = triad.filter("eab.rela + ebc.rela + eca.rela = -1")
println(t001.count())
18324
t001: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: struct<id: bigint>, eab: struct<src: int, dst: int ... 1 more field> ... 7 more fields]
val n111 = t111.count()
val n001 = t001.count()
val n000 = t000.count()
val n110 = t110.count()
val imbalanced = n000 + n110
val balanced = n111 + n001
n111: Long = 3232357
n001: Long = 18324
n000: Long = 1610
n110: Long = 62634
imbalanced: Long = 64244
balanced: Long = 3250681
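
As a small follow-up sketch (fracBalanced is an illustrative name, computed from the counts above), the balanced share can be made explicit:

// Fraction of triads that are balanced (T3 or T1).
val fracBalanced = balanced.toDouble / (balanced + imbalanced)
// fracBalanced ≈ 0.98 given the counts above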

As we can see, the number of balanced triads (about 98% of all triads) overwhelms the number of imbalanced ones, which supports structural balance theory.

Some tests on duplicated motifs

val g: GraphFrame = examples.Graphs.friends
g: org.graphframes.GraphFrame = GraphFrame(v:[id: string, name: string ... 1 more field], e:[src: string, dst: string ... 1 more field])
display(g.edges)
src dst relationship
a b friend
b c follow
c b follow
f c follow
e f follow
e d friend
d a friend
a e friend
val motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
motifs.show()
+----------------+--------------+----------------+--------------+
|               a|             e|               b|            e2|
+----------------+--------------+----------------+--------------+
|    [b, Bob, 36]|[b, c, follow]|[c, Charlie, 30]|[c, b, follow]|
|[c, Charlie, 30]|[c, b, follow]|    [b, Bob, 36]|[b, c, follow]|
+----------------+--------------+----------------+--------------+

motifs: org.apache.spark.sql.DataFrame = [a: struct<id: string, name: string ... 1 more field>, e: struct<src: string, dst: string ... 1 more field> ... 2 more fields]

As shown above, bidirectional edges are reported twice; consequently, each triad is matched multiple times (once per vertex ordering that satisfies the pattern). However, this does not matter in our project, because the ratios between the different triad types remain the same.
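
If each triad were needed exactly once, one could impose an ordering on the vertex ids, as in this sketch (our own addition; distinctTriads is an illustrative name):

// Sketch: keep one representative per triangle by requiring an ordering
// of the three vertex ids.
val distinctTriads = triad.filter("a.id < b.id AND b.id < c.id")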