ScaDaMaLe Course site and book

By Guangyi Zhang (guaz@kth.se)

Please click HERE to watch the accompanying video.

This project aims to verify the friend-foe motifs in a large-scale signed social network.

A signed network is a graph that contains both positive and negative links. The sign of a link carries rich semantics in different applications. For example, in a social network, positive links can indicate friendly relationships, while negative ones indicate antagonistic interactions.

In online discussion sites such as Slashdot, users can tag other users as “friends” and “foes”. These tags provide exemplary datasets for studying online signed networks. In this notebook we explore a dataset from Epinions, which contains 131,828 nodes, 841,372 edges, and millions of motifs. Epinions is the trust network of the Epinions product review website, where users can indicate their trust or distrust of the reviews of others. We analyze the network data in an undirected representation.

References:

Leskovec, Jure, Daniel Huttenlocher, and Jon Kleinberg. "Signed networks in social media." Proceedings of the SIGCHI conference on human factors in computing systems. 2010.

Regarding the motifs, we investigate several interesting triads related to structural balance theory in an online signed social network. Structural balance theory originates in mid-20th-century social psychology and considers the possible ways in which a triangle on three individuals can be signed.

Let us explain the different types of triads, which are shown in the figure below:

  • T3: “the friend of my friend is my friend”
  • T1: “the friend of my enemy is my enemy,” “the enemy of my friend is my enemy,” and “the enemy of my enemy is my friend”
  • T2 and T0: these do not quite make sense in a social network. For example, two friends of mine are unlikely to be enemies of each other.
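The classification above admits a compact test: a triad is balanced (T3 or T1) exactly when the product of its three edge signs is positive. A minimal Scala sketch, with helper names of our own choosing:

```scala
// A triad is balanced iff the product of its edge signs (+1 or -1) is positive:
// T3 = (+,+,+) and T1 = (+,-,-) are balanced; T2 = (+,+,-) and T0 = (-,-,-) are not.
def isBalanced(s1: Int, s2: Int, s3: Int): Boolean = s1 * s2 * s3 > 0

// The Tk label simply counts the positive edges in the triad.
def triadType(signs: Seq[Int]): String = s"T${signs.count(_ == 1)}"
```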

Our goal is to compare the numbers of different triads in our appointed dataset.

(Figure: the four triad types)

pwd
/databricks/driver
wget http://snap.stanford.edu/data/soc-sign-epinions.txt.gz
ls -l
total 2924
drwxr-xr-x  2 root root    4096 Jan  1  1970 conf
-rw-r--r--  1 root root     733 Nov 24 15:24 derby.log
drwxr-xr-x 10 root root    4096 Nov 24 15:24 eventlogs
drwxr-xr-x  2 root root    4096 Nov 24 16:15 ganglia
drwxr-xr-x  2 root root    4096 Nov 24 16:04 logs
-rw-r--r--  1 root root 2972840 Dec  3  2009 soc-sign-epinions.txt.gz
gunzip soc-sign-epinions.txt.gz
ls -l
total 11000
drwxr-xr-x  2 root root     4096 Jan  1  1970 conf
-rw-r--r--  1 root root      733 Nov 24 15:24 derby.log
drwxr-xr-x 10 root root     4096 Nov 24 15:24 eventlogs
drwxr-xr-x  2 root root     4096 Nov 24 16:15 ganglia
drwxr-xr-x  2 root root     4096 Nov 24 16:04 logs
-rw-r--r--  1 root root 11243141 Dec  3  2009 soc-sign-epinions.txt
head soc-sign-epinions.txt
# Directed graph: soc-sign-epinions
# Epinions signed social network
# Nodes: 131828 Edges: 841372
# FromNodeId	ToNodeId	Sign
0	1	-1
1	128552	-1
2	3	1
4	5	-1
4	155	-1
4	558	1
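Each non-comment line of the file is FromNodeId, ToNodeId, and Sign, separated by tabs. A minimal parser sketch (the case class and function names are our own; the Spark CSV reader used below handles this format for us):

```scala
// A hypothetical parser for one line of the edge list; '#' lines are comments.
case class SignedEdge(src: Int, dst: Int, sign: Int)

def parseLine(line: String): Option[SignedEdge] =
  if (line.startsWith("#")) None
  else line.split("\t") match {
    case Array(s, d, g) => Some(SignedEdge(s.toInt, d.toInt, g.toInt))
    case _              => None
  }
```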
mkdir -p epinions
mv soc-sign-epinions.txt epinions/
ls -l /dbfs/FileStore
total 33
drwxrwxrwx 2 root root   24 May  1  2018 datasets_magellan
drwxrwxrwx 2 root root 4096 Nov 24 11:14 DIGSUM-files
drwxrwxrwx 2 root root 4096 Nov 24 11:14 import-stage
drwxrwxrwx 2 root root 4096 Nov 24 11:14 jars
drwxrwxrwx 2 root root 4096 Nov 24 11:14 plots
drwxrwxrwx 2 root root 4096 Nov 24 11:14 shared_uploads
drwxrwxrwx 2 root root 4096 Nov 24 11:14 simon_temp_files_feel_free_to_delete_any_time
drwxrwxrwx 2 root root 4096 Nov 24 11:14 tables
drwxrwxrwx 2 root root 4096 Nov 24 11:14 timelinesOfInterest
mv epinions /dbfs/FileStore/
mv: preserving permissions for ‘/dbfs/FileStore/epinions/soc-sign-epinions.txt’: Operation not permitted
mv: preserving permissions for ‘/dbfs/FileStore/epinions’: Operation not permitted
ls /
path name size
dbfs:/FileStore/ FileStore/ 0.0
dbfs:/_checkpoint/ _checkpoint/ 0.0
dbfs:/databricks/ databricks/ 0.0
dbfs:/databricks-datasets/ databricks-datasets/ 0.0
dbfs:/databricks-results/ databricks-results/ 0.0
dbfs:/datasets/ datasets/ 0.0
dbfs:/digsum-dataframe.csv/ digsum-dataframe.csv/ 0.0
dbfs:/local_disk0/ local_disk0/ 0.0
dbfs:/ml/ ml/ 0.0
dbfs:/mnt/ mnt/ 0.0
dbfs:/mytmpdir-forUserTimeLine/ mytmpdir-forUserTimeLine/ 0.0
dbfs:/results/ results/ 0.0
dbfs:/test/ test/ 0.0
dbfs:/tmp/ tmp/ 0.0
dbfs:/tmpdir/ tmpdir/ 0.0
dbfs:/user/ user/ 0.0
ls /FileStore
path name size
dbfs:/FileStore/DIGSUM-files/ DIGSUM-files/ 0.0
dbfs:/FileStore/datasets_magellan/ datasets_magellan/ 0.0
dbfs:/FileStore/epinions/ epinions/ 0.0
dbfs:/FileStore/import-stage/ import-stage/ 0.0
dbfs:/FileStore/jars/ jars/ 0.0
dbfs:/FileStore/plots/ plots/ 0.0
dbfs:/FileStore/shared_uploads/ shared_uploads/ 0.0
dbfs:/FileStore/simon_temp_files_feel_free_to_delete_any_time/ simon_temp_files_feel_free_to_delete_any_time/ 0.0
dbfs:/FileStore/tables/ tables/ 0.0
dbfs:/FileStore/timelinesOfInterest/ timelinesOfInterest/ 0.0
ls file:/databricks/driver
path name size
file:/databricks/driver/conf/ conf/ 4096.0
file:/databricks/driver/logs/ logs/ 4096.0
file:/databricks/driver/derby.log derby.log 733.0
file:/databricks/driver/ganglia/ ganglia/ 4096.0
file:/databricks/driver/eventlogs/ eventlogs/ 4096.0
//%fs mv file:///databricks/driver/epinions /FileStore/
%fs ls /FileStore/epinions/
path name size
dbfs:/FileStore/epinions/soc-sign-epinions.txt soc-sign-epinions.txt 1.1243141e7
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

import org.graphframes._

// This import is needed to use the $-notation
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._
import spark.implicits._
var df = spark.read.format("csv")
//   .option("header", "true")
  .option("inferSchema", "true")
  .option("comment", "#")
  .option("sep", "\t")
  .load("/FileStore/epinions")
df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: int ... 1 more field]
df.count()
res1: Long = 841372
df.rdd.getNumPartitions 
res36: Int = 3
df.head(3)
res37: Array[org.apache.spark.sql.Row] = Array([0,1,-1], [1,128552,-1], [2,3,1])
df.printSchema()
root
 |-- _c0: integer (nullable = true)
 |-- _c1: integer (nullable = true)
 |-- _c2: integer (nullable = true)
val newNames = Seq("src", "dst", "rela")
val e = df.toDF(newNames: _*)
newNames: Seq[String] = List(src, dst, rela)
e: org.apache.spark.sql.DataFrame = [src: int, dst: int ... 1 more field]
e.printSchema()
root
 |-- src: integer (nullable = true)
 |-- dst: integer (nullable = true)
 |-- rela: integer (nullable = true)
// Vertex DataFrame
val v = spark.range(0, 131828).toDF("id")  // node ids in the file run from 0 to 131827
v: org.apache.spark.sql.DataFrame = [id: bigint]
val g = GraphFrame(v, e)
g: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint], e:[src: int, dst: int ... 1 more field])
g.edges.take(3)
res15: Array[org.apache.spark.sql.Row] = Array([0,1,-1], [1,128552,-1], [2,3,1])
// val results = g.triangleCount.run()

We cannot make use of the convenient triangleCount() API because it does not take the signs of edges into consideration. We need to write our own code to find triads.

First, a triad should be undirected, but our graph consists of only directed edges.

One strategy is to keep only bidirectional edges of the same sign, but we need to examine how large a proportion of edges we would lose.

// Search for pairs of vertices with edges in both directions between them, i.e., find undirected or bidirected edges.
val pair = g.find("(a)-[e1]->(b); (b)-[e2]->(a)")
println(pair.count())
val filtered = pair.filter("e1.rela == e2.rela")
println(filtered.count())
259751
254345
pair: org.apache.spark.sql.DataFrame = [a: struct<id: bigint>, e1: struct<src: int, dst: int ... 1 more field> ... 2 more fields]
filtered: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: struct<id: bigint>, e1: struct<src: int, dst: int ... 1 more field> ... 2 more fields]

Fortunately, we lose only a small fraction of the reciprocated edges.

This also makes sense for this dataset: if A trusts B, it is quite unlikely that B distrusts A.
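Using the counts reported by the two motif queries above, the fraction of reciprocated pairs discarded by the same-sign filter can be made explicit:

```scala
// Counts reported by the two motif queries above.
val totalPairs = 259751L  // reciprocated (ordered) pairs
val sameSign   = 254345L  // pairs whose two directions carry the same sign

val lostFraction = 1.0 - sameSign.toDouble / totalPairs
// about 2% of the reciprocated pairs are discarded
```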

In order to count the different types of triads, we first have to find all triads.

val triad = g.find("(a)-[eab]->(b); (b)-[eba]->(a); (b)-[ebc]->(c); (c)-[ecb]->(b); (c)-[eca]->(a); (a)-[eac]->(c)")
println(triad.count())
3314925
triad: org.apache.spark.sql.DataFrame = [a: struct<id: bigint>, eab: struct<src: int, dst: int ... 1 more field> ... 7 more fields]

After finding all triads, we count each type by filtering on the edge signs.

val t111 = triad.filter("eab.rela = 1 AND eab.rela = ebc.rela AND ebc.rela = eca.rela")
println(t111.count())
3232357
t111: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: struct<id: bigint>, eab: struct<src: int, dst: int ... 1 more field> ... 7 more fields]
val t000 = triad.filter("eab.rela = -1 AND eab.rela = ebc.rela AND ebc.rela = eca.rela")
println(t000.count())
1610
t000: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: struct<id: bigint>, eab: struct<src: int, dst: int ... 1 more field> ... 7 more fields]
val t110 = triad.filter("eab.rela + ebc.rela + eca.rela = 1")
println(t110.count())
62634
t110: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: struct<id: bigint>, eab: struct<src: int, dst: int ... 1 more field> ... 7 more fields]
val t001 = triad.filter("eab.rela + ebc.rela + eca.rela = -1")
println(t001.count())
18324
t001: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: struct<id: bigint>, eab: struct<src: int, dst: int ... 1 more field> ... 7 more fields]
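The four filters above encode each triad type through the sum of its three edge signs: +3 means all positive, +1 two positive, -1 one positive, and -3 all negative. A tiny sketch of that mapping (the label names match the variables above; the function is ours):

```scala
// Map the sum of three +1/-1 edge signs to the triad label used above.
def label(signSum: Int): String = signSum match {
  case  3 => "t111"  // all positive
  case  1 => "t110"  // two positive, one negative
  case -1 => "t001"  // one positive, two negative
  case -3 => "t000"  // all negative
}
```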
val n111 = t111.count()
val n001 = t001.count()
val n000 = t000.count() 
val n110 = t110.count()
val imbalanced = n000 + n110
val balanced = n111 + n001
n111: Long = 3232357
n001: Long = 18324
n000: Long = 1610
n110: Long = 62634
imbalanced: Long = 64244
balanced: Long = 3250681
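From these counts, the share of balanced triads can be made explicit:

```scala
// Balanced triads are t111 and t001; imbalanced are t000 and t110.
val nBalanced   = 3232357L + 18324L  // t111 + t001
val nImbalanced = 1610L + 62634L     // t000 + t110

val balancedShare = nBalanced.toDouble / (nBalanced + nImbalanced)
// over 98% of all triads are balanced
```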

As we can see, the number of balanced triads far exceeds the number of imbalanced ones, which is consistent with structural balance theory.

Some tests about duplicated motifs

val g: GraphFrame = examples.Graphs.friends
g: org.graphframes.GraphFrame = GraphFrame(v:[id: string, name: string ... 1 more field], e:[src: string, dst: string ... 1 more field])
display(g.edges)
src dst relationship
a b friend
b c follow
c b follow
f c follow
e f follow
e d friend
d a friend
a e friend
val motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
motifs.show()
+----------------+--------------+----------------+--------------+
|               a|             e|               b|            e2|
+----------------+--------------+----------------+--------------+
|    [b, Bob, 36]|[b, c, follow]|[c, Charlie, 30]|[c, b, follow]|
|[c, Charlie, 30]|[c, b, follow]|    [b, Bob, 36]|[b, c, follow]|
+----------------+--------------+----------------+--------------+

motifs: org.apache.spark.sql.DataFrame = [a: struct<id: string, name: string ... 1 more field>, e: struct<src: string, dst: string ... 1 more field> ... 2 more fields]

As shown above, bidirectional edges are reported twice, so each triad is matched multiple times (once per ordering of its vertices that fits the pattern). However, this does not matter for our purpose, because the ratios between the different triad types remain the same.
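If exact counts were needed, one way to count each undirected triad only once would be to impose an ordering on the vertex ids in the motif result, e.g. triad.filter("a.id < b.id AND b.id < c.id"). The effect can be illustrated without Spark, using hypothetical vertex ids:

```scala
// For a fully symmetric pattern, one triangle {1, 2, 3} can be matched once
// per assignment of the labels (a, b, c) to its vertices.
val assignments = Seq(1, 2, 3).permutations.toSeq

// Keeping only assignments with a < b < c retains exactly one match per triangle.
val canonical = assignments.filter { case Seq(a, b, c) => a < b && b < c }
```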