ScaDaMaLe Course site and book

Data Processing

Here we load the data as it was given to us, a ".npy" file, and rewrite it in a simpler format. The motivation for this step is given in "Coding_Motifs", subsection "application".

# Load packages

import numpy as np

Read the data as a NumPy object (since that is the format we received it in), and save it as a ".csv", which Scala can read.


M = np.load('/dbfs/FileStore/shared_uploads/adlindhe@kth.se/M.npy')
np.savetxt("/dbfs/FileStore/shared_uploads/adlindhe@kth.se/M.csv", M, delimiter=",")
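The load/save step above can be sanity-checked on a toy array. This is a sketch using temporary paths (`/tmp/M_demo.*` are stand-ins, not the real DBFS paths):

```python
import numpy as np

# Round-trip sketch on a toy array; the real code uses the DBFS paths above.
M = np.array([[0, 1], [1, 0]])
np.save("/tmp/M_demo.npy", M)

# Load the .npy back and write it out as CSV, one comma-separated row per matrix row.
M_loaded = np.load("/tmp/M_demo.npy")
np.savetxt("/tmp/M_demo.csv", M_loaded, delimiter=",")
```

Note that `np.savetxt` writes entries in scientific notation by default, so the CSV holds floats like `1.000000000000000000e+00` rather than bare integers.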

Right now the data is an adjacency matrix of size 31346 × 31346, but only 0.7% of all entries are 1. Thus we would like to convert it into something a little easier to handle. We can read the adjacency matrix directly as a DataFrame, but that does not work well with GraphFrames. Instead we want it as an edge list.
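The sparsity figure can be checked directly from the matrix. A sketch on a small stand-in array (for the real `M` loaded above, the result is about 0.007):

```python
import numpy as np

# Small stand-in for the real 31346 x 31346 adjacency matrix.
M = np.zeros((100, 100), dtype=int)
M[3, 7] = 1
M[10, 42] = 1

# Fraction of nonzero entries; for the real M this is roughly 0.007 (0.7%).
density = np.count_nonzero(M) / M.size
print(density)  # 0.0002 for this toy matrix
```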

/*
** Example on how we can read the data.
*/
val Ms = spark.read.format("csv").option("sep", ",").option("maxColumns", 40000).load("/FileStore/shared_uploads/adlindhe@kth.se/M.csv")

Thus we rewrite it as an edge list, which we can then load easily into GraphFrames.

# Loop over the whole matrix. This takes time (~ 1h).

edges_file = open("/dbfs/FileStore/shared_uploads/petterre@kth.se/edges.csv", "w")

for i in range(len(M)):
  for j in range(len(M[i])):
    # Write one "src,dst,edge" row per edge; entries may be stored as
    # floats by np.load, so cast to int before using them as a count.
    for k in range(int(M[i][j])):
      edges_file.write(str(i) + "," + str(j) + ",edge\n")

edges_file.close()
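The triple loop takes on the order of an hour; the same rows can be produced much faster with a vectorized sketch using `np.nonzero` and `np.repeat`, assuming `M` holds nonnegative integer edge counts:

```python
import numpy as np

# Toy matrix with a repeated edge, to mimic integer multiplicities.
M = np.array([[0, 2],
              [1, 0]])

counts = M[M > 0]           # multiplicity of each nonzero entry (row-major order)
i, j = np.nonzero(M)        # row/column indices of the nonzero entries
src = np.repeat(i, counts)  # repeat each source index by its count
dst = np.repeat(j, counts)

# Same "src,dst,edge" rows the loop above writes to edges.csv.
rows = [f"{a},{b},edge" for a, b in zip(src, dst)]
print(rows)  # ['0,1,edge', '0,1,edge', '1,0,edge']
```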

Look at the first few lines to check that the output is reasonable.

head /dbfs/FileStore/shared_uploads/petterre@kth.se/edges.csv
Ms.cache // cache the DataFrame this time
Ms.count // now after this action, the next count will be fast...
display(Ms)