021_recognizeActivityByRandomForest(Scala)


ScaDaMaLe Course site and book

Activity Recognition from Accelerometer

1. ETL Using SparkSQL Windows

2. Prediction using Random Forest

This work is a simpler databricksification of Amira Lakhal's more complex framework for activity recognition:

Amira's video

See the section 0. Download and Load Data below first.

  • If you have not yet downloaded and loaded the data into the distributed file system, do that first and then come back here.
val data = sc.textFile("dbfs:///datasets/sds/ActivityRecognition/dataTraining.csv") // assumes data is loaded
data: org.apache.spark.rdd.RDD[String] = dbfs:///datasets/sds/ActivityRecognition/dataTraining.csv MapPartitionsRDD[1] at textFile at command-561044440962565:1
data.take(5).foreach(println)
"user_id","activity","timeStampAsLong","x","y","z"
"user_001","Jumping",1446047227606,"4.33079","-12.72175","-3.18118"
"user_001","Jumping",1446047227671,"0.575403","-0.727487","2.95007"
"user_001","Jumping",1446047227735,"-1.60885","3.52607","-0.1922"
"user_001","Jumping",1446047227799,"0.690364","-0.037722","1.72382"

HW (Optional): Repeat this analysis using java.time instead of the unix timestamp.
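As a hint for this exercise: the values in `timeStampAsLong` are epoch milliseconds, so they map directly onto `java.time.Instant`. This is a minimal sketch of the conversion in plain Scala (the Spark-column variant is only indicated in a comment, and the column name `ts` there is an illustrative choice, not part of the original analysis):

```scala
import java.time.Instant

// An epoch-millisecond value converts straight to a java.time.Instant.
// 1446047227606L is the timestamp of the first "Jumping" record above.
val t: Instant = Instant.ofEpochMilli(1446047227606L)

// t.toString renders the instant as an ISO-8601 UTC string:
println(t) // 2015-10-28T15:47:07.606Z

// Inside Spark, the same idea as a DataFrame column would look like
// (on the dataDF read in below):
//   dataDF.withColumn("ts", (col("timeStampAsLong") / 1000).cast("timestamp"))
```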

We can read in the data either using sqlContext or with spark, our premade entry points.

val dataDF = sqlContext.read    
    .format("com.databricks.spark.csv") // use spark.csv package
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .option("delimiter", ",") // Specify the delimiter as ','
    .load("dbfs:///datasets/sds/ActivityRecognition/dataTraining.csv")
dataDF: org.apache.spark.sql.DataFrame = [user_id: string, activity: string ... 4 more fields]
val dataDFnew = spark.read.format("csv") 
  .option("inferSchema", "true") 
  .option("header", "true") 
  .option("sep", ",") 
  .load("dbfs:///datasets/sds/ActivityRecognition/dataTraining.csv")
dataDFnew: org.apache.spark.sql.DataFrame = [user_id: string, activity: string ... 4 more fields]
dataDFnew.printSchema()
root
 |-- user_id: string (nullable = true)
 |-- activity: string (nullable = true)
 |-- timeStampAsLong: long (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)
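Since the schema is now known, the `inferSchema` pass over the file can be avoided by declaring the column types up front. A sketch (the value name `dataDFtyped` is just an illustrative choice):

```scala
import org.apache.spark.sql.types._

// Declare the known column types explicitly instead of inferring them,
// which saves Spark an extra pass over the CSV file.
val schema = StructType(Seq(
  StructField("user_id", StringType, nullable = true),
  StructField("activity", StringType, nullable = true),
  StructField("timeStampAsLong", LongType, nullable = true),
  StructField("x", DoubleType, nullable = true),
  StructField("y", DoubleType, nullable = true),
  StructField("z", DoubleType, nullable = true)
))

val dataDFtyped = spark.read.format("csv")
  .option("header", "true")   // still skip the header row
  .schema(schema)             // no inferSchema pass needed
  .load("dbfs:///datasets/sds/ActivityRecognition/dataTraining.csv")
```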
display(dataDF) // zp.show(dataDF)

|    | user_id  | activity | timeStampAsLong | x         | y         | z         |
|----|----------|----------|-----------------|-----------|-----------|-----------|
| 1  | user_001 | Jumping  | 1446047227606   | 4.33079   | -12.72175 | -3.18118  |
| 2  | user_001 | Jumping  | 1446047227671   | 0.575403  | -0.727487 | 2.95007   |
| 3  | user_001 | Jumping  | 1446047227735   | -1.60885  | 3.52607   | -0.1922   |
| 4  | user_001 | Jumping  | 1446047227799   | 0.690364  | -0.037722 | 1.72382   |
| 5  | user_001 | Jumping  | 1446047227865   | 3.44943   | -1.68549  | 2.29862   |
| 6  | user_001 | Jumping  | 1446047227930   | 1.87829   | -1.91542  | 0.880768  |
| 7  | user_001 | Jumping  | 1446047227995   | 1.57173   | -5.86241  | -3.75599  |
| 8  | user_001 | Jumping  | 1446047228059   | 3.41111   | -17.93331 | 0.535886  |
| 9  | user_001 | Jumping  | 1446047228123   | 3.18118   | -19.58108 | 5.74745   |
| 10 | user_001 | Jumping  | 1446047228189   | 7.85626   | -19.2362  | 0.804128  |
| 11 | user_001 | Jumping  | 1446047228253   | 1.26517   | -8.85139  | 2.18366   |
| 12 | user_001 | Jumping  | 1446047228318   | 0.077239  | 1.15021   | 1.53221   |
| 13 | user_001 | Jumping  | 1446047228383   | 0.230521  | 2.0699    | -1.41845  |
| 14 | user_001 | Jumping  | 1446047228447   | 0.652044  | -0.497565 | 1.76214   |
| 15 | user_001 | Jumping  | 1446047228512   | 1.53341   | -0.305964 | 1.41725   |
| 16 | user_001 | Jumping  | 1446047228578   | -1.07237  | -1.95374  | 0.191003  |
| 17 | user_001 | Jumping  | 1446047228642   | 2.75966   | -13.75639 | 0.191003  |

Truncated results, showing first 1000 rows.

dataDF.count()
res5: Long = 13679
dataDF.select($"user_id").distinct().show()
+--------+
| user_id|
+--------+
|user_002|
|user_006|
|user_005|
|user_001|
|user_007|
|user_003|
|user_004|
+--------+
dataDF.select($"activity").distinct().show()
+--------+
|activity|
+--------+
| Sitting|
| Walking|
| Jumping|
|Standing|
| Jogging|
+--------+
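Before fitting a Random Forest it is worth checking how balanced these five activity classes are, since a heavily skewed label distribution can bias the classifier. A sketch using `groupBy` (the actual counts will depend on the data):

```scala
// Count records per activity label to check class balance
// before training the Random Forest.
dataDF.groupBy($"activity")
  .count()
  .orderBy($"count".desc)
  .show()
```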