Databricks notebook source exported at Sun, 26 Jun 2016 01:45:30 UTC

Scalable Data Science

Course Project by Akinwande Atanda

supported by and

The HTML source URL of this Databricks notebook and its recorded Uji (Dogen's Time-Being):


# Tweet Analytics

Presentation contents.

Creating a Machine Learning Pipeline without a Loop

  • The elasticNetParam coefficient is fixed at 1.0
  • Read the Spark ML documentation for Logistic Regression
  • The dataset “pos_neg_category” can be split into two or three categories, as is done in the next notebook. In this notebook, the dataset is randomly split into training and testing data
  • This notebook can be uploaded to create a job for scheduled training and testing of the logistic regression classifier
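For reference, the regularization term that Spark ML documents for its linear models can be written as below. The symbols are notation introduced here, not taken from the notebook: $w$ is the coefficient vector, $\alpha$ stands for elasticNetParam and $\lambda$ for regParam.

```latex
R(w) = \lambda \left( \alpha \, \lVert w \rVert_1 + \frac{1 - \alpha}{2} \, \lVert w \rVert_2^2 \right),
\qquad \alpha \in [0, 1], \quad \lambda \ge 0

% With \alpha = 1.0 the L2 term vanishes and only the L1 (lasso) penalty remains:
R(w) = \lambda \, \lVert w \rVert_1
```

So fixing elasticNetParam at 1.0 means the classifier is trained with a pure L1 penalty, which tends to drive many coefficients to exactly zero.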

Import the required Python libraries:

  • From the PySpark Machine Learning module, import the following:
    • Pipeline;
    • Binarizer, Tokenizer and HashingTF from the feature package;
    • LogisticRegression from the classification package;
    • MulticlassClassificationEvaluator from the evaluation package
  • Read the PySpark ML package documentation for more details

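from pyspark.ml import Pipeline
from pyspark.ml.feature import Binarizer, Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator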

Set the Stages (Binarizer, Tokenizer, Hash Text Features, and Logistic Regression Classifier Model)

bin = Binarizer(inputCol = "category", outputCol = "label", threshold = 0.5) # Positive reviews > 0.5 threshold
tok = Tokenizer(inputCol = "review", outputCol = "word") #Note: The column "words" in the original table can also contain sentences that can be tokenized
hashTF = HashingTF(inputCol = tok.getOutputCol(), numFeatures = 50000, outputCol = "features")
lr = LogisticRegression(maxIter = 10, regParam = 0.0001, elasticNetParam = 1.0)
pipeline = Pipeline(stages = [bin, tok, hashTF, lr])
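To make the first three stages concrete, here is a minimal plain-Python sketch of roughly what they do to a single row. It is an illustration only, not Spark's implementation: the function names and the sample row are hypothetical, and `zlib.crc32` stands in for Spark's own hash function so that the example is deterministic.

```python
import zlib
from collections import Counter


def binarize(value, threshold=0.5):
    # Like Binarizer: values strictly above the threshold become 1.0, the rest 0.0.
    return 1.0 if value > threshold else 0.0


def tokenize(text):
    # Roughly like Tokenizer: lowercase the text and split it on whitespace.
    return text.lower().split()


def hashing_tf(tokens, num_features=50000):
    # Hashing trick: map each token to a bucket index and count occurrences.
    # crc32 is an assumption made for this sketch; Spark uses a different hash.
    counts = Counter(
        zlib.crc32(tok.encode("utf-8")) % num_features for tok in tokens
    )
    return dict(counts)


# Hypothetical row with the same column names the pipeline expects.
row = {"category": 1.0, "review": "Great phone great battery"}

label = binarize(row["category"])
words = tokenize(row["review"])
features = hashing_tf(words)  # sparse {bucket index: term count}
```

The sparse dictionary plays the role of the "features" vector: with 50000 buckets, distinct words rarely collide, and no vocabulary has to be built before training.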

Convert the imported featurized dataset to dataframe

df = table("pos_neg_category")

Randomly split the dataframe into training and testing set

(trainingData, testData) = df.randomSplit([0.7, 0.3])
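The split above is random per row, so the 70/30 proportions are only approximate and change between runs unless a seed is supplied. The plain-Python sketch below illustrates the idea under that assumption; `random_split` is a hypothetical helper written for this note, not the Spark method.

```python
import random


def random_split(rows, weights, seed=None):
    # Normalize the weights into cumulative upper bounds, e.g. [0.7, 1.0].
    rng = random.Random(seed)
    total = float(sum(weights))
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    # Each row gets one uniform draw and lands in the split whose range holds it.
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))  # stand-in for the rows of the dataframe
training, testing = random_split(rows, [0.7, 0.3], seed=42)
```

In PySpark, `randomSplit` also accepts an optional `seed` argument, which is worth setting when the notebook runs as a scheduled job and results need to be comparable across runs.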

Fit the training dataset into the pipeline

model = pipeline.fit(trainingData)

Test the predictability of the fitted algorithm with test dataset


predictionModel = model.transform(testData)

display(predictionModel.select("label", "prediction", "review", "probability")) # Prob of being 0 (negative) against 1 (positive)

predictionModel.select("label", "prediction", "review", "probability").show(10) # Prob of being 0 (negative) against 1 (positive)
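The "probability" column holds two numbers per review: the probability of class 0 (negative) followed by the probability of class 1 (positive). As a rough sketch of how a binary logistic model turns a linear score into that pair and into the "prediction" column, assuming the default decision threshold of 0.5 (`predict_row` and `margin` are hypothetical names used only here):

```python
import math


def predict_row(margin, threshold=0.5):
    # margin is the linear score (weights . features + intercept) for one review.
    p_positive = 1.0 / (1.0 + math.exp(-margin))  # logistic (sigmoid) function
    probability = [1.0 - p_positive, p_positive]  # [P(negative), P(positive)]
    prediction = 1.0 if p_positive > threshold else 0.0
    return probability, prediction


probability, prediction = predict_row(margin=2.0)
```

A positive margin pushes the second entry above 0.5 and yields a prediction of 1.0; a negative margin does the opposite.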

Assess the accuracy of the algorithm

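# In Spark 1.x, "precision" is the overall precision (correct predictions / all predictions), i.e. the accuracy.
# On Spark 2.0 and later, use metricName="accuracy" instead.
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="precision")
accuracy = evaluator.evaluate(predictionModel)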

print("Logistic Regression Classifier Accuracy Rate = %g " % (accuracy))
print("Test Error = %g " % (1.0 - accuracy))
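The two printed numbers are simple to reproduce by hand. The sketch below computes them in plain Python from a handful of made-up (label, prediction) pairs; the data and the `accuracy_rate` helper are hypothetical and only illustrate the arithmetic.

```python
def accuracy_rate(pairs):
    # Fraction of rows whose predicted class equals the true label.
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / float(len(pairs))


# Toy (label, prediction) pairs, invented for illustration.
pairs = [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0)]

accuracy = accuracy_rate(pairs)
test_error = 1.0 - accuracy

print("Accuracy Rate = %g" % accuracy)
print("Test Error = %g" % test_error)
```

Three of the four toy rows match, so the accuracy is 0.75 and the test error is 0.25.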
