Databricks notebook source exported at Sun, 26 Jun 2016 01:45:30 UTC

Scalable Data Science

Course Project by Akinwande Atanda

#Tweet Analytics

Presentation contents.

Creating a Machine Learning Pipeline without a Loop

- The elasticNetParam coefficient is fixed at 1.0, which makes the regularization a pure L1 (lasso) penalty.
- Read the Spark ML documentation for Logistic Regression.
- The dataset "pos_neg_category" can be split into two or three categories, as done in the next note. In this note, the dataset is randomly split into training and testing data.
- This notebook can be uploaded to create a job for scheduled training and testing of the logistic regression classifier.
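
To make the note on elasticNetParam concrete, the plain-Python sketch below evaluates the elastic-net penalty in the form given by the Spark ML documentation: regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2). The function name `elastic_net_penalty` and the example weights are assumptions made for illustration; they are not part of the notebook.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty = reg_param * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


# Hypothetical weight vector, used only for illustration.
weights = [0.5, -2.0, 0.0, 1.5]

# elastic_net_param = 1.0 keeps only the L1 term (the setting used in this notebook).
lasso = elastic_net_penalty(weights, reg_param=0.0001, elastic_net_param=1.0)
# elastic_net_param = 0.0 keeps only the L2 term.
ridge = elastic_net_penalty(weights, reg_param=0.0001, elastic_net_param=0.0)
print("L1 (lasso) penalty:", lasso)
print("L2 (ridge) penalty:", ridge)
```

With elasticNetParam at 1.0 the squared-weight term drops out entirely, which tends to push many coefficients to exactly zero. That can be useful when the feature vector has 50000 hashed dimensions.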

Import the required Python libraries:

From the PySpark Machine Learning module, import the following:
- Pipeline;
- Binarizer, Tokenizer and HashingTF (hashing term frequency) from the feature package;
- LogisticRegression from the classification package;
- MulticlassClassificationEvaluator from the evaluation package.

Read the PySpark ML package documentation for more details.

# Import only the classes this notebook uses instead of wildcard imports
from pyspark.ml import Pipeline
from pyspark.ml.feature import Binarizer, Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

Set the Stages (Binarizer, Tokenizer, Hash Text Features, and Logistic Regression Classifier Model)

# Named "binarizer" rather than "bin" to avoid shadowing Python's built-in bin()
binarizer = Binarizer(inputCol="category", outputCol="label", threshold=0.5)  # category > 0.5 becomes label 1.0 (positive), otherwise 0.0 (negative)
tok = Tokenizer(inputCol="review", outputCol="word")  # Note: the "words" column in the original table can also contain sentences that can be tokenized
hashTF = HashingTF(inputCol=tok.getOutputCol(), numFeatures=50000, outputCol="features")  # Hash tokens into a 50000-dimensional term-frequency vector
lr = LogisticRegression(maxIter=10, regParam=0.0001, elasticNetParam=1.0)  # elasticNetParam = 1.0 gives an L1 penalty
pipeline = Pipeline(stages=[binarizer, tok, hashTF, lr])
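
As a rough illustration of what the three feature stages do to a single row, the sketch below mimics them in plain Python without Spark. It is an approximation under stated assumptions: the helper names (`binarize`, `tokenize`, `hashing_tf`) and the sample row are hypothetical, and `zlib.crc32` is a stand-in hash, so the indices will not match those produced by Spark's HashingTF.

```python
import zlib


def binarize(value, threshold=0.5):
    """Return 1.0 when value is strictly above the threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def hashing_tf(tokens, num_features=50000):
    """Map tokens to a sparse {index: count} term-frequency vector."""
    counts = {}
    for token in tokens:
        # crc32 is a stand-in hash chosen for determinism; Spark uses its own.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts


# Hypothetical row, used only for illustration.
row = {"category": 1.0, "review": "Great phone great battery"}
label = binarize(row["category"])
words = tokenize(row["review"])
features = hashing_tf(words)
print(label, words, features)
```

Because the index is a hash taken modulo the number of features, two different words can land on the same index. A large numFeatures such as 50000 keeps such collisions rare for short reviews.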

Convert the imported featurized dataset to a DataFrame

df = table("pos_neg_category")

Randomly split the DataFrame into training and testing sets

(trainingData, testData) = df.randomSplit([0.7, 0.3])
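
The split above is random, so the two sets hold roughly, not exactly, 70% and 30% of the rows, and they change from run to run unless a seed is supplied (PySpark's `randomSplit` accepts an optional `seed` argument). The plain-Python sketch below illustrates the idea under simple assumptions: the function `random_split` and the stand-in rows are hypothetical, and this is not Spark's implementation.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split with probability proportional to weights."""
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point round-off
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform draw in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))  # stand-in for DataFrame rows
training, testing = random_split(rows, [0.7, 0.3], seed=42)
print(len(training), len(testing))
```

Fixing the seed makes the split repeatable, which helps when comparing accuracy figures across scheduled runs of the same job.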

Fit the pipeline to the training dataset

model = pipeline.fit(trainingData)

Test the predictive performance of the fitted model on the test dataset

predictionModel = model.transform(testData)  # DataFrame of test rows with prediction columns appended

display(predictionModel.select("label", "prediction", "review", "probability"))  # probability = [P(0, negative), P(1, positive)]

predictionModel.select("label", "prediction", "review", "probability").show(10)  # Same columns, first 10 rows as plain text
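
To clarify how the "probability" and "prediction" columns relate for a binary logistic regression, the sketch below turns a linear margin into a two-element probability vector and then into a predicted label. The helper names and the margins are hypothetical, and the 0.5 threshold is assumed to be the default since the notebook does not set one.

```python
import math


def predict_proba(margin):
    """Return [P(label=0), P(label=1)] for a given linear margin w.x + b."""
    p_positive = 1.0 / (1.0 + math.exp(-margin))
    return [1.0 - p_positive, p_positive]


def predict(probability, threshold=0.5):
    """Predict 1.0 when P(label=1) exceeds the threshold, else 0.0."""
    return 1.0 if probability[1] > threshold else 0.0


for margin in (-2.0, 0.0, 3.0):  # hypothetical margins
    proba = predict_proba(margin)
    print(margin, proba, predict(proba))
```

The first element of the vector is the probability of the negative class and the second is the probability of the positive class, so the two always sum to one.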

Assess the accuracy of the algorithm

# Spark 2.0+ expects metricName="accuracy"; Spark 1.x used "precision" for this same overall metric
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictionModel)

print("Logistic Regression Classifier Accuracy Rate = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))
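
For reference, the accuracy rate reported above is the fraction of test rows whose prediction equals the label, and the test error is its complement. The sketch below computes both in plain Python. The function `accuracy_rate` and the (label, prediction) pairs are made up for illustration; they are not results from this notebook.

```python
def accuracy_rate(pairs):
    """Fraction of (label, prediction) pairs where the two values agree."""
    pairs = list(pairs)
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / float(len(pairs))


# Hypothetical (label, prediction) pairs, used only for illustration.
example_pairs = [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0)]
example_accuracy = accuracy_rate(example_pairs)
print("Accuracy = %g" % example_accuracy)
print("Test Error = %g" % (1.0 - example_accuracy))
```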
