ScaDaMaLe Course site and book

This is a 2019-2021 augmentation and update of Adam Breindel's initial notebooks.

Thanks to Christian von Koch and William Anzén for their contributions towards making these materials Spark 3.0.1 and Python 3+ compliant.

As we dive into more hands-on work, let's recap some basic guidelines:

  1. The structure of your network is the first thing to work out, before worrying about the precise number of neurons, size of convolution filters, etc.

  2. "Business records" or fairly (ideally) uncorrelated predictors: use Dense perceptron layer(s).

  3. Data that has 2-D patterns (e.g., images): use 2D convolution layer(s).

  4. For activation of hidden layers, when in doubt, use ReLU.

  5. Output (see the short Keras sketch after this list):

  • Regression: 1 neuron with linear activation
  • k-way classification: k neurons with softmax activation

  6. Deeper networks are "smarter" than wider networks (in terms of abstraction).

  7. More neurons & layers \(\to\) more capacity \(\to\) more data needed \(\to\) more regularization (to prevent overfitting).

  8. If you don't have a specific reason not to use the "adam" optimizer, use it.

  9. Losses:

  • For regression or "wide" content matching (e.g., large-image similarity), use mean squared error
  • For classification or narrow content matching, use cross-entropy

  10. As you simplify and abstract from your raw data, you should need fewer features/parameters, so your layers probably become smaller and simpler.
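To make the output-layer and loss guidelines concrete, here is a minimal Keras sketch (not part of the lab below; the hidden-layer size and input dimension are arbitrary placeholders) contrasting a regression head with a k-way classification head:

from keras.models import Sequential
from keras.layers import Dense

# Regression: 1 output neuron, linear activation, mean-squared-error loss
regressor = Sequential()
regressor.add(Dense(20, input_dim=8, activation='relu'))   # hidden layer (size is arbitrary here)
regressor.add(Dense(1, activation='linear'))               # single linear output
regressor.compile(loss='mean_squared_error', optimizer='adam')

# k-way classification: k output neurons, softmax activation, cross-entropy loss
k = 10
classifier = Sequential()
classifier.add(Dense(20, input_dim=8, activation='relu'))
classifier.add(Dense(k, activation='softmax'))             # one neuron per class
classifier.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])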

As a baseline, let's start a lab running with what we already know.

We'll take our deep feed-forward multilayer perceptron network, with ReLU activations and reasonable initializations, and apply it to learning the MNIST digits.

The main part of the code looks like the following (the full code you can run is in the next cell):

# imports, setup, load data sets

model = Sequential()
model.add(Dense(20, input_dim=784, kernel_initializer='normal', activation='relu'))
model.add(Dense(15, kernel_initializer='normal', activation='relu'))
model.add(Dense(10, kernel_initializer='normal', activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

categorical_labels = to_categorical(y_train, num_classes=10)

history = model.fit(X_train, categorical_labels, epochs=40, batch_size=100)

# print metrics, plot errors

Note the changes, which are largely about building a classifier instead of a regression model:

  • Output layer has one neuron per category, with softmax activation
  • Loss function is cross-entropy loss (see the formula after this list)
  • Accuracy metric is categorical accuracy
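For reference, the categorical cross-entropy for a single example with one-hot label \(y\) and predicted class probabilities \(\hat{y}\) over the 10 digit classes is

\[ L(y, \hat{y}) = -\sum_{k=1}^{10} y_k \log \hat{y}_k, \]

which, because \(y\) is one-hot, reduces to \(-\log \hat{y}_c\) for the true class \(c\).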
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
import sklearn.datasets
import datetime
import matplotlib.pyplot as plt
import numpy as np

train_libsvm = "/dbfs/databricks-datasets/mnist-digits/data-001/mnist-digits-train.txt"
test_libsvm = "/dbfs/databricks-datasets/mnist-digits/data-001/mnist-digits-test.txt"

X_train, y_train = sklearn.datasets.load_svmlight_file(train_libsvm, n_features=784)
X_train = X_train.toarray()

X_test, y_test = sklearn.datasets.load_svmlight_file(test_libsvm, n_features=784)
X_test = X_test.toarray()

model = Sequential()
model.add(Dense(20, input_dim=784, kernel_initializer='normal', activation='relu'))
model.add(Dense(15, kernel_initializer='normal', activation='relu'))
model.add(Dense(10, kernel_initializer='normal', activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

categorical_labels = to_categorical(y_train, num_classes=10)
start = datetime.datetime.today()

history = model.fit(X_train, categorical_labels, epochs=40, batch_size=100, validation_split=0.1, verbose=2)

scores = model.evaluate(X_test, to_categorical(y_test, num_classes=10))

print()
for i in range(len(model.metrics_names)):
	print("%s: %f" % (model.metrics_names[i], scores[i]))

print ("Start: " + str(start))
end = datetime.datetime.today()
print ("End: " + str(end))
print ("Elapse: " + str(end-start))
Train on 54000 samples, validate on 6000 samples
Epoch 1/40
 - 3s - loss: 0.5395 - categorical_accuracy: 0.8275 - val_loss: 0.2016 - val_categorical_accuracy: 0.9428
Epoch 2/40
 - 2s - loss: 0.2209 - categorical_accuracy: 0.9353 - val_loss: 0.1561 - val_categorical_accuracy: 0.9545
Epoch 3/40
 - 2s - loss: 0.1777 - categorical_accuracy: 0.9466 - val_loss: 0.1491 - val_categorical_accuracy: 0.9568
Epoch 4/40
 - 2s - loss: 0.1561 - categorical_accuracy: 0.9532 - val_loss: 0.1335 - val_categorical_accuracy: 0.9612
Epoch 5/40
 - 2s - loss: 0.1376 - categorical_accuracy: 0.9581 - val_loss: 0.1303 - val_categorical_accuracy: 0.9640
Epoch 6/40
 - 3s - loss: 0.1252 - categorical_accuracy: 0.9621 - val_loss: 0.1334 - val_categorical_accuracy: 0.9612
Epoch 7/40
 - 3s - loss: 0.1171 - categorical_accuracy: 0.9641 - val_loss: 0.1292 - val_categorical_accuracy: 0.9650
Epoch 8/40
 - 3s - loss: 0.1112 - categorical_accuracy: 0.9657 - val_loss: 0.1388 - val_categorical_accuracy: 0.9638
Epoch 9/40
 - 3s - loss: 0.1054 - categorical_accuracy: 0.9683 - val_loss: 0.1241 - val_categorical_accuracy: 0.9652
Epoch 10/40
 - 2s - loss: 0.0993 - categorical_accuracy: 0.9693 - val_loss: 0.1314 - val_categorical_accuracy: 0.9652
Epoch 11/40
 - 3s - loss: 0.1004 - categorical_accuracy: 0.9690 - val_loss: 0.1426 - val_categorical_accuracy: 0.9637
Epoch 12/40
 - 3s - loss: 0.0932 - categorical_accuracy: 0.9713 - val_loss: 0.1409 - val_categorical_accuracy: 0.9653
Epoch 13/40
 - 3s - loss: 0.0915 - categorical_accuracy: 0.9716 - val_loss: 0.1403 - val_categorical_accuracy: 0.9628
Epoch 14/40
 - 3s - loss: 0.0883 - categorical_accuracy: 0.9733 - val_loss: 0.1347 - val_categorical_accuracy: 0.9662
Epoch 15/40
 - 3s - loss: 0.0855 - categorical_accuracy: 0.9737 - val_loss: 0.1371 - val_categorical_accuracy: 0.9683
Epoch 16/40
 - 3s - loss: 0.0855 - categorical_accuracy: 0.9736 - val_loss: 0.1453 - val_categorical_accuracy: 0.9663
Epoch 17/40
 - 3s - loss: 0.0805 - categorical_accuracy: 0.9744 - val_loss: 0.1374 - val_categorical_accuracy: 0.9665
Epoch 18/40
 - 2s - loss: 0.0807 - categorical_accuracy: 0.9756 - val_loss: 0.1348 - val_categorical_accuracy: 0.9685
Epoch 19/40
 - 2s - loss: 0.0809 - categorical_accuracy: 0.9748 - val_loss: 0.1433 - val_categorical_accuracy: 0.9662
Epoch 20/40
 - 2s - loss: 0.0752 - categorical_accuracy: 0.9766 - val_loss: 0.1415 - val_categorical_accuracy: 0.9667
Epoch 21/40
 - 3s - loss: 0.0736 - categorical_accuracy: 0.9771 - val_loss: 0.1575 - val_categorical_accuracy: 0.9650
Epoch 22/40
 - 3s - loss: 0.0743 - categorical_accuracy: 0.9768 - val_loss: 0.1517 - val_categorical_accuracy: 0.9670
Epoch 23/40
 - 3s - loss: 0.0727 - categorical_accuracy: 0.9770 - val_loss: 0.1458 - val_categorical_accuracy: 0.9680
Epoch 24/40
 - 3s - loss: 0.0710 - categorical_accuracy: 0.9778 - val_loss: 0.1618 - val_categorical_accuracy: 0.9645
Epoch 25/40
 - 3s - loss: 0.0687 - categorical_accuracy: 0.9789 - val_loss: 0.1499 - val_categorical_accuracy: 0.9650
Epoch 26/40
 - 3s - loss: 0.0680 - categorical_accuracy: 0.9787 - val_loss: 0.1448 - val_categorical_accuracy: 0.9685
Epoch 27/40
 - 3s - loss: 0.0685 - categorical_accuracy: 0.9788 - val_loss: 0.1533 - val_categorical_accuracy: 0.9665
Epoch 28/40
 - 3s - loss: 0.0677 - categorical_accuracy: 0.9786 - val_loss: 0.1668 - val_categorical_accuracy: 0.9640
Epoch 29/40
 - 3s - loss: 0.0631 - categorical_accuracy: 0.9809 - val_loss: 0.1739 - val_categorical_accuracy: 0.9632
Epoch 30/40
 - 3s - loss: 0.0687 - categorical_accuracy: 0.9780 - val_loss: 0.1584 - val_categorical_accuracy: 0.9653
Epoch 31/40
 - 3s - loss: 0.0644 - categorical_accuracy: 0.9799 - val_loss: 0.1724 - val_categorical_accuracy: 0.9678
Epoch 32/40
 - 3s - loss: 0.0621 - categorical_accuracy: 0.9807 - val_loss: 0.1709 - val_categorical_accuracy: 0.9648
Epoch 33/40
 - 4s - loss: 0.0618 - categorical_accuracy: 0.9804 - val_loss: 0.2055 - val_categorical_accuracy: 0.9592
Epoch 34/40
 - 4s - loss: 0.0620 - categorical_accuracy: 0.9804 - val_loss: 0.1752 - val_categorical_accuracy: 0.9650
Epoch 35/40
 - 3s - loss: 0.0586 - categorical_accuracy: 0.9820 - val_loss: 0.1726 - val_categorical_accuracy: 0.9643
Epoch 36/40
 - 3s - loss: 0.0606 - categorical_accuracy: 0.9804 - val_loss: 0.1851 - val_categorical_accuracy: 0.9622
Epoch 37/40
 - 3s - loss: 0.0592 - categorical_accuracy: 0.9814 - val_loss: 0.1820 - val_categorical_accuracy: 0.9643
Epoch 38/40
 - 3s - loss: 0.0573 - categorical_accuracy: 0.9823 - val_loss: 0.1874 - val_categorical_accuracy: 0.9638
Epoch 39/40
 - 3s - loss: 0.0609 - categorical_accuracy: 0.9808 - val_loss: 0.1843 - val_categorical_accuracy: 0.9617
Epoch 40/40
 - 3s - loss: 0.0573 - categorical_accuracy: 0.9823 - val_loss: 0.1774 - val_categorical_accuracy: 0.9628

10000/10000 [==============================] - 1s 73us/step
loss: 0.213623
categorical_accuracy: 0.956900
Start: 2021-02-10 10:21:41.350257
End: 2021-02-10 10:23:35.823391
Elapsed: 0:01:54.473134
import matplotlib.pyplot as plt

# Plot training vs. validation loss per epoch
fig, ax = plt.subplots()
fig.set_size_inches((5,5))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
display(fig)
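You can plot the accuracy curves the same way. This is a sketch assuming the same `history` object returned by `model.fit` above (the history keys follow the `categorical_accuracy` metric we compiled with):

fig, ax = plt.subplots()
fig.set_size_inches((5,5))
plt.plot(history.history['categorical_accuracy'])      # training accuracy per epoch
plt.plot(history.history['val_categorical_accuracy'])  # validation accuracy per epoch
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='lower right')
display(fig)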

What are the big takeaways from this experiment?

  1. We get a pretty impressive "apparent error" rate (i.e., accuracy measured on the training data) right from the start! A small network gets us to a training accuracy of about 97% by epoch 20.
  2. The model appears to continue to learn if we let it run, although it does slow down and oscillate a bit.
  3. Our held-out (validation) accuracy is about 96% after 5 epochs and never gets better ... if anything, the validation loss gets worse!
  4. Therefore, we are overfitting very quickly... most of the "training" turns out to be a waste.
  5. For what it's worth, we get 95% accuracy without much work.

This is not terrible compared to other, non-neural-network approaches to the problem. After all, we could probably tweak this a bit and do even better.

But we talked about using deep learning to solve "95%" problems or "98%" problems ... where one error in 20, or in 50, simply won't work. If we can get to "multiple nines" of accuracy, then we can do things like automate mail sorting and translation, create cars that react properly (all the time) to street signs, and build control systems for robots or drones that function autonomously.

You Try Now!

Try two more experiments (try them separately); a sketch of both variants follows the list:

  1. Add a third hidden layer.
  2. Increase the size of the hidden layers.
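One way to set these up (a sketch only; the extra layer and the larger layer sizes below are just one reasonable choice, not prescribed by the lab):

# Experiment 1: add a third hidden layer to the baseline topology
deeper = Sequential()
deeper.add(Dense(20, input_dim=784, kernel_initializer='normal', activation='relu'))
deeper.add(Dense(15, kernel_initializer='normal', activation='relu'))
deeper.add(Dense(12, kernel_initializer='normal', activation='relu'))   # extra hidden layer
deeper.add(Dense(10, kernel_initializer='normal', activation='softmax'))
deeper.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

# Experiment 2: make the original two hidden layers roughly 10x wider
wider = Sequential()
wider.add(Dense(200, input_dim=784, kernel_initializer='normal', activation='relu'))
wider.add(Dense(150, kernel_initializer='normal', activation='relu'))
wider.add(Dense(10, kernel_initializer='normal', activation='softmax'))
wider.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])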

Adding another layer slows things down a little (why?) but doesn't seem to make a difference in accuracy.

Adding a lot more neurons to the original topology slows things down significantly -- roughly 10x as many neurons yields only a marginal increase in accuracy. Notice also (in the plot) that the learning clearly degrades after epoch 50 or so.