056_DLbyABr_04a-Hands-On-MNIST-MLP(Python)

SDS-2.x, Scalable Data Engineering Science

This is a 2019 augmentation and update of Adam Breindel's initial notebooks.

As we dive into more hands-on work, let's recap some basic guidelines:

  1. The structure of your network is the first thing to work out, before worrying about the precise number of neurons, the size of convolution filters, etc.

  2. "Business records" or fairly (ideally?) uncorrelated predictors -- use Dense Perceptron Layer(s)

  3. Data that has 2-D patterns: 2D Convolution layer(s)

  4. For activation of hidden layers, when in doubt, use ReLU

  5. Output:

    • Regression: 1 neuron with linear activation
    • For k-way classification: k neurons with softmax activation
  6. Deeper networks are "smarter" than wider networks (in terms of abstraction)

  7. More neurons & layers → more capacity → more data → more regularization (to prevent overfitting)

  8. If you don't have any specific reason not to use the "adam" optimizer, use that one

  9. Errors:

    • For regression or "wide" content matching (e.g., large image similarity), use mean-squared-error
    • For classification or narrow content matching, use cross-entropy (points 5, 8, and 9 are illustrated in a sketch after this list)
  10. As you simplify and abstract from your raw data, you should need fewer features/parameters, so your layers probably become smaller and simpler.
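
To make points 5, 8, and 9 concrete, here is a minimal sketch (not part of the original notebook; the layer widths and input_dim are placeholder values) of the two common output-layer / loss pairings in Keras:

from keras.models import Sequential
from keras.layers import Dense

# Regression head: 1 neuron, linear activation, mean-squared-error loss, "adam" optimizer
regressor = Sequential()
regressor.add(Dense(20, input_dim=10, activation='relu'))
regressor.add(Dense(1, activation='linear'))
regressor.compile(loss='mean_squared_error', optimizer='adam')

# k-way classification head (k=10 here): k neurons, softmax activation, cross-entropy loss
classifier = Sequential()
classifier.add(Dense(20, input_dim=10, activation='relu'))
classifier.add(Dense(10, activation='softmax'))
classifier.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])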

As a baseline, let's start a lab running with what we already know.

We'll take our deep feed-forward multilayer perceptron network, with ReLU activations and reasonable initializations, and apply it to learning the MNIST digits.

The main part of the code looks like the following (the full code you can run is in the next cell):

# imports, setup, load data sets

model = Sequential()
model.add(Dense(20, input_dim=784, kernel_initializer='normal', activation='relu'))
model.add(Dense(15, kernel_initializer='normal', activation='relu'))
model.add(Dense(10, kernel_initializer='normal', activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

categorical_labels = to_categorical(y_train, num_classes=10)

history = model.fit(X_train, categorical_labels, epochs=100, batch_size=100)

# print metrics, plot errors

Note the changes, which are largely about building a classifier instead of a regression model:

  • Output layer has one neuron per category, with softmax activation
  • Loss function is cross-entropy loss
  • Accuracy metric is categorical accuracy

from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
import sklearn.datasets
import datetime
import matplotlib.pyplot as plt
import numpy as np

train_libsvm = "/dbfs/databricks-datasets/mnist-digits/data-001/mnist-digits-train.txt"
test_libsvm = "/dbfs/databricks-datasets/mnist-digits/data-001/mnist-digits-test.txt"

X_train, y_train = sklearn.datasets.load_svmlight_file(train_libsvm, n_features=784)
X_train = X_train.toarray()

X_test, y_test = sklearn.datasets.load_svmlight_file(test_libsvm, n_features=784)
X_test = X_test.toarray()

model = Sequential()
model.add(Dense(20, input_dim=784, kernel_initializer='normal', activation='relu'))
model.add(Dense(15, kernel_initializer='normal', activation='relu'))
model.add(Dense(10, kernel_initializer='normal', activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

categorical_labels = to_categorical(y_train, num_classes=10)
start = datetime.datetime.today()

history = model.fit(X_train, categorical_labels, epochs=40, batch_size=100, validation_split=0.1, verbose=2)

scores = model.evaluate(X_test, to_categorical(y_test, num_classes=10))

print()
for i in range(len(model.metrics_names)):
    print("%s: %f" % (model.metrics_names[i], scores[i]))

print("Start: " + str(start))
end = datetime.datetime.today()
print("End: " + str(end))
print("Elapsed: " + str(end - start))
Using TensorFlow backend.
Train on 54000 samples, validate on 6000 samples
Epoch 1/40 - 1s - loss: 0.6256 - categorical_accuracy: 0.8013 - val_loss: 0.2426 - val_categorical_accuracy: 0.9318
Epoch 2/40 - 1s - loss: 0.2634 - categorical_accuracy: 0.9227 - val_loss: 0.2033 - val_categorical_accuracy: 0.9437
Epoch 3/40 - 1s - loss: 0.2136 - categorical_accuracy: 0.9376 - val_loss: 0.1813 - val_categorical_accuracy: 0.9515
Epoch 4/40 - 1s - loss: 0.1829 - categorical_accuracy: 0.9462 - val_loss: 0.1691 - val_categorical_accuracy: 0.9522
Epoch 5/40 - 1s - loss: 0.1646 - categorical_accuracy: 0.9517 - val_loss: 0.1693 - val_categorical_accuracy: 0.9520
Epoch 6/40 - 1s - loss: 0.1478 - categorical_accuracy: 0.9558 - val_loss: 0.1525 - val_categorical_accuracy: 0.9598
Epoch 7/40 - 1s - loss: 0.1370 - categorical_accuracy: 0.9600 - val_loss: 0.1405 - val_categorical_accuracy: 0.9610
Epoch 8/40 - 1s - loss: 0.1280 - categorical_accuracy: 0.9625 - val_loss: 0.1371 - val_categorical_accuracy: 0.9615
Epoch 9/40 - 1s - loss: 0.1211 - categorical_accuracy: 0.9639 - val_loss: 0.1721 - val_categorical_accuracy: 0.9503
Epoch 10/40 - 1s - loss: 0.1177 - categorical_accuracy: 0.9645 - val_loss: 0.1383 - val_categorical_accuracy: 0.9648
Epoch 11/40 - 1s - loss: 0.1101 - categorical_accuracy: 0.9672 - val_loss: 0.1323 - val_categorical_accuracy: 0.9650
Epoch 12/40 - 1s - loss: 0.1037 - categorical_accuracy: 0.9693 - val_loss: 0.1647 - val_categorical_accuracy: 0.9585
Epoch 13/40 - 1s - loss: 0.1002 - categorical_accuracy: 0.9702 - val_loss: 0.1459 - val_categorical_accuracy: 0.9603
Epoch 14/40 - 1s - loss: 0.0990 - categorical_accuracy: 0.9697 - val_loss: 0.1997 - val_categorical_accuracy: 0.9448
Epoch 15/40 - 1s - loss: 0.0963 - categorical_accuracy: 0.9705 - val_loss: 0.1397 - val_categorical_accuracy: 0.9640
Epoch 16/40 - 1s - loss: 0.0911 - categorical_accuracy: 0.9718 - val_loss: 0.1362 - val_categorical_accuracy: 0.9670
Epoch 17/40 - 1s - loss: 0.0881 - categorical_accuracy: 0.9731 - val_loss: 0.1592 - val_categorical_accuracy: 0.9568
Epoch 18/40 - 2s - loss: 0.0856 - categorical_accuracy: 0.9739 - val_loss: 0.1480 - val_categorical_accuracy: 0.9632
Epoch 19/40 - 1s - loss: 0.0850 - categorical_accuracy: 0.9740 - val_loss: 0.1585 - val_categorical_accuracy: 0.9625
Epoch 20/40 - 2s - loss: 0.0802 - categorical_accuracy: 0.9755 - val_loss: 0.1500 - val_categorical_accuracy: 0.9630
Epoch 21/40 - 1s - loss: 0.0798 - categorical_accuracy: 0.9751 - val_loss: 0.1451 - val_categorical_accuracy: 0.9665
Epoch 22/40 - 2s - loss: 0.0780 - categorical_accuracy: 0.9762 - val_loss: 0.1499 - val_categorical_accuracy: 0.9642
Epoch 23/40 - 2s - loss: 0.0772 - categorical_accuracy: 0.9761 - val_loss: 0.1521 - val_categorical_accuracy: 0.9607
Epoch 24/40 - 2s - loss: 0.0746 - categorical_accuracy: 0.9772 - val_loss: 0.1568 - val_categorical_accuracy: 0.9630
Epoch 25/40 - 2s - loss: 0.0725 - categorical_accuracy: 0.9773 - val_loss: 0.1655 - val_categorical_accuracy: 0.9603
Epoch 26/40 - 2s - loss: 0.0723 - categorical_accuracy: 0.9776 - val_loss: 0.1782 - val_categorical_accuracy: 0.9583
Epoch 27/40 - 2s - loss: 0.0737 - categorical_accuracy: 0.9774 - val_loss: 0.1672 - val_categorical_accuracy: 0.9597
Epoch 28/40 - 1s - loss: 0.0700 - categorical_accuracy: 0.9789 - val_loss: 0.1846 - val_categorical_accuracy: 0.9562
Epoch 29/40 - 1s - loss: 0.0679 - categorical_accuracy: 0.9793 - val_loss: 0.1665 - val_categorical_accuracy: 0.9615
Epoch 30/40 - 1s - loss: 0.0711 - categorical_accuracy: 0.9782 - val_loss: 0.1678 - val_categorical_accuracy: 0.9602
Epoch 31/40 - 1s - loss: 0.0681 - categorical_accuracy: 0.9790 - val_loss: 0.1656 - val_categorical_accuracy: 0.9612
Epoch 32/40 - 1s - loss: 0.0647 - categorical_accuracy: 0.9793 - val_loss: 0.1738 - val_categorical_accuracy: 0.9608
Epoch 33/40 - 1s - loss: 0.0681 - categorical_accuracy: 0.9794 - val_loss: 0.1838 - val_categorical_accuracy: 0.9578
Epoch 34/40 - 1s - loss: 0.0664 - categorical_accuracy: 0.9797 - val_loss: 0.1817 - val_categorical_accuracy: 0.9593
Epoch 35/40 - 1s - loss: 0.0631 - categorical_accuracy: 0.9808 - val_loss: 0.1737 - val_categorical_accuracy: 0.9632
Epoch 36/40 - 1s - loss: 0.0638 - categorical_accuracy: 0.9802 - val_loss: 0.1842 - val_categorical_accuracy: 0.9607
Epoch 37/40 - 1s - loss: 0.0617 - categorical_accuracy: 0.9808 - val_loss: 0.1834 - val_categorical_accuracy: 0.9602
Epoch 38/40 - 1s - loss: 0.0630 - categorical_accuracy: 0.9803 - val_loss: 0.2150 - val_categorical_accuracy: 0.9570
Epoch 39/40 - 1s - loss: 0.0619 - categorical_accuracy: 0.9806 - val_loss: 0.1846 - val_categorical_accuracy: 0.9615
Epoch 40/40 - 1s - loss: 0.0564 - categorical_accuracy: 0.9830 - val_loss: 0.2025 - val_categorical_accuracy: 0.9608
10000/10000 [==============================] - 0s 21us/step
loss: 0.222367
categorical_accuracy: 0.956200
Start: 2019-06-08 20:47:25.065766
End: 2019-06-08 20:48:12.525942
Elapsed: 0:00:47.460176
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
fig.set_size_inches((5,5))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
display(fig)

What are the big takeaways from this experiment?

  1. We get pretty impressive "apparent" (training-set) accuracy right from the start! This small network gets us to about 97% training accuracy by epoch 20.
  2. The model appears to continue to learn if we let it run, although it does slow down and oscillate a bit.
  3. Our held-out (validation) accuracy reaches about 95% within 5 epochs and never really gets better ... it gets worse!
  4. Therefore, we are overfitting very quickly ... most of the "training" turns out to be a waste. (One common mitigation, early stopping, is sketched just after this list.)
  5. For what it's worth, we get about 95% test accuracy without much work.
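
One common way to avoid those wasted epochs is to stop training as soon as the validation loss stops improving. Here is a hedged sketch (not part of the original notebook; the monitored quantity and patience value are illustrative choices) of how Keras's EarlyStopping callback could be added to the fit call above:

from keras.callbacks import EarlyStopping

# Stop once val_loss has failed to improve for 3 consecutive epochs, and roll back
# to the best weights seen so far (restore_best_weights needs Keras >= 2.2.3).
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history = model.fit(X_train, categorical_labels, epochs=40, batch_size=100,
                    validation_split=0.1, verbose=2, callbacks=[early_stop])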

This is not terrible compared to other, non-neural-network approaches to the problem, and we could probably tweak it a bit and do even better.

But we talked about using deep learning to solve "95%" problems or "98%" problems ... where one error in 20, or one in 50, simply won't work. If we can get to "multiple nines" of accuracy, then we can do things like automate mail sorting and translation, create cars that react properly (all the time) to street signs, and build control systems for robots or drones that function autonomously.

You Try Now!

Try two more experiments (try them separately):

  1. Add a third hidden layer (a sketch of this change follows the list).
  2. Increase the size of the hidden layers.
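
As a starting point for the first experiment, a sketch of the modified topology (not from the original notebook; the extra layer's width of 15 is an arbitrary choice) might look like this:

model = Sequential()
model.add(Dense(20, input_dim=784, kernel_initializer='normal', activation='relu'))
model.add(Dense(15, kernel_initializer='normal', activation='relu'))
model.add(Dense(15, kernel_initializer='normal', activation='relu'))  # the added third hidden layer
model.add(Dense(10, kernel_initializer='normal', activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])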

Adding another layer slows things down a little (why?) but doesn't seem to make a difference in accuracy.

Adding a lot more neurons to the first topology slows things down significantly -- 10x as many neurons buys only a marginal increase in accuracy. Notice also (in the plot) that the learning clearly degrades after epoch 50 or so.
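
For the second experiment, a sketch of the widened topology (again not from the original notebook; "10x as many neurons" is taken literally here, and the exact widths are illustrative) might look like this:

model = Sequential()
model.add(Dense(200, input_dim=784, kernel_initializer='normal', activation='relu'))  # 10x the original 20
model.add(Dense(150, kernel_initializer='normal', activation='relu'))                 # 10x the original 15
model.add(Dense(10, kernel_initializer='normal', activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])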