055_DLbyABr_04-ConvolutionalNetworks(Python)

ScaDaMaLe Course site and book

This is a 2019-2021 augmentation and update of Adam Breindel's initial notebooks.

Thanks to Christian von Koch and William Anzén for their contributions towards making these materials Spark 3.0.1 and Python 3+ compliant.

Convolutional Neural Networks

aka CNN, ConvNet

As a baseline, let's start a lab running with what we already know.

We'll take our deep feed-forward multilayer perceptron network, with ReLU activations and reasonable initializations, and apply it to learning the MNIST digits.

The main part of the code looks like the following (full code you can run is in the next cell):

# imports, setup, load data sets

model = Sequential()
model.add(Dense(20, input_dim=784, kernel_initializer='normal', activation='relu'))
model.add(Dense(15, kernel_initializer='normal', activation='relu'))
model.add(Dense(10, kernel_initializer='normal', activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

categorical_labels = to_categorical(y_train, num_classes=10)

history = model.fit(X_train, categorical_labels, epochs=40, batch_size=100)

# print metrics, plot errors

Note the changes, which are largely about building a classifier instead of a regression model:

  • Output layer has one neuron per category, with softmax activation
  • Loss function is cross-entropy loss
  • Accuracy metric is categorical accuracy

Let's keep pointers into Wikipedia handy for these new concepts.

The following is from: https://www.quora.com/How-does-Keras-calculate-accuracy.

Categorical accuracy:

def categorical_accuracy(y_true, y_pred):
    return K.cast(K.equal(K.argmax(y_true, axis=-1),
                          K.argmax(y_pred, axis=-1)),
                  K.floatx())

K.argmax takes the index of the highest value in each row -- the true class for y_true and the predicted class for y_pred -- and the two are compared element-wise.
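
For concreteness, here's the same computation in plain NumPy -- a sketch of the idea, not Keras's actual implementation. Here y_true is a batch of one-hot labels and y_pred a batch of softmax outputs:

import numpy as np

def categorical_accuracy_np(y_true, y_pred):
    # argmax over the class axis recovers the true / predicted class index;
    # the mean of the element-wise matches is the categorical accuracy
    return np.mean(np.argmax(y_true, axis=-1) == np.argmax(y_pred, axis=-1))

# three samples, three classes: two correct predictions, one wrong
y_true = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0]])
y_pred = np.array([[0.1, 0.2, 0.7],
                   [0.3, 0.4, 0.3],
                   [0.2, 0.5, 0.3]])
print(categorical_accuracy_np(y_true, y_pred))  # 0.666...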

Watch (1:39)

  • Udacity: Deep Learning by Vincent Vanhoucke - Cross-entropy

Watch (1:54)

  • Udacity: Deep Learning by Vincent Vanhoucke - Minimizing Cross-entropy
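
Recall the definition: for a true distribution p and predicted distribution q, the cross-entropy is $H(p, q) = -\sum_i p_i \log q_i$. With one-hot labels this reduces to minus the log of the probability the model assigns to the true class. A minimal NumPy sketch (Keras's own implementation differs in details such as how it clips values):

import numpy as np

def categorical_crossentropy_np(y_true, y_pred, eps=1e-7):
    # clip to avoid log(0), then average -sum(p * log q) over the batch
    y_pred = np.clip(y_pred, eps, 1.0)
    return np.mean(-np.sum(y_true * np.log(y_pred), axis=-1))

# one-hot label: the loss is just -log(probability of the true class)
y_true = np.array([[0.0, 0.0, 1.0]])
y_pred = np.array([[0.1, 0.2, 0.7]])
print(categorical_crossentropy_np(y_true, y_pred))  # -log(0.7) ≈ 0.357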
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
import sklearn.datasets
import datetime
import matplotlib.pyplot as plt
import numpy as np
 
train_libsvm = "/dbfs/databricks-datasets/mnist-digits/data-001/mnist-digits-train.txt"
test_libsvm = "/dbfs/databricks-datasets/mnist-digits/data-001/mnist-digits-test.txt"
 
X_train, y_train = sklearn.datasets.load_svmlight_file(train_libsvm, n_features=784)
X_train = X_train.toarray()
 
X_test, y_test = sklearn.datasets.load_svmlight_file(test_libsvm, n_features=784)
X_test = X_test.toarray()
 
model = Sequential()
model.add(Dense(20, input_dim=784, kernel_initializer='normal', activation='relu'))
model.add(Dense(15, kernel_initializer='normal', activation='relu'))
model.add(Dense(10, kernel_initializer='normal', activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
 
categorical_labels = to_categorical(y_train, num_classes=10)
start = datetime.datetime.today()
 
history = model.fit(X_train, categorical_labels, epochs=40, batch_size=100, validation_split=0.1, verbose=2)
 
scores = model.evaluate(X_test, to_categorical(y_test, num_classes=10))
 
print()
for i in range(len(model.metrics_names)):
    print("%s: %f" % (model.metrics_names[i], scores[i]))
 
print("Start: " + str(start))
end = datetime.datetime.today()
print("End: " + str(end))
print("Elapse: " + str(end - start))
Train on 54000 samples, validate on 6000 samples
Epoch 1/40 - 2s - loss: 0.5150 - categorical_accuracy: 0.8445 - val_loss: 0.2426 - val_categorical_accuracy: 0.9282
Epoch 2/40 - 2s - loss: 0.2352 - categorical_accuracy: 0.9317 - val_loss: 0.1763 - val_categorical_accuracy: 0.9487
Epoch 3/40 - 2s - loss: 0.1860 - categorical_accuracy: 0.9454 - val_loss: 0.1526 - val_categorical_accuracy: 0.9605
...
Epoch 39/40 - 2s - loss: 0.0629 - categorical_accuracy: 0.9805 - val_loss: 0.2177 - val_categorical_accuracy: 0.9582
Epoch 40/40 - 2s - loss: 0.0559 - categorical_accuracy: 0.9828 - val_loss: 0.1875 - val_categorical_accuracy: 0.9642
10000/10000 [==============================] - 0s 30us/step
loss: 0.227984
categorical_accuracy: 0.954900
Start: 2021-02-10 10:20:06.310772
End: 2021-02-10 10:21:10.141213
Elapse: 0:01:03.830441

After about a minute we have:

...

Epoch 40/40
1s - loss: 0.0610 - categorical_accuracy: 0.9809 - val_loss: 0.1918 - val_categorical_accuracy: 0.9583

...

loss: 0.216120

categorical_accuracy: 0.955000

Start: 2017-12-06 07:35:33.948102

End: 2017-12-06 07:36:27.046130

Elapse: 0:00:53.098028
import matplotlib.pyplot as plt
 
fig, ax = plt.subplots()
fig.set_size_inches((5,5))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
display(fig)

What are the big takeaways from this experiment?

  1. We get pretty impressive "apparent error" accuracy right from the start! A small network gets us to a training accuracy of 97% by epoch 20.
  2. The model appears to continue to learn if we let it run, although it does slow down and oscillate a bit.
  3. Our test accuracy is about 95% after 5 epochs and never gets better ... it gets worse!
  4. Therefore, we are overfitting very quickly... most of the "training" turns out to be a waste.
  5. For what it's worth, we get 95% accuracy without much work.

This is not terrible compared to other, non-neural-network approaches to the problem. After all, we could probably tweak this a bit and do even better.

But we talked about using deep learning to solve "95%" problems or "98%" problems ... where one error in 20, or one in 50, simply won't work. If we can get to "multiple nines" of accuracy, then we can do things like automate mail sorting and translation, create cars that react properly (all the time) to street signs, and control systems for robots or drones that function autonomously.

Try two more experiments (try them separately):

  1. Add a third, hidden layer.
  2. Increase the size of the hidden layers.

Adding another layer slows things down a little (why?) but doesn't seem to make a difference in accuracy.

Adding a lot more neurons into the first topology slows things down significantly -- 10x as many neurons, and only a marginal increase in accuracy. Notice also (in the plot) that the learning clearly degrades after epoch 50 or so.

... We need a new approach!


... let's think about this:

What is layer 2 learning from layer 1? Combinations of pixels

Combinations of pixels contain information but...

There are a lot of them (combinations) and they are "fragile"

In fact, in our last experiment, we basically built a model that memorizes a bunch of "magic" pixel combinations.

What might be a better way to build features?

  • When humans perform this task, we look not at arbitrary pixel combinations, but certain geometric patterns -- lines, curves, loops.
  • These features are made up of combinations of pixels, but they are far from arbitrary
  • We identify these features regardless of translation, rotation, etc.

Is there a way to get the network to do the same thing?

I.e., in layer one, identify pixels. Then in layer 2+, identify abstractions over pixels that are translation-invariant 2-D shapes?

We could look at where a "filter" that represents one of these features (e.g., an edge) matches the image.

How would this work?

Convolution

Convolution in the general mathematical sense is defined as follows:

$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) \, g(t - \tau) \, d\tau$

The convolution we deal with in deep learning is a simplified case. We want to compare two signals. Here are two visualizations, courtesy of Wikipedia, that help communicate how convolution emphasizes features:


Here's an animation (where we change τ)

In one sense, the convolution captures and quantifies pattern matching over space.

If we perform this in two dimensions, we can achieve effects like highlighting edges:

The matrix here, also called a convolution kernel, is one of the functions we are convolving. Other convolution kernels can blur, "sharpen," etc.
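
As a small illustration, here's a classic hand-picked edge-detection kernel applied with scipy.signal.convolve2d (the kernel values are a standard textbook example, not learned):

import numpy as np
from scipy.signal import convolve2d

# a Laplacian-style edge-detection kernel; other kernels blur or sharpen
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

# a toy 6x6 "image": a bright square on a dark background
image = np.zeros((6, 6))
image[2:4, 2:4] = 1.0

# mode='same' keeps the output the same size as the input
edges = convolve2d(image, edge_kernel, mode='same', boundary='fill')
print(edges)  # strong response around the square, zero in the flat background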

So we'll drop in a number of convolution kernels, and the network will learn where to use them? Nope. Better than that.

We'll program in the idea of discrete convolution, and the network will learn what kernels extract meaningful features!

The values in a (fixed-size) convolution kernel matrix will be variables in our deep learning model. Although intuitively it seems like it would be hard to learn useful parameters, in fact, since those variables are used repeatedly across the image data, it "focuses" the error on a smallish number of parameters with a lot of influence -- so it should be vastly less expensive to train than just a huge fully connected layer like we discussed above.
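
A quick back-of-the-envelope comparison, using the layer sizes we'll build below (8 kernels of 4x4 versus a dense layer of 128 units over a 28x28 input):

# fully connected: every pixel connects to every hidden unit
fc_params = 28 * 28 * 128 + 128    # weights + biases = 100,480

# convolutional: 8 kernels of 4x4 over a single input channel
conv_params = 8 * (4 * 4 * 1) + 8  # weights + biases = 136

print(fc_params, conv_params)      # roughly 700x fewer parameters to learn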

This idea was developed in the late 1980s, and by 1989, Yann LeCun (at AT&T/Bell Labs) had built a practical high-accuracy system (used in the 1990s for processing handwritten checks and mail).

How do we hook this into our neural networks?

  • First, we can preserve the geometric properties of our data by "shaping" the vectors as 2D instead of 1D.

  • Then we'll create a layer whose value is not just activation applied to weighted sum of inputs, but instead it's the result of a dot-product (element-wise multiply and sum) between the kernel and a patch of the input vector (image).

    • This value will be our "pre-activation" and optionally feed into an activation function (or "detector")

If we perform this operation at lots of positions over the image, we'll get lots of outputs, as many as one for every input pixel.
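
Here's a from-scratch sketch of that patch-by-patch dot product, using "valid" positions only (the kernel must fit entirely inside the image, so the output is slightly smaller than the input):

import numpy as np

def conv2d_valid(image, kernel):
    # slide the kernel over every position where it fits entirely;
    # each output value is the dot product of the kernel with one patch
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)  # "pre-activation"
    return out

image = np.random.rand(28, 28)
kernel = np.random.rand(4, 4)
pre_activation = conv2d_valid(image, kernel)
print(pre_activation.shape)               # (25, 25): 28 - 4 + 1 per dimension
detector = np.maximum(pre_activation, 0)  # optional ReLU "detector" stage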

  • So we'll add another layer that "picks" the highest convolution pattern match from nearby pixels, which
    • makes our pattern match a little bit translation invariant (a fuzzy location match)
    • reduces the number of outputs significantly
  • This layer is commonly called a pooling layer, and if we pick the "maximum match" then it's a "max pooling" layer.

The end result is that the kernel or filter together with max pooling creates a value in a subsequent layer which represents the appearance of a pattern in a local area in a prior layer.
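
A minimal sketch of 2x2 max pooling over a feature map:

import numpy as np

def max_pool_2x2(feature_map):
    # keep only the strongest match in each non-overlapping 2x2 block
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2  # drop an odd trailing row/column
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.arange(16).reshape(4, 4)
print(max_pool_2x2(fm))
# [[ 5  7]
#  [13 15]]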

Again, the network will be given a number of "slots" for these filters and will learn (by minimizing error) what filter values produce meaningful features. This is the key insight into how modern image-recognition networks are able to generalize -- i.e., learn to tell 6s from 7s or cats from dogs.

Ok, let's build our first ConvNet:

First, we want to explicitly shape our data into a 2-D configuration. We'll end up with a 4-D tensor where the first dimension indexes the training examples, each example is 28x28 pixels, and we'll explicitly say the images are 1 channel deep. (Why? With color images, we typically process 3 or 4 channels in this last dimension.)

import sklearn.datasets  # already imported in the earlier cell; repeated so this cell stands alone
from keras.utils import to_categorical
 
train_libsvm = "/dbfs/databricks-datasets/mnist-digits/data-001/mnist-digits-train.txt"
test_libsvm = "/dbfs/databricks-datasets/mnist-digits/data-001/mnist-digits-test.txt"
 
X_train, y_train = sklearn.datasets.load_svmlight_file(train_libsvm, n_features=784)
X_train = X_train.toarray()
 
X_test, y_test = sklearn.datasets.load_svmlight_file(test_libsvm, n_features=784)
X_test = X_test.toarray()
 
X_train = X_train.reshape( (X_train.shape[0], 28, 28, 1) )
X_train = X_train.astype('float32')
X_train /= 255
y_train = to_categorical(y_train, num_classes=10)
 
X_test = X_test.reshape( (X_test.shape[0], 28, 28, 1) )
X_test = X_test.astype('float32')
X_test /= 255
y_test = to_categorical(y_test, num_classes=10)

Now the model:

from keras.models import Sequential  # already imported above; repeated so this cell stands alone
from keras.layers import Dense, Dropout, Activation, Flatten, Conv2D, MaxPooling2D
 
model = Sequential()
 
model.add(Conv2D(8, # number of kernels 
                (4, 4), # kernel size
                padding='valid', # no padding; output will be smaller than input
                input_shape=(28, 28, 1)))
 
model.add(Activation('relu'))
 
model.add(MaxPooling2D(pool_size=(2,2)))
 
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu')) # alternative syntax for applying activation
 
model.add(Dense(10))
model.add(Activation('softmax'))
 
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
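
Tracing the shapes by hand: the "valid" convolution shrinks 28x28 to 25x25 (28 - 4 + 1) with 8 channels; 2x2 max pooling takes that down to 12x12x8; flattening gives 12 * 12 * 8 = 1152 inputs to the first dense layer. You can confirm this with model.summary():

model.summary()
# expected output shapes (and parameter counts):
#   conv2d         (None, 25, 25, 8)   136     = 8 * (4*4*1) + 8
#   max_pooling2d  (None, 12, 12, 8)   0
#   flatten        (None, 1152)        0
#   dense          (None, 128)         147584  = 1152 * 128 + 128
#   dense          (None, 10)          1290    = 128 * 10 + 10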