0x_mnist-tensorflow-keras(Python)

ScaDaMaLe Course site and book

The following is from the Databricks blog, with minor adaptations made with help from Tilo Wiklund.

Distributed deep learning training using TensorFlow and Keras with HorovodRunner

This notebook demonstrates how to train a model for the MNIST dataset using the tensorflow.keras API. It first shows how to train the model on a single node, and then shows how to adapt the code using HorovodRunner for distributed training.

Requirements

  • This notebook runs on CPU or GPU clusters.
  • To run the notebook, create a cluster with
    • Two workers
    • Databricks Runtime 6.3 ML or above

Cluster Specs on databricks

Run on tiny-debug-cluster-(no)gpu or another cluster with one of the following runtime specifications (for CPU/non-GPU and GPU clusters, respectively):

  • Runs on non-GPU cluster with 3 (or more) nodes on 7.4 ML runtime (nodes are 1+2 x m4.xlarge)
  • Runs on GPU cluster with 3 (or more) nodes on 7.4 ML GPU runtime (nodes are 1+2 x g4dn.xlarge)

You do not need to install anything else in Databricks: everything needed is pre-installed in the runtime environment on the respective nodes.

Set up checkpoint location

The next cell creates a directory for saved checkpoint models.

import os
import time

# Create a unique checkpoint directory under DBFS, keyed by the current time.
checkpoint_dir = '/dbfs/ml/MNISTDemo/train/{}/'.format(time.time())

os.makedirs(checkpoint_dir)
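
Later cells use this directory to save model checkpoints during training. As a minimal sketch of how such a checkpoint callback might look (the 'checkpoint-{epoch}.ckpt' filename pattern here is a hypothetical choice, not from the original notebook):

from tensorflow import keras

# Illustrative only: save model weights into checkpoint_dir after each epoch.
checkpoint_callback = keras.callbacks.ModelCheckpoint(
    checkpoint_dir + 'checkpoint-{epoch}.ckpt',
    save_weights_only=True)

# Passing `callbacks=[checkpoint_callback]` to `model.fit(...)` enables it.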

Create function to prepare data

The following cell creates a function that prepares the data for training. This function takes rank and size arguments so it can be used for both single-node and distributed training. In Horovod, rank is a unique process ID and size is the total number of processes.

This function downloads the data from keras.datasets, shards it across the available processes with rank::size slicing (illustrated after the next cell), and converts the data to the shapes and types needed for training.

def get_dataset(num_classes, rank=0, size=1):
  from tensorflow import keras
  
  # Download MNIST; the rank-specific path avoids clashes when several
  # processes download the data concurrently.
  (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data('MNIST-data-%d' % rank)
  # Shard the data: each process takes every `size`-th sample, offset by `rank`.
  x_train = x_train[rank::size]
  y_train = y_train[rank::size]
  x_test = x_test[rank::size]
  y_test = y_test[rank::size]
  # Add a channel dimension and scale pixel values to [0, 1].
  x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
  x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
  x_train = x_train.astype('float32')
  x_test = x_test.astype('float32')
  x_train /= 255
  x_test /= 255
  # One-hot encode the labels.
  y_train = keras.utils.to_categorical(y_train, num_classes)
  y_test = keras.utils.to_categorical(y_test, num_classes)
  return (x_train, y_train), (x_test, y_test)
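
To make the rank::size sharding concrete, here is a small illustration with toy values (not part of the original notebook):

# Toy illustration of the `x[rank::size]` sharding used above.
samples = list(range(10))   # stand-in for the training examples
size = 2                    # pretend there are two Horovod processes
for rank in range(size):
    print(rank, samples[rank::size])
# rank 0 -> [0, 2, 4, 6, 8]
# rank 1 -> [1, 3, 5, 7, 9]

Each process sees a disjoint, roughly equal slice of the data, so together the processes cover the full dataset exactly once.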

Create function to train model

The following cell defines the model using the tensorflow.keras API. This code is adapted from the Keras MNIST convnet example. The model consists of two convolutional layers, a max-pooling layer, two dropout layers, and two dense layers.

def get_model(num_classes):
  from tensorflow.keras import models
  from tensorflow.keras import layers
  
  model = models.Sequential()
  model.add(layers.Conv2D(32, kernel_size=(3, 3),
                   activation='relu',
                   input_shape=(28, 28, 1)))
  model.add(layers.Conv2D(64, (3, 3), activation='relu'))
  model.add(layers.MaxPooling2D(pool_size=(2, 2)))
  model.add(layers.Dropout(0.25))
  model.add(layers.Flatten())
  model.add(layers.Dense(128, activation='relu'))
  model.add(layers.Dropout(0.5))
  model.add(layers.Dense(num_classes, activation='softmax'))
  return model
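
As a quick sanity check (not in the original notebook), you can instantiate the model and print its layer stack and parameter counts:

# Build the model for the 10 MNIST digit classes and inspect its architecture.
model = get_model(num_classes=10)
model.summary()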

Run training on single node

At this point, you have created functions to load and preprocess the dataset and to create the model.
This section illustrates single-node training code using tensorflow.keras.

# Specify training parameters
batch_size = 128
epochs = 5
num_classes = 10        
 
 
def train(learning_rate=1.0):
  from tensorflow import keras
  
  (x_train, y_train), (x_test, y_test) = get_dataset(num_classes)
  model = get_model(num_classes)
 
  # Specify the optimizer (Adadelta in this example). The learning rate is a
  # function parameter so that Horovod can adjust it for distributed training.
  optimizer = keras.optimizers.Adadelta(learning_rate=learning_rate)
 
  model.compile(optimizer=optimizer,
                loss='categorical_crossentropy',
                metrics=['accuracy'])
 
  model.fit(x_train, y_train,
            batch_size=batch_size,
            epochs=epochs,
            verbose=2,
            validation_data=(x_test, y_test))

Run the train function you just created to train a model on the driver node. The process takes several minutes on a CPU cluster (see the timing comments in the next cell). The accuracy improves with each epoch.

# Runs in  23.67 seconds on 3-node     GPU
# Runs in 418.8  seconds on 3-node non-GPU
train(learning_rate=0.1)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 1s 0us/step
Epoch 1/5
469/469 - 3s - loss: 0.6257 - accuracy: 0.8091 - val_loss: 0.2157 - val_accuracy: 0.9345
Epoch 2/5
469/469 - 3s - loss: 0.2950 - accuracy: 0.9127 - val_loss: 0.1450 - val_accuracy: 0.9569
Epoch 3/5
469/469 - 3s - loss: 0.2145 - accuracy: 0.9373 - val_loss: 0.1035 - val_accuracy: 0.9695
Epoch 4/5
469/469 - 3s - loss: 0.1688 - accuracy: 0.9512 - val_loss: 0.0856 - val_accuracy: 0.9738
Epoch 5/5
469/469 - 3s - loss: 0.1379 - accuracy: 0.9600 - val_loss: 0.0701 - val_accuracy: 0.9788

Migrate to HorovodRunner for distributed training

This section shows how to modify the single-node code to use Horovod. For more information about Horovod, see the Horovod documentation.
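
The adapted code follows the standard Horovod Keras recipe. A minimal sketch, assuming the get_dataset and get_model functions and the batch_size, epochs, and num_classes globals defined above, and using the horovod.tensorflow.keras API:

def train_hvd(learning_rate=1.0):
  # Sketch of the Horovod adaptation of train(), not the notebook's verbatim code.
  import horovod.tensorflow.keras as hvd
  from tensorflow import keras

  hvd.init()  # initialize Horovod; each process learns its rank and the world size

  # Shard the data by this process's rank.
  (x_train, y_train), (x_test, y_test) = get_dataset(num_classes, hvd.rank(), hvd.size())
  model = get_model(num_classes)

  # Scale the learning rate by the number of processes and wrap the optimizer
  # so that gradients are averaged across all processes each step.
  optimizer = keras.optimizers.Adadelta(learning_rate=learning_rate * hvd.size())
  optimizer = hvd.DistributedOptimizer(optimizer)

  model.compile(optimizer=optimizer,
                loss='categorical_crossentropy',
                metrics=['accuracy'])

  # Broadcast the initial variables from rank 0 so all processes start in sync.
  callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

  model.fit(x_train, y_train,
            batch_size=batch_size,
            callbacks=callbacks,
            epochs=epochs,
            verbose=2,
            validation_data=(x_test, y_test))

HorovodRunner then launches such a function on the cluster, with np setting the number of parallel processes:

from sparkdl import HorovodRunner

# np=2: run the training function on two worker processes.
hr = HorovodRunner(np=2)
hr.run(train_hvd, learning_rate=0.1)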