0y_mnist-pytorch

ScaDaMaLe Course site and book

The following is from the Databricks blog, with minor adaptations made with help from Tilo Wiklund.

Distributed deep learning training using PyTorch with HorovodRunner for MNIST

This notebook demonstrates how to train a model for the MNIST dataset using PyTorch. It first shows how to train the model on a single node, and then shows how to adapt the code using HorovodRunner for distributed training.

Requirements

  • This notebook runs on CPU or GPU clusters.
  • To run the notebook, create a cluster with
    • Two workers

Cluster Specs on databricks

Run on tiny-debug-cluster-(no)gpu or on another cluster with the following runtime specifications for CPU/non-GPU and GPU clusters, respectively:

  • Runs on non-GPU cluster with 3 (or more) nodes on 7.4 ML runtime (nodes are 1+2 x m4.xlarge)
  • Runs on GPU cluster with 3 (or more) nodes on 7.4 ML GPU runtime (nodes are 1+2 x g4dn.xlarge)

You do not need to install anything else on Databricks, as everything needed is pre-installed in these runtime environments on the respective nodes.

Set up checkpoint location

The next cell defines the base directory for saved checkpoint models (the timestamped run directory itself is created further below). Databricks recommends saving training data under dbfs:/ml, which maps to file:/dbfs/ml on driver and worker nodes.

PYTORCH_DIR = '/dbfs/ml/horovod_pytorch'
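
If you want the base directory to exist up front (the timestamped run directory is created further below), here is a minimal sketch using the standard library, assuming the dbfs:/ml to /dbfs/ml mapping described above:

import os

# Create the base checkpoint directory if it does not already exist.
# /dbfs/ml is the local FUSE view of dbfs:/ml on driver and worker nodes.
os.makedirs(PYTORCH_DIR, exist_ok=True)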

Prepare single-node code

First you need to have working single-node PyTorch code. This is modified from Horovod's PyTorch MNIST Example.

Define a simple convolutional network

import torch
import torch.nn as nn
import torch.nn.functional as F
 
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)
 
    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
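
As a quick shape check (not part of the original notebook): the two 5x5 convolutions and 2x2 max-pools reduce a 28x28 MNIST image to 20 feature maps of size 4x4, which is the 320-dimensional input expected by fc1. A minimal sketch:

# Hypothetical smoke test: push a dummy batch of 64 single-channel 28x28 images through the network.
dummy_batch = torch.randn(64, 1, 28, 28)
output = Net()(dummy_batch)
print(output.shape)  # torch.Size([64, 10]) -- one log-probability per digit class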

Configure single-node training

# Specify training parameters
batch_size = 100
num_epochs = 5
momentum = 0.5
log_interval = 100
def train_one_epoch(model, device, data_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(data_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(data_loader) * len(data),
                100. * batch_idx / len(data_loader), loss.item()))
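
To make the pieces above concrete, here is a minimal single-node driver sketch, assuming a standard torchvision MNIST pipeline; the learning rate value and the normalization constants are illustrative assumptions, not prescribed by this notebook:

from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import torch.optim as optim

def train(learning_rate):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # Standard MNIST training set with the usual normalization constants (assumed here).
    train_dataset = datasets.MNIST(
        'data', train=True, download=True,
        transform=transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))
        ]))
    data_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)
    for epoch in range(1, num_epochs + 1):
        train_one_epoch(model, device, data_loader, optimizer, epoch)

train(learning_rate=0.01)  # 0.01 is an illustrative choice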

Prepare log directory

from time import time
import os
 
LOG_DIR = os.path.join(PYTORCH_DIR, str(time()), 'MNISTDemo')
os.makedirs(LOG_DIR)

Create method for checkpointing and persisting model

def save_checkpoint(model, optimizer, epoch):
    filepath = LOG_DIR + '/checkpoint-{epoch}.pth.tar'.format(epoch=epoch)
    state = {
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
    }
    torch.save(state, filepath)
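
For later use, such as resuming training or evaluating a particular epoch, here is a minimal companion sketch, assuming the same checkpoint layout written above; the function name load_checkpoint is illustrative:

def load_checkpoint(model, optimizer, epoch):
    # Restore model and optimizer state from a checkpoint written by save_checkpoint.
    filepath = LOG_DIR + '/checkpoint-{epoch}.pth.tar'.format(epoch=epoch)
    checkpoint = torch.load(filepath)
    model.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    return model, optimizer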