Distributed Deep Learning

CNNs with Horovod, MLflow and hyperparameter tuning through SparkTrials

William Anzén (LinkedIn), Christian von Koch (LinkedIn)

2021, Stockholm, Sweden

This project was supported by Combient Mix AB under the industrial supervision of Raazesh Sainudiin and Max Fischer.

However, all the neuromusculature credit goes to William and Christian; they absorbed the WASP PhD course over their X-mas holidays.

Caveat: These notebooks were done on a Databricks shard on Azure, as opposed to AWS.

So one has to take some care with Databricks' Terraform pipes when porting the cluster setup to another cloud provider. Loading data should be independent of the underlying cloud provider, as the data is loaded through TensorFlow Datasets, although the following notebooks have not been tested on AWS infrastructure. This work was kindly supported by a total of USD 7,500 in AWS credits through the Databricks University Alliance, which also waived the DBU-units on a professional enterprise-grade shard for the WASP SCadaMaLe/sds-3-x course, open to voluntary research students at any Swedish university who go through the curriculum first. Raazesh Sainudiin is most grateful to Rob Reed for the most admirable administration of the Databricks University Alliance.
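Because the data pipeline goes through TensorFlow Datasets, the load step looks the same on any cloud. Below is a minimal sketch, following the preprocessing in TensorFlow's Image Segmentation tutorial; the image size and batch size here are illustrative, not necessarily what the notebooks use.

```python
# Minimal sketch: loading the Oxford-IIIT Pet dataset through TensorFlow
# Datasets, keeping the data pipeline independent of the cloud provider.
import tensorflow as tf
import tensorflow_datasets as tfds

# Download (or reuse a cached copy of) the dataset together with its metadata.
dataset, info = tfds.load('oxford_iiit_pet:3.*.*', with_info=True)

def normalize(datapoint):
    # Resize to a fixed shape and scale pixel values to [0, 1]; the masks
    # are labelled 1, 2, 3 in this dataset, so shift them to 0, 1, 2.
    image = tf.image.resize(datapoint['image'], (128, 128)) / 255.0
    mask = tf.image.resize(datapoint['segmentation_mask'], (128, 128),
                           method='nearest') - 1
    return image, mask

train = dataset['train'].map(normalize).batch(64).prefetch(tf.data.AUTOTUNE)
```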

**Resources:**

These notebooks were inspired by TensorFlow's tutorial on Image Segmentation.

01a_image_segmentation_unet

In this chapter a simple U-Net architecture is implemented and evaluated against the Oxford-IIIT Pet dataset. The model achieves a validation accuracy of 81.51% and a validation loss of 0.7251 after 38/50 epochs (3.96 min for the full 50 epochs).
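As a concrete reference, here is a minimal sketch of the U-Net idea in Keras: a convolutional encoder, a mirrored decoder, and skip connections between blocks at matching resolutions. The layer widths are illustrative, not the notebook's exact configuration.

```python
# Minimal U-Net sketch: encoder downsamples, decoder upsamples, and skip
# connections carry fine spatial detail across the bottleneck.
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return layers.Conv2D(filters, 3, padding='same', activation='relu')(x)

def unet(input_shape=(128, 128, 3), num_classes=3):
    inputs = layers.Input(input_shape)
    skips, x = [], inputs
    for filters in (64, 128, 256):           # encoder: convolve, remember, pool
        x = conv_block(x, filters)
        skips.append(x)
        x = layers.MaxPooling2D()(x)
    x = conv_block(x, 512)                    # bottleneck
    for filters, skip in zip((256, 128, 64), reversed(skips)):
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding='same')(x)
        x = layers.Concatenate()([x, skip])   # skip connection restores detail
        x = conv_block(x, filters)
    outputs = layers.Conv2D(num_classes, 1, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)

model = unet()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```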

exjobbsOfCombientMix2021_02a_image_segmenation_pspnet

In this chapter a PSPNet architecture is implemented and evaluated against the Oxford-IIIT Pet dataset. The model achieves a validation accuracy of 90.30% and a validation loss of 0.3145 after 42/50 epochs (39.64 min for the full 50 epochs).
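PSPNet's distinguishing piece is its pyramid pooling module, which pools the backbone's feature map at several grid scales, projects each scale to fewer channels, resizes back, and concatenates everything with the original features. A rough sketch, assuming a statically known feature-map size comfortably larger than the bin counts (backbone and prediction head omitted):

```python
# Sketch of PSPNet's pyramid pooling module; bin sizes follow the paper.
import tensorflow as tf
from tensorflow.keras import layers

def pyramid_pooling(features, bin_sizes=(1, 2, 3, 6)):
    h, w = features.shape[1], features.shape[2]
    pooled = [features]
    for bins in bin_sizes:
        # Pool to (roughly) a bins-by-bins grid of context vectors.
        x = layers.AveragePooling2D(pool_size=(h // bins, w // bins))(features)
        x = layers.Conv2D(features.shape[-1] // len(bin_sizes), 1,
                          activation='relu')(x)   # reduce channels per branch
        # Resize each pooled map back to the input resolution before fusing.
        x = layers.Lambda(lambda t: tf.image.resize(t, (h, w)))(x)
        pooled.append(x)
    return layers.Concatenate()(pooled)            # local + global context

features = layers.Input((16, 16, 512))  # e.g. a backbone output at stride 8
context = pyramid_pooling(features)
```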

exjobbsOfCombientMix2021_03a_image_segmenation_icnet

In this chapter the ICNet architecture is implemented and evaluated against the Oxford-IIIT Pet dataset. The model achieves a validation accuracy of 86.64% and a validation loss of 0.3750 after 31/50 epochs (12.50 min for the full 50 epochs).
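ICNet's idea is to spend depth on a heavily downsampled copy of the image and only shallow computation at higher resolutions, fusing the branches coarse-to-fine. The sketch below is a simplified stand-in for that cascade (real ICNet shares a ResNet backbone across branches and adds auxiliary losses), not the notebook's exact model.

```python
# Simplified sketch of ICNet's multi-resolution cascade.
import tensorflow as tf
from tensorflow.keras import layers

def branch(x, filters, depth):
    # A stack of same-resolution convolutions standing in for a real backbone.
    for _ in range(depth):
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return x

def cascade_fusion(coarse, fine, filters):
    # Upsample the coarse branch, align channels, and add the finer branch.
    coarse = layers.UpSampling2D(2, interpolation='bilinear')(coarse)
    coarse = layers.Conv2D(filters, 3, padding='same', dilation_rate=2)(coarse)
    fine = layers.Conv2D(filters, 1)(fine)
    return layers.Activation('relu')(layers.Add()([coarse, fine]))

def icnet(input_shape=(128, 128, 3), num_classes=3):
    inputs = layers.Input(input_shape)
    quarter = layers.AveragePooling2D(4)(inputs)  # 1/4 resolution, deep branch
    half = layers.AveragePooling2D(2)(inputs)     # 1/2 resolution, medium branch
    x = cascade_fusion(branch(quarter, 128, 4), branch(half, 64, 2), 128)
    x = cascade_fusion(x, branch(inputs, 32, 1), 64)
    outputs = layers.Conv2D(num_classes, 1, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)
```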

exjobbsOfCombientMix2021_04a_pspnet_tuning_parallel

In this chapter we run hyperparameter tuning with Hyperopt and SparkTrials, allowing the hyperparameter tuning to be performed in parallel across multiple workers. We achieved a loss of 0.56 with the parameters {'batch_size': 16, 'learning_rate': 0.0001437661898681224} (1.56 hours, 4 workers).
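A minimal sketch of this pattern, assuming a hypothetical train_and_evaluate(params) helper that builds and fits the PSPNet model and returns its validation loss; the search bounds and max_evals are illustrative.

```python
# Parallel hyperparameter search with Hyperopt's SparkTrials: each trial
# is shipped to a Spark worker, so several trials run concurrently.
from hyperopt import fmin, tpe, hp, SparkTrials

search_space = {
    'batch_size': hp.choice('batch_size', [16, 32, 64]),
    # loguniform samples exp(uniform(-10, -4)), i.e. roughly 4.5e-5 to 1.8e-2.
    'learning_rate': hp.loguniform('learning_rate', -10, -4),
}

def objective(params):
    # Hyperopt minimizes the returned value, so return the validation loss.
    return train_and_evaluate(params)  # hypothetical training helper

trials = SparkTrials(parallelism=4)    # 4 concurrent trials on 4 workers
best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
            max_evals=32, trials=trials)
print(best)
```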

exjobbsOfCombientMix2021_05_pspnet_horovod

In this chapter we add Horovod to the notebook, allowing distributed training of the model. We achieved a validation accuracy of 89.87% and a validation loss of 0.2861 after 49/50 epochs (33.93 min, 4 workers).
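For orientation, here is a sketch following Horovod's standard Keras recipe; build_pspnet and train_dataset are assumed placeholders for the builders from the earlier notebooks, and the learning rate is illustrative.

```python
# Horovod-style data-parallel training for a Keras model.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each training process to one GPU.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = build_pspnet()  # hypothetical model builder from the PSPNet notebook
# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across workers at every step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-4 * hvd.size()))
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [
    # Broadcast initial weights so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# train_dataset is assumed to be sharded per worker, e.g. via
# dataset.shard(hvd.size(), hvd.rank()).
model.fit(train_dataset, epochs=50, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

On Databricks, a training function like this would typically be launched across the cluster with sparkdl's HorovodRunner, e.g. HorovodRunner(np=4).run(train_fn).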