Distributed Deep Learning
CNNs with Horovod, MLflow and hyperparameter tuning through SparkTrials
William Anzén (Linkedin), Christian von Koch (Linkedin)
2021, Stockholm, Sweden
This project was supported by Combient Mix AB under the industrial supervision of Raazesh Sainudiin and Max Fischer.
However, all the neuromusculature credit goes to William and Christian; they absorbed the WASP PhD course over their X-mas holidays.
Caveat: These notebooks were run on a Databricks shard on Azure, as opposed to AWS.
Some care is therefore needed with Databricks' Terraform pipes. Loading data should be independent of the underlying cloud provider, as the data is loaded through TensorFlow Datasets; however, the following notebooks have not been tested on the AWS infrastructure that was kindly made available, with a total of USD 7,500 in AWS credits, through The Databricks University Alliance, which also waived the DBU units on a professional enterprise-grade shard for the WASP SCadaMaLe/sds-3-x course, open to voluntary research students at any Swedish university who first go through the curriculum. Raazesh Sainudiin is most grateful to Rob Reed for the most admirable administration of The Databricks University Alliance.
**Resources:**
These notebooks were inspired by TensorFlow's tutorial on Image Segmentation.
01aimagesegmentation_unet
In this chapter a simple U-Net architecture is implemented and evaluated on the Oxford Pets dataset. The model achieves a validation accuracy of 81.51% and a validation loss of 0.7251 after 38 of 50 epochs (3.96 min for the full 50 epochs).
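The U-Net pattern above can be sketched as an encoder that downsamples while saving skip tensors, and a decoder that upsamples and concatenates them back in. This is a minimal illustrative sketch in `tf.keras`; the filter counts, depth and input size are assumptions, not the notebook's exact architecture.

```python
# Minimal U-Net-style sketch in tf.keras (layer sizes are illustrative
# assumptions; the notebook's exact architecture may differ).
import tensorflow as tf
from tensorflow.keras import layers

def build_unet(input_shape=(128, 128, 3), num_classes=3):
    inputs = tf.keras.Input(shape=input_shape)

    # Encoder: downsample while widening channels, keeping skip tensors.
    skips = []
    x = inputs
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        skips.append(x)
        x = layers.MaxPooling2D()(x)

    # Bottleneck.
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)

    # Decoder: upsample, then concatenate the matching skip connection.
    for filters, skip in zip((128, 64, 32), reversed(skips)):
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

    # Per-pixel class logits (the Oxford Pets masks have 3 classes).
    outputs = layers.Conv2D(num_classes, 1, padding="same")(x)
    return tf.keras.Model(inputs, outputs)

model = build_unet()
print(model.output_shape)  # (None, 128, 128, 3): one logit vector per pixel
```

The skip connections are what distinguish U-Net from a plain encoder-decoder: they let the decoder recover the spatial detail lost during pooling.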
exjobbsOfCombientMix202102aimagesegmenationpspnet
In this chapter a PSPNet architecture is implemented and evaluated on the Oxford Pets dataset. The model achieves a validation accuracy of 90.30% and a validation loss of 0.3145 after 42 of 50 epochs (39.64 min for the full 50 epochs).
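PSPNet's defining component is its pyramid pooling module, which pools the backbone's feature map at several grid resolutions to capture context at multiple scales. Below is a hedged sketch of that module in `tf.keras`; the bin sizes follow the original PSPNet paper, but the channel counts and feature-map size are illustrative assumptions.

```python
# Sketch of PSPNet's pyramid pooling module (bin sizes from the PSPNet
# paper; channel counts and input size are illustrative assumptions).
import tensorflow as tf
from tensorflow.keras import layers

def pyramid_pooling(feature_map, bin_sizes=(1, 2, 3, 6)):
    """Pool the feature map at several grid resolutions, project each
    pooled map to fewer channels, upsample back to the input resolution,
    and concatenate everything with the original feature map."""
    h, w = feature_map.shape[1], feature_map.shape[2]
    pooled = [feature_map]
    for bins in bin_sizes:
        x = layers.AveragePooling2D(pool_size=(h // bins, w // bins))(feature_map)
        x = layers.Conv2D(64, 1, activation="relu")(x)
        x = layers.UpSampling2D(size=(h // bins, w // bins),
                                interpolation="bilinear")(x)
        pooled.append(x)
    return layers.Concatenate()(pooled)

# Example: a 24x24x256 backbone feature map (24 is divisible by all bins).
inputs = tf.keras.Input(shape=(24, 24, 256))
outputs = pyramid_pooling(inputs)
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)  # (None, 24, 24, 512): 256 input + 4 * 64 pooled
```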
exjobbsOfCombientMix202103aimagesegmenationicnet
In this chapter the ICNet architecture is implemented and evaluated on the Oxford Pets dataset. The model achieves a validation accuracy of 86.64% and a validation loss of 0.3750 after 31 of 50 epochs (12.50 min for the full 50 epochs).
exjobbsOfCombientMix202104apspnettuningparallel
In this chapter we run hyperparameter tuning with hyperopt and SparkTrials, allowing the tuning trials to be evaluated in parallel across multiple workers. Achieved a loss of 0.56 with parameters `{'batchsize': 16, 'learningrate': 0.0001437661898681224}` (1.56 hours, 4 workers).
exjobbsOfCombientMix202105pspnet_horovod
In this chapter we add Horovod to the notebook, allowing distributed training of the model. Achieved a validation accuracy of 89.87% and a validation loss of 0.2861 after 49 of 50 epochs (33.93 min, 4 workers).