Problem statement :: Distributed training with Amazon SageMaker / Amazon EKS Workshop

Problem statement

Convert a training script written for a single CPU or GPU into one that supports multi-node distributed training.

Frameworks: This workshop currently uses TensorFlow 1.14, Keras and Horovod 0.18.
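As a rough illustration of what such a conversion usually involves with these frameworks, here is a minimal sketch written against the TensorFlow 1.x session API and Horovod's Keras bindings. It is not the workshop's own script: the helper names (scale_learning_rate, init_horovod_session, compile_for_horovod), the Adam optimizer, the base learning rate of 0.001, the loss function, and the linear learning-rate scaling are all assumptions made for this example.

```python
def scale_learning_rate(base_lr, num_workers):
    """Linear scaling heuristic (an assumption here): base rate times worker count."""
    return base_lr * num_workers


def init_horovod_session():
    """Start Horovod and pin this process to one GPU. Call before building the model."""
    # Imported lazily so this file still loads where the frameworks are not installed.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one Horovod process per GPU

    # Make only the GPU matching this process's local rank visible to TensorFlow.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    tf.keras.backend.set_session(tf.Session(config=config))
    return hvd


def compile_for_horovod(model, hvd, base_lr=0.001):
    """Compile a Keras model so that gradients are averaged across all workers."""
    import tensorflow as tf

    # Assumed optimizer and loss; substitute whatever the single-device script used.
    optimizer = tf.keras.optimizers.Adam(lr=scale_learning_rate(base_lr, hvd.size()))
    optimizer = hvd.DistributedOptimizer(optimizer)
    model.compile(loss="categorical_crossentropy",
                  optimizer=optimizer,
                  metrics=["accuracy"])

    # Broadcast rank 0's initial weights so every worker starts from the same state.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    return model, callbacks
```

In a script following this sketch, init_horovod_session() would run first, the model would be built as before, and the callbacks returned by compile_for_horovod() would be passed to model.fit(). Side effects such as checkpointing and logging are typically restricted to the worker where hvd.rank() is 0.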

Dataset: The CIFAR-10 dataset consists of 60,000 32x32 images belonging to 10 different classes (6,000 images per class).
In this workshop, the CIFAR-10 dataset is split into:

  • 40,000 images for training
  • 10,000 images for validation
  • 10,000 images for test
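The split above can be sketched as a simple partition of image indices. This section does not show how the workshop actually produces its split, so the function name split_indices, the fixed seed, and the shuffle-then-slice approach are assumptions for illustration only.

```python
import random

TOTAL_IMAGES = 60_000
SPLIT_SIZES = {"train": 40_000, "validation": 10_000, "test": 10_000}


def split_indices(total, sizes, seed=0):
    """Shuffle indices 0..total-1 and cut them into named, non-overlapping splits."""
    if sum(sizes.values()) != total:
        raise ValueError("split sizes must add up to the dataset size")
    indices = list(range(total))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    splits, start = {}, 0
    for name, size in sizes.items():
        splits[name] = indices[start:start + size]
        start += size
    return splits


splits = split_indices(TOTAL_IMAGES, SPLIT_SIZES)
for name, idx in splits.items():
    print(name, len(idx))
```

Because the indices are shuffled once and then sliced, every image lands in exactly one of the three subsets.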

Here are the classes in the dataset, as well as 10 random images from each:

[Image: cifar10]

Note: Although the dataset is small and the problem is relatively simple, all the steps we’ll take can easily be applied to large datasets that don’t fit in memory. Amazon SageMaker has native Pipe mode support to stream the dataset directly from S3 to the training instances. With Amazon EKS, we’ll set up an Amazon FSx for Lustre file system that’s accessible to every worker.