
Introduction

In a typical machine learning development workflow, there are two main stages where you can benefit from scaling out.

(Figure: parallel experiments vs. distributed training)

  1. Running large-scale parallel experiments: In this scenario, our goal is to find the best model, hyperparameters, or network architecture by exploring a space of possibilities.
  2. Running distributed training of a single model: In this scenario, our goal is to train a single model faster by distributing its computation across the nodes of a cluster.

The focus of this workshop is distributed training of a single model.
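To make the second scenario concrete, here is a minimal sketch of data-parallel training with PyTorch's DistributedDataParallel: each worker holds a replica of the model, trains on its own shard of the data, and gradients are averaged across workers during the backward pass. The model, dataset, and launch settings below are placeholders, not the workshop's labs, which build this pattern on top of SageMaker and Amazon EKS tooling.

```python
# Minimal sketch of distributed data-parallel training with PyTorch DDP.
# Assumes the script is launched with torchrun (e.g. `torchrun --nproc_per_node=4 train.py`),
# which sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    # Toy model and synthetic dataset stand in for a real training job.
    model = torch.nn.Linear(10, 1).to(device)
    model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))

    # DistributedSampler gives each process a distinct shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # gradients are averaged across all processes here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Because every worker processes a different shard of each batch while keeping identical model weights, adding nodes reduces wall-clock time per epoch; the rest of the workshop covers how to run this kind of job on SageMaker and Amazon EKS.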