
Distributed training approaches


Horovod (horovod.ai)

Horovod is based on the MPI concepts: size, rank, local rank, allreduce, allgather, and broadcast.

  • Library for distributed deep learning with support for multiple frameworks, including TensorFlow
  • Separates infrastructure concerns from the ML engineer's training code
  • Uses ring-allreduce and the Message Passing Interface (MPI), which is popular in the HPC community
  • Infrastructure services such as Amazon SageMaker and Amazon EKS provide the container and MPI environment (see the sketch below)
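A minimal sketch (assuming Horovod's TensorFlow binding, horovod.tensorflow, and the TensorFlow 2 API) of how these MPI concepts surface in a training script: initialize Horovod, query size, rank, and local rank, and pin each process to one GPU using its local rank.

import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()                              # start the Horovod/MPI context

print("size:", hvd.size())              # total number of processes across all nodes
print("rank:", hvd.rank())              # global index of this process (0 .. size-1)
print("local rank:", hvd.local_rank())  # index of this process on its own node

# Pin this process to a single GPU, chosen by local rank
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')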

allreduce

  1. Forward pass on each device
  2. Backward pass computes gradients
  3. "Allreduce" (average and broadcast) gradients across devices
  4. Update local variables with the "allreduced" gradients
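A sketch of one training step that follows these four steps, using Horovod's TensorFlow 2 API; the model, optimizer, and input batch are assumed to be defined elsewhere in the script.

import horovod.tensorflow as hvd
import tensorflow as tf

@tf.function
def train_step(model, optimizer, images, labels, first_batch):
    with tf.GradientTape() as tape:          # 1. forward pass on this device
        probs = model(images, training=True)
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(labels, probs))
    # 2-3. backward pass; gradients are allreduced (averaged) across devices
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    # 4. update local variables with the averaged gradients
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Broadcast initial state from rank 0 so every worker starts identically
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(optimizer.variables(), root_rank=0)
    return loss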

Horovod runs the same copy of the training script on all hosts/servers/nodes/instances; each process is distinguished only by its rank.

mpi

horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python training_script.py
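Here -np 16 launches 16 copies of training_script.py in total, and -H server1:4,server2:4,server3:4,server4:4 places four processes on each of the four servers (typically one per GPU); Horovod assigns each process its rank and local rank at startup.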