In this breakout session, we share best practices for harnessing the power of distributed training with PyTorch to accelerate model development and fully utilize GPU clusters.
The session draws on lessons learned from a diverse range of distributed training runs and ultimately shows how to train a 405B-parameter model using pure PyTorch.
Highlighted best practices include:
- Scaling your training code from a single GPU to multi-node
- Diagnostic techniques for quickly identifying cluster issues and freezes during training
- Sharding large models with PyTorch FSDP (see the sketch after this list)
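To give a flavor of the FSDP and multi-node items, here is a minimal sketch, assuming a toy TinyTransformer model and a torchrun launch; it is not the session's code, and a real 405B-scale run would layer on much more (mixed precision, activation checkpointing, checkpoint sharding, and so on).

```python
# Minimal sketch: shard a model with PyTorch FSDP and launch with torchrun.
# TinyTransformer and its sizes are illustrative placeholders only.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class TinyTransformer(nn.Module):
    """Stand-in model; replace with your real architecture."""

    def __init__(self, dim: int = 1024, layers: int = 4):
        super().__init__()
        self.blocks = nn.Sequential(
            *[nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
              for _ in range(layers)]
        )

    def forward(self, x):
        return self.blocks(x)


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = TinyTransformer().cuda()
    # FSDP shards parameters, gradients, and optimizer state across ranks.
    model = FSDP(model, device_id=local_rank)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        batch = torch.randn(8, 128, 1024, device="cuda")  # fake data
        loss = model(batch).pow(2).mean()                  # dummy objective
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The same script scales from a single GPU to multiple nodes purely through the launcher, e.g. `torchrun --nnodes=2 --nproc_per_node=8 train.py` for two hypothetical 8-GPU nodes.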