Recent advances in deep learning have led to substantial gains in various fields. As datasets and models grow, it is common practice to speed up training with data parallelism (DP). However, in our practice with EasyTransfer (a modeling framework designed for NLP developers to implement algorithms with usability and flexibility), DP is no longer effective for giant models that cannot fit into the memory of a single GPU. Moreover, for different model architectures, it is nontrivial to find an efficient parallel strategy that makes full use of the available resources.
To address the above challenges, we present Whale, a unified distributed training framework that boosts AI training tasks with both usability and efficiency. It provides comprehensive parallel strategies, including data parallelism, model parallelism, operator splitting, pipeline parallelism, hybrid strategies, and automatic strategy generation. To the best of our knowledge, this is the first work that supports such a variety of distributed strategies within a single framework. To express different training strategies effectively, we design a new intermediate representation of models that captures distributed paradigms and execution cost. The automatic parallel strategy is generated from this cost model. Parallel execution is realized by editing the computational graph, an approach that can be applied to different training frameworks. Whale can distribute a training task by adding a few lines of code, without changing the user's model code.
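To make the "few lines of code" claim concrete, the following is a minimal sketch of how scope-style annotations could combine data parallelism with operator splitting around unchanged model code. The names whale.init, whale.replica, and whale.split are assumptions for illustration only and are not defined in this abstract; the TensorFlow calls themselves are standard.

```python
# Minimal sketch of scope-style parallelism annotations (hypothetical API).
# whale.init / whale.replica / whale.split are assumed names for illustration.
import tensorflow as tf
import whale  # assumed package name

whale.init()  # assumed: initialize the distributed context

features = tf.random.normal([32, 512])                               # toy input batch
labels = tf.random.uniform([32], maxval=100_000, dtype=tf.int32)     # toy labels

# The user's model code is unchanged; only scope annotations are added.
with whale.replica():  # assumed: replicate this part across workers (data parallelism)
    hidden = tf.keras.layers.Dense(1024, activation="relu")(features)

with whale.split():    # assumed: split this part across devices (operator splitting)
    logits = tf.keras.layers.Dense(100_000)(hidden)  # very large classification layer
    loss = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, logits, from_logits=True))
```

Under this kind of annotation, the replicated backbone and the split classification layer together form the hybrid strategy evaluated in the experiments below.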
In our experiments with a BertLarge model built with EasyTransfer, Whale's pipeline strategy attains a 57% speedup over the Horovod data parallel strategy (HDP) on 64 GPUs. In a large-scale image classification task (100,000 classes), Whale's hybrid strategy, which combines operator splitting and DP, is 14.8 times faster than HDP on 64 GPUs. For models that cannot fit into GPU device memory, Whale enables the training of T5 and of an image classification task with 100 billion classes.