PhD Defense: Optimizing Communication in Parallel Deep Learning on Exascale-class Supercomputers
IRB 5237
Deep learning has made significant advances across many fields, driven by ever-larger neural networks and massive datasets. These improvements, however, come at the cost of enormous computational demands, requiring thousands of GPUs operating in parallel for extreme-scale model training. At such scales, the overheads of inter-GPU communication become a major bottleneck, severely limiting hardware utilization.
This thesis addresses the challenge of optimizing communication in large-scale parallel deep learning. First, it introduces a novel four-dimensional hybrid parallel algorithm designed to minimize communication overhead while remaining easy for practitioners to use. Second, it presents a topology-aware communication model that identifies optimal configurations for this algorithm based on the hardware architecture, improving efficiency and scalability. Finally, the thesis develops highly scalable implementations of collective communication primitives commonly used in distributed deep learning, further enhancing performance. Taken together, these optimizations enable LLM training to scale efficiently to more than 16,000 GPUs on the Frontier supercomputer, achieving a throughput of nearly 1.4 exaFLOP/s.
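To give a flavor of the configuration problem the second contribution addresses, the sketch below enumerates ways to split a GPU count across four parallelism dimensions and ranks them with a toy cost that penalizes placing large dimensions on slower links. The dimension ordering, link weights, and cost function are illustrative assumptions only; they are not the thesis's actual algorithm or its topology-aware performance model.

```python
# Illustrative sketch only: the cost weights and search strategy are
# assumptions for exposition, not the thesis's actual 4D algorithm or
# its topology-aware communication model.
from itertools import islice


def factorizations(n_gpus, dims=4):
    """Yield every way to split n_gpus across `dims` ordered parallelism dimensions."""
    def helper(remaining, depth):
        if depth == 1:
            yield (remaining,)
            return
        for d in range(1, remaining + 1):
            if remaining % d == 0:
                for rest in helper(remaining // d, depth - 1):
                    yield (d,) + rest
    yield from helper(n_gpus, dims)


def toy_comm_cost(shape, link_weights=(1.0, 2.0, 4.0, 8.0)):
    """Made-up cost: dimensions mapped to slower links (larger weight)
    are penalized for being large, mimicking topology awareness."""
    return sum(w * (d - 1) for d, w in zip(shape, link_weights))


if __name__ == "__main__":
    n_gpus = 1024  # hypothetical job size
    best = min(factorizations(n_gpus), key=toy_comm_cost)
    print(f"Lowest-cost 4D grid for {n_gpus} GPUs: {best}")
    print("First few candidates:", list(islice(factorizations(n_gpus), 5)))
```

In practice, a real model of this kind would account for message sizes, collective algorithms, and the machine's actual network hierarchy rather than a single weight per dimension; the point here is only that the search space of 4D configurations is large and that a cost model can prune it automatically.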