Special Talk in Workshop: Machine Learning for Systems
Richard Ho: Navigating Scaling and Efficiency Challenges of ML Systems
Richard Ho
ML progress has followed scaling laws whereby increased computational power yields greater capability, driving a marked rise in massive supercomputing clusters. However, the path to greater compute capacity is not solely about delivering higher FLOPS: system designers must grapple with the complexity of balancing compute resources, memory capacity and bandwidth, I/O bandwidth, and latency in large-scale systems.
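One way to see why balancing compute against memory bandwidth matters is the standard roofline model (an illustrative sketch, not taken from the talk): a kernel's attainable throughput is capped by the lesser of peak compute and memory bandwidth times arithmetic intensity. The hardware numbers below are assumed round figures for illustration only.

```python
def attainable_flops(peak_flops: float, mem_bw_bytes_per_s: float,
                     arithmetic_intensity: float) -> float:
    """Roofline model: attainable throughput is bounded by
    min(peak compute, memory bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bw_bytes_per_s * arithmetic_intensity)

# Assumed illustrative figures: 1 PFLOP/s peak, 3 TB/s memory bandwidth.
peak, bw = 1e15, 3e12
for ai in (10, 100, 1000):  # arithmetic intensity in FLOPs per byte moved
    ceiling = attainable_flops(peak, bw, ai) / peak
    print(f"AI = {ai:>4} FLOP/B -> utilization ceiling {ceiling:.0%}")
```

Low-intensity workloads are bandwidth-bound, so adding raw FLOPS alone cannot close the gap between peak and achieved performance.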
At these scales, the gap between peak theoretical FLOPS and actual utilization becomes significant, underscoring the importance of efficiency. System scaling is further limited by thermal constraints and power-delivery challenges. As systems grow, the reliability of components across the entire system becomes critical, and the system-level Mean Time Between Failures (MTBF) decreases. Soft errors, both silent and detectable, magnify these challenges.
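The reliability point can be made concrete with a standard series-reliability estimate (an illustrative sketch, not from the talk): assuming independent components with exponentially distributed failures, failure rates add, so system MTBF falls roughly in proportion to component count. The component MTBF below is an assumed figure.

```python
def system_mtbf(component_mtbf_hours: float, n_components: int) -> float:
    """Series-reliability estimate: with independent, exponentially
    distributed failures, failure rates add across components, so
    system MTBF ~= component MTBF / N."""
    return component_mtbf_hours / n_components

# Assumed illustrative figure: each accelerator has a ~5-year
# (43,800-hour) individual MTBF; cluster it at increasing scale.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} components -> system MTBF {system_mtbf(43_800, n):.2f} h")
```

At cluster scale, the expected time between failures shrinks from days to well under an hour, which is why fault tolerance and detection of silent errors become first-order design concerns.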
When designing future ML computer systems, it becomes critical to consider how strategies for addressing these inherent scaling limitations interact.