Oral in Workshop: Machine Learning with New Compute Paradigms
Scaling of Optical Transformers
Maxwell Anderson · Shi-Yuan Ma · Tianyu Wang · Logan Wright · Peter McMahon
The rapidly increasing size of deep-learning models has renewed interest in alternatives to digital-electronic computers as a means to dramatically reduce the energy cost of running state-of-the-art neural networks. Optical matrix-vector multipliers are best suited to performing computations with very large operands, which suggests that large Transformer models could be a good target for them. However, how efficiently an optical accelerator can run a model depends on the model itself, and on whether the model can be run at all under the noise, error, and low precision of analog-optical hardware. Here we investigate whether Transformers meet the criteria to run efficiently on optical hardware, what benefits doing so offers, and how worthwhile it is at scale. Using small-scale experiments on, and simulations of, a prototype hardware accelerator, we found that Transformers can run on optical hardware, and that elements of their design (the ability to process data in parallel using the same weights, and the trend of scaling them to enormous widths) allow them to achieve an asymptotic energy-efficiency advantage when run optically rather than on digital hardware. Based on a model of a full optical accelerator system, we predict that well-engineered, large-scale optical hardware should achieve a 100× energy-efficiency advantage over current digital-electronic processors when running some of the largest current Transformer models, and that if both the models and the optical hardware are scaled to the quadrillion-parameter regime, optical accelerators could have a >8,000× energy-efficiency advantage.
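The asymptotic advantage claimed above can be illustrated with a toy cost model. The sketch below is an illustrative assumption, not the paper's detailed accelerator model: it assumes digital energy scales with the number of multiply-accumulates (~d² for a d×d matrix-vector product), while optical energy is dominated by per-channel input/output costs (modulation and detection) that scale only as ~d, so the optical energy per MAC falls as ~1/d at large widths. The constants `e_mac` and `e_per_channel` are arbitrary placeholders.

```python
# Toy energy-scaling sketch (illustrative assumptions, not the paper's model).
# Digital: energy per MAC is roughly constant, so a d x d matrix-vector
# product costs ~ e_mac * d**2.
# Optical: the dominant per-shot costs (input modulation, output detection)
# scale with the number of channels, ~ d, so energy per MAC falls as ~ 1/d.

def digital_energy(d: int, e_mac: float = 1.0) -> float:
    """Energy of a d x d matrix-vector multiply on digital hardware."""
    return e_mac * d * d

def optical_energy(d: int, e_per_channel: float = 100.0) -> float:
    """Energy dominated by per-channel I/O costs; assumed ~ d, not d^2."""
    return e_per_channel * d

for d in (10, 1_000, 100_000):
    advantage = digital_energy(d) / optical_energy(d)
    print(f"d={d:>7}: optical energy advantage ~ {advantage:.1f}x")
```

With these placeholder constants the crossover favors optics only at large d, and the advantage grows without bound as width increases, which is the qualitative scaling behavior the abstract describes.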