Poster
in
Workshop on Advancing Neural Network Training (WANT): Computational Efficiency, Scalability, and Resource Optimization
DYAD: A Descriptive Yet Abjuring Density efficient approximation to linear neural network layers
Sarin Chandy · Varun Prashant Gangal · Yi Yang · Gabriel Maggiotti
We devise, implement and performance-assess DYAD, a layer which can serve as a faster and more memory-efficient approximate replacement for linear layers (nn.Linear() in PyTorch). These layers appear in common subcomponents, such as the ff module of Transformers. DYAD is based on a bespoke near-sparse matrix structure which approximates the dense "weight" matrix W that matrix-multiplies the input in the typical realization of such a layer, a.k.a. DENSE. Our alternative near-sparse matrix structure is decomposable into a sum of 2 matrices, each permutable to a block-sparse counterpart. These can be represented as 3D tensors, which in unison allow faster execution of matrix multiplication with the mini-batched input matrix compared to DENSE (O(rows(W) × cols(W)) → O(rows(W) × cols(W) / (# of blocks))). As the crux of our experiments, we pretrain both DYAD and DENSE variants of 2 sizes of the OPT arch and 1 size of the Pythia arch, including at different token scales of the babyLM benchmark. We find DYAD to be competitive (≥ 90% of DENSE performance) on zero-shot (e.g. BLIMP), few-shot (OPENLM) and finetuning (GLUE) benchmarks, while being ≥ 7-15% faster to train on-GPU even at 125m scale, besides surfacing larger speedups at increasing scale and model width.
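The core idea in the abstract — replacing the dense weight W with a sum of two block-structured matrices, each stored as a 3D tensor so the matmul cost drops by a factor of the block count — can be sketched as follows. This is a minimal NumPy illustration assuming one particular block layout and a random feature permutation; the block sizes, the permutation, and the names `block_apply`/`dyad_apply` are illustrative assumptions, not the authors' exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, n_blocks = 8, 8, 4       # toy sizes; d_in, d_out divisible by n_blocks
bi, bo = d_in // n_blocks, d_out // n_blocks

# DENSE baseline: y = x @ W, cost O(rows(W) * cols(W)) per input row.
W = rng.standard_normal((d_in, d_out))

# DYAD-style approximation (illustrative): two block-diagonal factors, the
# second applied under a fixed feature permutation so their sum is not itself
# block-diagonal. Each factor is stored as a 3D tensor of shape
# (n_blocks, bi, bo), so the per-row matmul cost drops to
# O(rows(W) * cols(W) / n_blocks).
B1 = rng.standard_normal((n_blocks, bi, bo))
B2 = rng.standard_normal((n_blocks, bi, bo))
perm = rng.permutation(d_in)          # stand-in for the paper's permutation

def block_apply(x, B):
    """Batched block matmul: (batch, d_in) -> (batch, d_out)."""
    xb = x.reshape(x.shape[0], n_blocks, bi)      # split features into blocks
    yb = np.einsum('snb,nbo->sno', xb, B)         # independent per-block matmuls
    return yb.reshape(x.shape[0], -1)

def dyad_apply(x):
    """Sum of two block-structured products, one on permuted features."""
    return block_apply(x, B1) + block_apply(x[:, perm], B2)

x = rng.standard_normal((2, d_in))
y = dyad_apply(x)
print(y.shape)                        # (2, 8)
```

Each `block_apply` is exactly equivalent to multiplying by a block-diagonal matrix assembled from the slices of the 3D tensor, which is why the two factors together can be "permuted to a block-sparse counterpart" while touching only 1/n_blocks of the entries a dense W would.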