Today’s state of deep neural network inference can be summed up in two words: complex and inefficient. The quest for accuracy has led to overparameterized deep neural networks that require heavy compute resources to solve the tasks at hand, and as such we are “rapidly approaching outrageous computational, economic, and environmental costs to gain incrementally smaller improvements in model performance” (State of AI Report 2020). Meanwhile, there is no lack of research on achieving high levels of unstructured sparsity, but putting that research into practice remains a challenge. As a result, data scientists and machine learning engineers are often forced to make tradeoffs between model performance, accuracy, and inference costs.
There is a better way.
After years of research at MIT, the team at Neural Magic concluded that throwing teraflops at dense models is not sustainable. So we've taken the best of known research on model compression (unstructured pruning and quantization, in particular) and efficient sparse execution to build a software solution that delivers efficient deep neural network inference on everyday CPUs, without the need for specialized hardware.
Join Neural Magic ML experts to learn how we successfully applied published research on model compression and efficient sparse execution to build software that compresses and optimizes deep learning models for efficient inference with ease.
You’ll walk away with: an overview of SOTA model compression techniques; a demo of the first-ever general-purpose inference engine that translates high sparsity levels into significant speedups; and next steps for using the Neural Magic Inference Engine and ML tools to make your inference efficient, with less complexity.