We will showcase large language model inference on novel hardware appliances using transformer models readily available on Hugging Face. We will demonstrate how easily inference can be switched from conventional NVIDIA systems to our own Positron hardware. We will run multiple variants of the Llama large language models, followed by LLaVA, an open-source counterpart to GPT-4V, captioning audience-submitted images live in a semantic captioning demo.
Lastly, we will demonstrate the cost penalties incurred by incumbent hardware versus the comparative advantage of a solution built from the ground up for transformers. We may also share a couple of techniques we use to efficiently serve numbers of simultaneous users that are simply not possible on incumbent GPU architectures.