Poster in Workshop: Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning
Extracting Parallelism from Large Language Model Queries
Steven Kolawole · Keshav Santhanam · Pratiksha Thaker · Virginia Smith
Abstract:
Optimization engines for LLM query serving typically focus on workloads with known structure, treating the query itself as a black box. In this work, we investigate extracting parallelization opportunities from individual queries that contain decomposable subtasks. Using the LMSYS-Chat-1M dataset, we identify three query categories that are amenable to decomposition into parallel LLM calls, and curate a dataset of these queries as a benchmark for this type of within-query parallelization. We develop a prototype system to parallelize these queries and report initial performance results, showing that parallelization can yield up to a 5x speedup over serial execution with comparable or even improved generation quality.
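The within-query parallelization idea described above can be illustrated with a minimal sketch: a query with decomposable subtasks is split into independent sub-queries that are issued concurrently instead of serially. Here `llm_call` is a hypothetical stand-in for a model API call (not part of the authors' system), with simulated latency; the actual decomposition and serving logic in the prototype is more involved.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def llm_call(subquery: str) -> str:
    """Hypothetical stand-in for an LLM API call; sleeps to simulate generation latency."""
    time.sleep(0.1)
    return f"answer({subquery})"

def run_serial(subqueries):
    # Baseline: one LLM call after another.
    return [llm_call(q) for q in subqueries]

def run_parallel(subqueries):
    # Within-query parallelization: issue all decomposed sub-calls concurrently.
    with ThreadPoolExecutor(max_workers=len(subqueries)) as pool:
        return list(pool.map(llm_call, subqueries))

# Example: a query decomposed into five independent subtasks.
subqueries = [f"subtask {i}" for i in range(5)]

t0 = time.perf_counter()
serial_out = run_serial(subqueries)
t_serial = time.perf_counter() - t0

t0 = time.perf_counter()
parallel_out = run_parallel(subqueries)
t_parallel = time.perf_counter() - t0

# Same answers, lower wall-clock latency.
assert serial_out == parallel_out
print(f"serial: {t_serial:.2f}s, parallel: {t_parallel:.2f}s")
```

With five 0.1 s sub-calls, the serial path takes roughly 0.5 s while the parallel path takes roughly the latency of a single call, mirroring the kind of wall-clock speedup the paper reports for decomposable queries.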