Poster in Workshop: Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning
Extracting Parallelism from Large Language Model Queries
Steven Kolawole · Keshav Santhanam · Pratiksha Thaker · Virginia Smith
Abstract:
Optimization engines for LLM query serving typically focus on workloads with known structure, treating the query itself as a black box. In this work, we investigate extracting parallelization opportunities from individual queries that contain decomposable subtasks. Using the LMSYS-Chat-1M dataset, we identify three query categories that are amenable to decomposition into parallel LLM calls, and curate a dataset of these queries as a benchmark for this type of within-query parallelization. We develop a prototype system to parallelize these queries and report initial performance results, showing that parallelization can yield up to a 5x speedup over serial execution with comparable or even improved generation quality.
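The within-query parallelization idea described above can be illustrated with a minimal sketch: a query with decomposable subtasks is split into independent sub-queries that are issued concurrently instead of serially. Here `llm_call` is a hypothetical stand-in for a model API call (not part of the authors' system), with simulated latency; the actual decomposition and serving logic in the prototype is more involved.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def llm_call(subquery: str) -> str:
    """Hypothetical stand-in for an LLM API call; sleeps to simulate generation latency."""
    time.sleep(0.1)
    return f"answer({subquery})"

def run_serial(subqueries):
    # Baseline: one LLM call after another.
    return [llm_call(q) for q in subqueries]

def run_parallel(subqueries):
    # Within-query parallelization: issue all decomposed sub-calls concurrently.
    with ThreadPoolExecutor(max_workers=len(subqueries)) as pool:
        return list(pool.map(llm_call, subqueries))

# Example: a query decomposed into five independent subtasks.
subqueries = [f"subtask {i}" for i in range(5)]

t0 = time.perf_counter()
serial_out = run_serial(subqueries)
t_serial = time.perf_counter() - t0

t0 = time.perf_counter()
parallel_out = run_parallel(subqueries)
t_parallel = time.perf_counter() - t0

# Same answers, lower wall-clock latency.
assert serial_out == parallel_out
print(f"serial: {t_serial:.2f}s, parallel: {t_parallel:.2f}s")
```

With five 0.1 s sub-calls, the serial path takes roughly 0.5 s while the parallel path takes roughly the latency of a single call, mirroring the kind of wall-clock speedup the paper reports for decomposable queries.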