

Poster

Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack

Xiaoyue Xu · Qinyuan Ye · Xiang Ren

East Exhibit Hall A-C #2802
Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

We propose Lifelong ICL, a problem setting that challenges long-context language models (LMs) to learn various language tasks sequentially through in-context learning (ICL). We further introduce Task Haystack, an evaluation suite designed for assessing and diagnosing how long-context LMs use long contexts. When given a task instruction and test inputs, long-context LMs are expected to leverage the same-task demonstrations in the Lifelong ICL prompt, avoid distraction from other tasks, and achieve a test accuracy no worse than their single-task ICL baseline. Task Haystack draws inspiration from the widely adopted "needle-in-a-haystack" (NIAH) evaluation, but presents new and unique challenges. It demands that models (1) utilize context in a genuinely contextualized manner, rather than resorting to simple copying and pasting; and (2) navigate through long streams of evolving topics and tasks, which closely approximates the complexities of real-world scenarios faced by long-context LMs. Additionally, Task Haystack inherits the controllability aspect of NIAH, providing model developers with tools and visualizations to identify model vulnerabilities effectively. We benchmark ten long-context LMs using Task Haystack. We find that state-of-the-art closed models such as GPT-4o still struggle in this setting, failing 12.5% of the cases on average, while all open models we evaluate lag further behind by a large margin. In our controlled analysis, we find that long-context LMs are susceptible to distractibility and recency bias, and that both factors contribute to the failure cases. Further, we observe declines in performance when task instructions are paraphrased at test time or when ICL demonstrations are repeated excessively, raising concerns about the robustness, instruction understanding, and true context utilization of current long-context LMs. We release our code and data to encourage further research that addresses these limitations.
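
The setting described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' released code: the `Task` class, the `Input:`/`Output:` prompt template, and the `tolerance` parameter are all hypothetical. It also assumes that "no worse than the single-task ICL baseline" means a direct per-task accuracy comparison, since the abstract does not say how that comparison is made.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical container for one task in the stream."""
    name: str
    instruction: str
    demos: list[tuple[str, str]]  # (input, label) demonstration pairs


def build_lifelong_prompt(tasks: list[Task]) -> str:
    """Concatenate every task's instruction and demonstrations, one task after another."""
    blocks = []
    for task in tasks:
        lines = [task.instruction]
        lines += [f"Input: {x}\nOutput: {y}" for x, y in task.demos]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


def build_test_prompt(context: str, task: Task, test_input: str) -> str:
    """Append the target task's instruction and one test input to the long context."""
    return f"{context}\n\n{task.instruction}\nInput: {test_input}\nOutput:"


def pass_rate(
    lifelong_acc: dict[str, float],
    single_task_acc: dict[str, float],
    tolerance: float = 0.0,  # assumed slack; not taken from the abstract
) -> float:
    """Fraction of tasks whose Lifelong ICL accuracy is no worse than the single-task baseline."""
    passed = [
        lifelong_acc[name] >= single_task_acc[name] - tolerance
        for name in single_task_acc
    ]
    return sum(passed) / len(passed)
```

Under these assumptions, a failure rate such as the one reported for GPT-4o would correspond to `1 - pass_rate(...)` aggregated over the evaluated cases.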
