This is an extremely valuable piece of information for me:
As context increases, reliability decreases. It's not binary (works / doesn't work) as I had assumed, but a gradual degradation. This is very important to know when using Windsurf, Cursor, etc. I'm going to start new conversations much more often.
> For instance, the NoLiMa benchmark revealed that models like GPT-4o experienced a significant drop from a 99.3% performance rate at 1,000 tokens to 69.7% at 32,000 tokens. Similarly, Llama 3.3 70B's effectiveness decreased from 97.3% at 1,000 tokens to 42.7% at 32,000 tokens, highlighting the challenges LLMs face with longer contexts.
https://digialps.com/nolima-reveals-llm-performance-drops-be...