Code at https://github.com/SusCom-Lab/ZeroMerge
This feels like something that could be done in part by hand. We store documents in KV that are often built deterministically by merging two or three pieces of data, one of which is a form of string interpolation (e.g., templates).
Effectively, if you had a microservice that did extremely light data processing and you moved the KV store behind it instead of in front of it, you'd achieve a similar aim. A small cache in front of it, or even upstream, would reduce the recalculation in the face of thundering herds.
This sounds like you're thinking of KV as in key-value, like Redis or S3. This paper is about the KV cache as in LLMs, which is for reducing the computational complexity of self-attention. It has nothing to do with what you wrote, unless I misunderstood you (I'm confused about what "upstream" would mean here - the contents of the KV cache are specific to the context provided to the LLM / what the LLM is generating in response).
Is this some joke? They use Llama 2 7B? What year is it?
The best model is the one you can fit in memory.
About as soon as GPT-4 came out I said that OpenAI was doomed on the trajectory it was on because they could not afford to develop a GPT-5, GPT-6, etc.
Real innovation comes out of doing a lot of experiments, and that means doing experiments quickly with the resources you have. So you do most of your experiments with non-frontier models, enough to make a good prediction of what would happen if you maxed out your model size, then you go big. That's how you make everyone else have a "DeepSeek moment".
A company like Apple wants to pick something on the frontier and keep advancing on a straight line. Works great if you want to make an M1, M2, M3, ... ARM chip but that's not how progress works in AI today.
Will we see models built on b-trees to deal with memory requirements? Have we already?
Deepseek is already using SSDs for their KV cache: https://github.com/deepseek-ai/3FS
You are deeply misunderstanding what the KV cache referred to here is. It's not for storing data. This is the KV cache that's part of the model, used to reduce the compute complexity of self-attention from quadratic to linear during generation. It's not stored on SSD - it lives in VRAM (or in system RAM if you're not using a GPU).
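For anyone who hasn't run into the term, here's a minimal toy sketch of what that cache is (pure NumPy, illustrative only, not any particular framework's API). During autoregressive decoding, each new token computes only its own key/value projections; the keys and values of earlier tokens are reused from the cache instead of being recomputed at every step, and that cached data is exactly what the paper is trying to shrink.

```python
import numpy as np

d = 64  # head dimension (toy size)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

k_cache, v_cache = [], []  # the "KV cache": one key and one value vector per past token

def decode_step(x):
    """Single-head attention for the newest token; x is its embedding, shape (d,)."""
    q = x @ Wq
    # Only the new token's K/V are computed; everything older comes from the cache.
    k_cache.append(x @ Wk)
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)  # (t, d) each, grows by one row per token
    scores = K @ q / np.sqrt(d)                  # attend over all cached positions
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                                 # attention output for the new token

for _ in range(5):                               # generate a few tokens
    decode_step(rng.standard_normal(d))
```

The cache grows linearly with context length (per layer, per head), which is why long contexts blow up memory and why eviction/compression methods like the one in the paper exist.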
They do, in fact, mention the inference KV cache as a use case in the README. The most advanced KV caching setups use a hierarchy of GPU RAM / regular RAM / SSD. It seems they were able to use their storage abstraction for that last tier.
https://github.com/deepseek-ai/3FS?tab=readme-ov-file#3-kvca...
> KVCache is a technique used to optimize the LLM inference process. It avoids redundant computations by caching the key and value vectors of previous tokens in the decoder layers. The top figure demonstrates the read throughput of all KVCache clients (1×400Gbps NIC/node), highlighting both peak and average values, with peak throughput reaching up to 40 GiB/s.
That's because DeepSeek uses MLA which apparently does allow offloading the KV cache. That doesn't apply to all models, particularly the open-weight models that are primarily GQA AFAIK.
Any model allows offloading the KV cache; it's not a matter of model architecture, only of the implementation. The only real difference is for non-transformer models. For all attention models it's the same - a blob of data per token. It's much worse for older models with MHA because their KV cache is just too big, and it's better for DeepSeek because their KV cache is the smallest. But it's alright for the current generation of GQA models as well.
Are you sure about that? GQA applies self-attention to every KV cache entry. If you're offloading, then you're having to dynamically page in all the KV cache entries into the GPU which is quite slow since the CPU/GPU link only has so much bandwidth. My understanding is that MLA reduces the size of the KV cache & doesn't necessarily attend to every KV token at every step which is why offloading to disk works (i.e. most of the tokens can remain on disk without ever being loaded into the GPU).
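Some rough back-of-the-envelope numbers on that paging cost (every figure here is an assumption for illustration - a Llama-3-8B-style GQA layout and a PCIe 4.0 x16 link, not measurements):

```python
# Back-of-the-envelope KV cache size and page-in time; all numbers are assumptions.
layers, kv_heads, head_dim = 32, 8, 128      # Llama-3-8B-like GQA layout (assumed)
bytes_per_elem = 2                           # fp16
seq_len = 32_000                             # tokens of cached context

cache_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len  # 2x for K and V
print(f"KV cache: {cache_bytes / 2**30:.1f} GiB")               # ~3.9 GiB

pcie_bytes_per_s = 32e9                      # very rough PCIe 4.0 x16 figure
print(f"full page-in: {cache_bytes / pcie_bytes_per_s:.2f} s")  # ~0.13 s
```

Under those assumptions, pulling a whole cache back across the link is on the order of a hundred milliseconds - tolerable once per resumed request, painful if you had to do it on every decode step.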
Offloading in this case doesn't mean keeping the KV cache on disk / in storage all the time; it means keeping it there when the request isn't in the process of generation. While a request is being generated, its KV cache is indeed in VRAM.
As for MLA - DeepSeek, just like the others, attends to all historical tokens. The only difference is that instead of holding actual KV entries, it holds lower-dimension KV entries, which are projected into full-blown KV entries on the fly during attention. It's similar to GQA, except that instead of just duplicating KV entries by group size it applies a linear transformation.
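A shapes-only way to see that difference (illustrative NumPy, not DeepSeek's actual code; the head counts and latent size are assumptions): GQA caches a small number of K/V heads and repeats each across its query group, while MLA caches one low-dimensional latent per token and expands it to full K/V with learned up-projections at attention time.

```python
import numpy as np

n_q_heads, n_kv_heads, head_dim = 32, 8, 128   # assumed GQA layout
latent_dim = 512                               # assumed MLA latent size
rng = np.random.default_rng(0)

# GQA: cache n_kv_heads keys (and values) per token, duplicate across query groups.
k_gqa = rng.standard_normal((n_kv_heads, head_dim))            # what sits in the cache
k_full = np.repeat(k_gqa, n_q_heads // n_kv_heads, axis=0)     # (32, 128) used by attention

# MLA: cache one shared latent per token, project it to full keys on the fly.
c_kv = rng.standard_normal(latent_dim)                         # what sits in the cache
W_uk = rng.standard_normal((latent_dim, n_q_heads * head_dim)) # learned up-projection
k_mla = (c_kv @ W_uk).reshape(n_q_heads, head_dim)             # (32, 128) used by attention
# (values come from the same latent via a second up-projection, omitted here)

print("cached floats per token per layer, GQA (K+V):", 2 * k_gqa.size)  # 2048
print("cached floats per token per layer, MLA:", c_kv.size)             # 512
```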
Ah OK. So this is for resuming chat context cheaply. What I said is still correct - 3FS is not part of the inference flow & not relevant to the paper which is about optimizing the KV cache usage at runtime.
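Concretely, "resuming chat context" here looks something like the sketch below (hedged: torch.save/torch.load as stand-ins, a made-up mount point and session id, and nothing that claims to be 3FS's actual API): park the finished request's K/V tensors on the storage tier, then reload them the next time that conversation comes back so you only pay the prefill once.

```python
import torch

def offload_kv(kv_cache, path):
    """After a turn finishes: move per-layer (K, V) tensors off the GPU and persist them."""
    torch.save([(k.cpu(), v.cpu()) for k, v in kv_cache], path)

def resume_kv(path, device="cuda"):
    """When the conversation comes back: reload the cache instead of re-running prefill."""
    return [(k.to(device), v.to(device)) for k, v in torch.load(path)]

# Hypothetical usage; the path is a placeholder, not a 3FS-specific layout.
# offload_kv(kv_cache, "/mnt/3fs/sessions/abc123.pt")
# kv_cache = resume_kv("/mnt/3fs/sessions/abc123.pt")
```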
I mean, there are other, better 7B models than Llama 2 at this point.
If they had experimented with a newer model (Gemma 3, a DeepSeek-R1 7B distill, etc.) and reported better results, would that be because their newer baseline model was better than the Llama 2 model used in the previous methods' experiments? A more comprehensive study would include results for as many baseline models as possible. But there are likely other researchers in the lab all waiting to use those expensive GPUs for their experiments as well.
Sure. But papers take a really long time to write and go through peer review. I think my paper on collaborative editing took about 4 months from the point where we were done writing to the point at which it appeared on arXiv.
This research was almost certainly done well before Gemma 3 and Deepseek were released.
> Is this some joke? They use Llama 2 7B? What year is it?
They use Llama 2 to demonstrate that their compression method works. There are two potential cases:
1. The method works on all / most LLMs. In this case, it does not matter on which model they demonstrated the effect.
2. The method only works on Llama 2, but not on other models. Given that they published the code, I expect people will quickly test the method on many other models, so we will know soon. And even then there would be scientific significance if it works only on Llama 2, as it would mean there's something special and good about that architecture.
But I would bet it's #1 - the method works on most models, and they just picked whatever they already had code bindings to, to save effort.