Overall the industry needs more review and less hype. I was shocked to find out SWE-bench Verified [0] is all but verified.
[0] The benchmark used by all major vendors to "showcase" coding ability turns out to be <10% properly solved: https://www.youtube.com/watch?v=QnOc_kKKuac
Failure modes are also interesting for showing what is and isn't really happening. Take the test of asking GenAI to draw clocks showing specific times, or people drawing with their left hand. All you get are clocks showing ten past two and people drawing with their right hand, because that's 99% of what's in the training data.
As Sabine says, if LLMs have already read every math book in the world but still can't do basic math without calling a calculator, how much reasoning is really emerging?
"The Path to AGI is Coming Into View": https://youtu.be/mfbRHhOCgzs?t=219
I would offer an alternative possible explanation.
Having trained quite a few LLMs by now, especially around the “uplift” from a text-completion model to an instruct one, I've noticed that instruction-following capability tends not to be uniform across all the tasks the LLM is able to perform.
In other words, the model doesn’t know what your request (the time on the clock) implies for the output (the SVG drawing of the clock), or any of the abstractions and steps in between; that is why changing the time in the prompt might yield little difference.
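To make that chain concrete, here is a minimal sketch (just an illustration I'm writing here, not anything a model actually executes; clock_svg is a made-up helper) of what "time in the prompt → change in the SVG" involves: the requested time has to become hand angles, and the angles have to become drawing coordinates.

    import math

    def clock_svg(hour: int, minute: int) -> str:
        # step 1: the requested time must be turned into hand angles
        minute_angle = minute * 6                      # 360 degrees / 60 minutes
        hour_angle = (hour % 12) * 30 + minute * 0.5   # 360 / 12, plus drift from minutes

        # step 2: the angles must be turned into line endpoints on the canvas
        def hand(angle_deg: float, length: float, width: int) -> str:
            rad = math.radians(angle_deg - 90)         # 12 o'clock points straight up
            x = 50 + length * math.cos(rad)
            y = 50 + length * math.sin(rad)
            return (f'<line x1="50" y1="50" x2="{x:.1f}" y2="{y:.1f}" '
                    f'stroke="black" stroke-width="{width}"/>')

        return ('<svg viewBox="0 0 100 100" xmlns="http://www.w3.org/2000/svg">'
                '<circle cx="50" cy="50" r="45" fill="none" stroke="black"/>'
                + hand(hour_angle, 25, 3) + hand(minute_angle, 38, 2) + '</svg>')

    print(clock_svg(3, 45))  # a time other than the ubiquitous 10:10; both hands must move

Nothing in that chain is hard, but every step has to be represented somewhere for the change in the prompt to propagate all the way to the drawing.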
Instruct datasets are quite small in retrospect and hardly cover the range of tasks instruct LLMs are asked to perform; they mostly rely on the LLM extrapolating (interpolating?) from that small training set. And it is this generalization that is not evenly distributed.
Just my two cents from my modest observations, sprinkle [citation needed] everywhere.
I'm not sure what Sabine means. It is a somewhat obvious category error, and fundamentally mistaken regardless. (I find it hard to believe, for example, that Sabine would beat an LLM on a random selection of ten 3-digit-by-3-digit multiplication problems, to be completed in 60 seconds max by either party.)
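For concreteness, the test I have in mind is something like the sketch below (the 60-second budget and problem sizes are just the ones from my example; the harness itself is purely hypothetical):

    import random, time

    # Ten random 3-digit x 3-digit multiplications, to be answered within 60 seconds.
    random.seed(0)
    problems = [(random.randint(100, 999), random.randint(100, 999)) for _ in range(10)]

    start = time.time()
    for a, b in problems:
        answer = a * b   # the contestant (human or LLM) must produce this without a calculator
        print(f"{a} x {b} = {answer}")
    print(f"done in {time.time() - start:.2f}s (budget: 60s)")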
Or overflowing wine glasses.
Numeracy isn't mathematical reasoning.
I've seen the same "Superficial Self-Reflection" mentioned in their linked blog post [0] as well, where the conclusion doesn't naturally follow from the thinking tokens. I think people are fooled by this, but if you take the time to inspect the "chain of thought" tokens, they often don't match the final answer.
I don't deny that performance on certain logic tasks goes up with these models, but I don't fully understand what role the thinking tokens play in those cases.
[0] https://oatllm.notion.site/oat-zero
I heard that even just getting the model to print a bunch of whitespace ("think for longer") improves the quality of the final response, because some kind of processing is still happening internally?
Here's one where just appending dots to the output improved quality:
https://arxiv.org/abs/2404.15758
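A rough sketch of that flavor of experiment, in the spirit of the paper rather than its actual setup (the paper trains with filler tokens; here `generate` is just a placeholder for whatever model call you have, so this is only prompt-level padding):

    # Compare a plain prompt against the same prompt padded with meaningless dots.
    # `generate` is a stand-in for any callable that maps a prompt to a completion.

    def without_filler(question: str) -> str:
        return f"{question}\nAnswer:"

    def with_filler(question: str, n_dots: int = 50) -> str:
        return f"{question}\n{'.' * n_dots}\nAnswer:"

    def compare(generate, question: str) -> tuple[str, str]:
        return generate(without_filler(question)), generate(with_filler(question))

    # usage: plain, padded = compare(my_generate_fn, "Is 7 * 13 greater than 90?")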
Could it be that the model just uses the latent space for thinking while generating near-garbage tokens? It would be interesting to check whether appending something repetitive to the end of the prompt helps, i.e. whether the model uses it for 'thinking'.
Latent space has a certain shape, which may mean I'm missing a technical distinction.*
There have been publications on a pause token (https://arxiv.org/abs/2310.02226), a backspace token (https://arxiv.org/abs/2306.05426), and a think token (https://arxiv.org/html/2405.08644v1#Ch0.S4), all based on the theory that a generic token can act as a placeholder for manipulating attention further without producing further meaningful output.
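Mechanically, the pause-token variant boils down to something like this sketch (Hugging Face-style calls with a small stand-in model; the papers' actual training recipes and placement strategies differ):

    # Add a dedicated <pause> token and interleave it into training text, so the
    # model can spend forward passes on it without emitting meaningful output.
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_name = "gpt2"  # small placeholder model, purely for illustration
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    tokenizer.add_special_tokens({"additional_special_tokens": ["<pause>"]})
    model.resize_token_embeddings(len(tokenizer))  # give the new token an embedding row

    def insert_pauses(text: str, every: int = 8, n_pauses: int = 2) -> str:
        # naive interleaving: a burst of <pause> tokens after every `every` words
        words, out = text.split(), []
        for i, w in enumerate(words, 1):
            out.append(w)
            if i % every == 0:
                out.extend(["<pause>"] * n_pauses)
        return " ".join(out)

    batch = tokenizer(insert_pauses("The quick brown fox jumps over the lazy dog once more."),
                      return_tensors="pt")
    # ...train on `batch` as usual; at inference time, <pause> outputs are simply discarded.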
However, in practice those approaches haven't been used in the training of a large-scale model, i.e. I haven't seen it at all; the most adventurous thing people have done at scale is Mamba (and RL).
* "Latent space" had a particular technical meaning. The first round of the telephone game was when it came to mean "an N-dimensional space, analogous to a space with 3 spatial dimensions, inside an image diffusion model, that contains all possible image styles and is navigated by a prompt." We're many iterations afield of that now, I'm afraid. Now you sort of have to interpret it like you would negative space: defined by what is around it.
Printing a bunch of whitespace is a way of entering a new state (I am thinking of a state machine), so the LLM can use that whitespace as a new token that can later be used to refine the state of the system. In math terms, whitespace is a tag for a class (or state) in the LLM.

I think RL could perhaps take advantage of such tags. For example, whitespace could indicate a point of low gradient (indetermination) or a branching point, and the LLM would in some way learn to raise the learning-rate parameter; the message in the head of the LLM would be: be ready to learn from RL, because in your current state you need to take a branch from a branching point that can enhance your capabilities. This is similar to tossing a coin or a die. The rule could be: on whitespace, increase the learning-rate parameter to escape from zero-gradient points (a toy sketch of this rule follows below).

Caveat emptor: this is just speculation; I don't have any data to support this hypothesis. It also suggests that whitespace could be a "token that reflects the state of previous layers" and is not contained in the vocabulary used to train the model, so I should say that whitespace is a macro-token or neurotoken. If this hypothesis has any ground, it could also be plausible that whitespace is an enumerated neural tag, in the sense that the length of the whitespace reflects, or is related to, the layer in which the zero-gradient or branching point occurs.

Finally, my throwaway user needs whitespace, so I will change the password to a random one to force myself to stop adding new ideas.
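Here is the toy sketch of that rule, purely as an illustration of the speculation above (the whitespace token id, the boost factor, and the tiny stand-in model are all made up):

    import torch
    from torch import nn

    WS_ID = 220                    # assumed id of a whitespace token in some tokenizer
    BASE_LR, BOOST = 1e-3, 10.0    # speculative: boost the LR at a "branching" whitespace

    model = nn.Linear(16, 32)      # tiny stand-in for a language-model head
    opt = torch.optim.SGD(model.parameters(), lr=BASE_LR)

    def rl_step(hidden, target, last_token_id):
        # The speculated rule: if the last emitted token is whitespace, treat it as a
        # branching / low-gradient tag and temporarily raise the learning rate.
        lr = BASE_LR * BOOST if last_token_id == WS_ID else BASE_LR
        for group in opt.param_groups:
            group["lr"] = lr
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(hidden), target)
        loss.backward()
        opt.step()
        return loss.item()

    # usage: rl_step(torch.randn(4, 16), torch.randint(0, 32, (4,)), last_token_id=WS_ID)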
More tokens = fewer options for the final string. It's not any more complicated than that, and it doesn't require any reasoning, just an autoregressively trained statistical model of text. But no, it has to be "the model thinks harder if it outputs more tokens".
If the base models already have the “reasoning” capability, as they claim, then it’s not surprising that they were able to get to SOTA using a relatively negligible amount of compute for RL fine-tuning.
I love this sort of “anti-hype” research. We need more of it.
The article starts by saying:
"DeepSeek-V3-Base already exhibit 'Aha moment'."
I tried to read the screenshot they present as evidence of this, and indeed it does say "Aha!". But both the preceding reasoning and the following conclusion look like gibberish to me. I'm not sure what we're supposed to conclude here and I gave up reading the article after this inauspicious start.
So they achieved R1-Zero-like performance, without those long CoTs that sometimes never end and hurt inference time, with a fraction of the fine-tuning resources?
No, they still have "<think>", but it's shorter, which they get by removing part of a term.
That's what I mean: currently those CoTs just go on and on until you run out of context.
I'm not sure if you're talking conversationally and I'm taking it as a technical query, or you're saying CoTs never terminate for you and asking for input, or asking what the paper implies about CoT, or relaying that you understand the paper's claim that this method reduces net CoT length.
Verbose, non-terminating CoT is currently a common problem with open-weight models based on R1-Zero methods.
It seems that this shift of cost to inference time is a necessary tradeoff we have to live with, at least for now.
It's more of a problem for many people running those models locally because they have constrained hardware that can't handle those long contexts.
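As a back-of-the-envelope for why that bites locally (all of the model numbers below are hypothetical placeholders, not any specific model): the KV cache alone grows linearly with context length.

    # Rough KV-cache memory estimate: 2 (keys + values) per layer, per KV head,
    # per head dimension, per byte of precision, per token of context.
    def kv_cache_gib(context_tokens: int, layers: int = 32, kv_heads: int = 8,
                     head_dim: int = 128, bytes_per_value: int = 2) -> float:
        return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens / 2**30

    print(f"{kv_cache_gib(4_000):.1f} GiB at 4k context vs {kv_cache_gib(32_000):.1f} GiB at 32k")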
It seems that this paper shows not only that their method is cheaper in terms of fine-tuning, but also that it significantly reduces the inference-time cost of CoTs.
If what they say gets confirmed, it looks to me like quite a significant contribution?
I mean, anybody who understands the maths knows that there is no real reasoning in these models. The only ones hyping this up are VC bros who want a return on their investment.