Overall the industry needs more review and less hype. I was shocked to find out SWE-bench Verified [0] is all but verified.
[0] The benchmark used by all major vendors to "showcase" coding ability turns out to be <10% properly solved: https://www.youtube.com/watch?v=QnOc_kKKuac
Failure modes are also interesting for showing what is and isn't really happening. Take the test of asking GenAI to draw clocks showing specific times, or people drawing with their left hand. All you get are clocks showing ten past two and people drawing with their right hand, because that's 99% of what's in the training data.
As Sabine says, if LLMs have already read every math book in the world but still can't do basic math without calling a calculator, how much reasoning is really emerging?
"The Path to AGI is Coming Into View": https://youtu.be/mfbRHhOCgzs?t=219
I would offer an alternative possible explanation.
Having trained quite a few LLMs by now, especially around the “uplift” from a text-completion model to an instruct one, I've noticed that instruction-following capability tends not to be uniform across all the tasks the LLM is able to perform.
In other words, the model doesn’t know what your request (the time on the clock) implies for the output (the SVG drawing of the clock), or any of the abstractions and steps in between; that is why changing the time in the prompt might yield little difference.
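To make that chain concrete, here is a minimal sketch (just an illustration I'm writing here, not anything a model actually executes; clock_svg is a made-up helper) of what "time in the prompt → change in the SVG" involves: the requested time has to become hand angles, and the angles have to become drawing coordinates.

    import math

    def clock_svg(hour: int, minute: int) -> str:
        # step 1: the requested time must be turned into hand angles
        minute_angle = minute * 6                      # 360 degrees / 60 minutes
        hour_angle = (hour % 12) * 30 + minute * 0.5   # 360 / 12, plus drift from minutes

        # step 2: the angles must be turned into line endpoints on the canvas
        def hand(angle_deg: float, length: float, width: int) -> str:
            rad = math.radians(angle_deg - 90)         # 12 o'clock points straight up
            x = 50 + length * math.cos(rad)
            y = 50 + length * math.sin(rad)
            return (f'<line x1="50" y1="50" x2="{x:.1f}" y2="{y:.1f}" '
                    f'stroke="black" stroke-width="{width}"/>')

        return ('<svg viewBox="0 0 100 100" xmlns="http://www.w3.org/2000/svg">'
                '<circle cx="50" cy="50" r="45" fill="none" stroke="black"/>'
                + hand(hour_angle, 25, 3) + hand(minute_angle, 38, 2) + '</svg>')

    print(clock_svg(3, 45))  # a time other than the ubiquitous 10:10; both hands must move

Nothing in that chain is hard, but every step has to be represented somewhere for the change in the prompt to propagate all the way to the drawing.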
Instruct datasets are quite small in retrospect and hardly cover the range of tasks instruct LLMs are asked to perform; they mostly rely on the LLM extrapolating (interpolating?) from that small training set. And it is this generalization that is not evenly distributed.
Just my two cents from my modest observations, sprinkle [citation needed] everywhere.
I'm not sure what Sabine means. It is a somewhat obvious category error, and fundamentally mistaken regardless. (I find it hard to believe, for example, that Sabine would beat an LLM on a random selection of ten 3-digit-by-3-digit multiplication problems, to be completed in 60 seconds max by either party.)
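For concreteness, the test I have in mind is something like the sketch below (the 60-second budget and problem sizes are just the ones from my example; the harness itself is purely hypothetical):

    import random, time

    # Ten random 3-digit x 3-digit multiplications, to be answered within 60 seconds.
    random.seed(0)
    problems = [(random.randint(100, 999), random.randint(100, 999)) for _ in range(10)]

    start = time.time()
    for a, b in problems:
        answer = a * b   # the contestant (human or LLM) must produce this without a calculator
        print(f"{a} x {b} = {answer}")
    print(f"done in {time.time() - start:.2f}s (budget: 60s)")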
Or overflowing wine glasses.
Numeracy isn't mathematical reasoning.
I've seen the same "Superficial Self-Reflection" mentioned in their linked blog post [0] as well, where the conclusion doesn't naturally follow from the thinking tokens. I think people are fooled by this, but if you take the time to inspect the "chain of thought" tokens, they often don't match the final answer.
I don't deny that performance on certain logic tasks goes up with these models, but I don't fully understand what role the thinking tokens play in those cases.
[0] https://oatllm.notion.site/oat-zero
I heard that even just getting the model to print a bunch of whitespace ("think for longer") improves the quality of the final response, because some kind of processing is still happening internally?
Here's one where just appending dots to the output improved quality:
https://arxiv.org/abs/2404.15758
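A rough sketch of that flavor of experiment, in the spirit of the paper rather than its actual setup (the paper trains with filler tokens; here `generate` is just a placeholder for whatever model call you have, so this is only prompt-level padding):

    # Compare a plain prompt against the same prompt padded with meaningless dots.
    # `generate` is a stand-in for any callable that maps a prompt to a completion.

    def without_filler(question: str) -> str:
        return f"{question}\nAnswer:"

    def with_filler(question: str, n_dots: int = 50) -> str:
        return f"{question}\n{'.' * n_dots}\nAnswer:"

    def compare(generate, question: str) -> tuple[str, str]:
        return generate(without_filler(question)), generate(with_filler(question))

    # usage: plain, padded = compare(my_generate_fn, "Is 7 * 13 greater than 90?")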
Could it be that the model just uses the latent space for thinking while generating near-garbage tokens? It would be interesting to check whether appending something repetitive to the end of the prompt helps, i.e. whether the model uses it for 'thinking'.
Latent space has a certain shape, which may mean I'm missing a technical distinction.*
There have been publications on a pause token (https://arxiv.org/abs/2310.02226), a backspace token (https://arxiv.org/abs/2306.05426), and a think token (https://arxiv.org/html/2405.08644v1#Ch0.S4), all based on the theory that a generic token can act as a placeholder for manipulating attention further without producing further meaningful output.
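Mechanically, the pause-token variant boils down to something like this sketch (Hugging Face-style calls with a small stand-in model; the papers' actual training recipes and placement strategies differ):

    # Add a dedicated <pause> token and interleave it into training text, so the
    # model can spend forward passes on it without emitting meaningful output.
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_name = "gpt2"  # small placeholder model, purely for illustration
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    tokenizer.add_special_tokens({"additional_special_tokens": ["<pause>"]})
    model.resize_token_embeddings(len(tokenizer))  # give the new token an embedding row

    def insert_pauses(text: str, every: int = 8, n_pauses: int = 2) -> str:
        # naive interleaving: a burst of <pause> tokens after every `every` words
        words, out = text.split(), []
        for i, w in enumerate(words, 1):
            out.append(w)
            if i % every == 0:
                out.extend(["<pause>"] * n_pauses)
        return " ".join(out)

    batch = tokenizer(insert_pauses("The quick brown fox jumps over the lazy dog once more."),
                      return_tensors="pt")
    # ...train on `batch` as usual; at inference time, <pause> outputs are simply discarded.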
However, in practice those approaches haven't been used in the training of a large-scale model, i.e. I haven't seen it at all; the most adventurous thing people have done at scale is Mamba (and RL).
* "Latent space" had a particular technical meaning. The first round of the telephone game was when it came to mean "an N-dimensional space, analogous to a space with 3 spatial dimensions, inside an image diffusion model, that contains all possible image styles and is navigated by a prompt." We're many iterations afield of that now, I'm afraid. Now you sort of have to interpret it like you would negative space: defined by what is around it.
Printing a bunch of whitespace is a way of entering a new state (I am thinking of a state machine), so the LLM can use that whitespace as a new token that can later be used to refine the state of the system. In math terms, whitespace is a tag for a class (or state) in the LLM.

I think RL could perhaps take advantage of such tags. For example, whitespace could indicate a point of low gradient (indetermination) or a branching point, and the LLM would in some way learn to raise the learning-rate parameter; the message in the head of the LLM would be: be ready to learn from RL, because in your current state you need to take a branch from a branching point that can enhance your capabilities. This is similar to tossing a coin or a die. The rule could be: on whitespace, increase the learning-rate parameter to escape from zero-gradient points (a toy sketch of this rule follows below).

Caveat emptor: this is just speculation; I don't have any data to support this hypothesis. It also suggests that whitespace could be a "token that reflects the state of previous layers" and is not contained in the vocabulary used to train the model, so I should say that whitespace is a macro-token or neurotoken. If this hypothesis has any ground, it could also be plausible that whitespace is an enumerated neural tag, in the sense that the length of the whitespace reflects, or is related to, the layer in which the zero-gradient or branching point occurs.

Finally, my throwaway user needs whitespace, so I will change the password to a random one to force myself to stop adding new ideas.
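Here is the toy sketch of that rule, purely as an illustration of the speculation above (the whitespace token id, the boost factor, and the tiny stand-in model are all made up):

    import torch
    from torch import nn

    WS_ID = 220                    # assumed id of a whitespace token in some tokenizer
    BASE_LR, BOOST = 1e-3, 10.0    # speculative: boost the LR at a "branching" whitespace

    model = nn.Linear(16, 32)      # tiny stand-in for a language-model head
    opt = torch.optim.SGD(model.parameters(), lr=BASE_LR)

    def rl_step(hidden, target, last_token_id):
        # The speculated rule: if the last emitted token is whitespace, treat it as a
        # branching / low-gradient tag and temporarily raise the learning rate.
        lr = BASE_LR * BOOST if last_token_id == WS_ID else BASE_LR
        for group in opt.param_groups:
            group["lr"] = lr
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(hidden), target)
        loss.backward()
        opt.step()
        return loss.item()

    # usage: rl_step(torch.randn(4, 16), torch.randint(0, 32, (4,)), last_token_id=WS_ID)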
More tokens = fewer options for the final string. It's not any more complicated than that, and it doesn't require any reasoning, just an autoregressively trained statistical model of text. But no, it has to be "the model thinks harder if it outputs more tokens".
If the base models already have the “reasoning” capability, as they claim, then it’s not surprising that they were able to get to SOTA using a relatively negligible amount of compute for RL fine-tuning.
I love this sort of “anti-hype” research. We need more of it.
The article starts by saying:
"DeepSeek-V3-Base already exhibit 'Aha moment'."
I tried to read the screenshot they present as evidence of this, and indeed it does say "Aha!". But both the preceding reasoning and the following conclusion look like gibberish to me. I'm not sure what we're supposed to conclude here and I gave up reading the article after this inauspicious start.
So they achieved R1-Zero-like performance, without those long CoTs that sometimes never end and hurt inference time, with a fraction of the fine-tuning resources?
No, they still have "<think>", but it's shorter, which they get by removing part of a term.
That's what I mean: currently those CoTs just go on and on until you run out of context.
I'm not sure if you're talking conversationally and I'm taking it as a technical query, or you're saying CoTs never terminate for you and asking for input, or asking what the paper implies about CoT, or relaying that you understand the paper's claim that this method reduces net CoT length.
Verbose, non-terminating CoT is currently a common problem with open-weight models based on R1-Zero methods.
It seems that this shift of cost to inference time is a necessary tradeoff we have to live with, at least for now.
It's more of a problem for many people running those models locally because they have constrained hardware that can't handle those long contexts.
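As a back-of-the-envelope for why that bites locally (all of the model numbers below are hypothetical placeholders, not any specific model): the KV cache alone grows linearly with context length.

    # Rough KV-cache memory estimate: 2 (keys + values) per layer, per KV head,
    # per head dimension, per byte of precision, per token of context.
    def kv_cache_gib(context_tokens: int, layers: int = 32, kv_heads: int = 8,
                     head_dim: int = 128, bytes_per_value: int = 2) -> float:
        return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens / 2**30

    print(f"{kv_cache_gib(4_000):.1f} GiB at 4k context vs {kv_cache_gib(32_000):.1f} GiB at 32k")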
It seems that this paper shows not only that their method is cheaper in terms of fine-tuning, but also that it significantly reduces the inference-time cost of CoTs.
If what they say gets confirmed, it looks to me like quite a significant contribution?
I mean, anybody who understands the maths knows that there is no real reasoning in these models. The only ones hyping this up are VC bros who want a return on their investment.