> it is clear that actual intelligence has plateaued significantly.
> Moving forward, the industry cannot continue to train bigger and bigger models since their intelligence not only plateaus but often will get worse
These are wild claims - why are we concluding that bigger models and more data = more hallucination? That’s actually the opposite of what’s been happening over the last couple years. Some models may still hallucinate more but they all hallucinate much less than the original 175B ChatGPT which was smaller and trained on (much) less data than anything current.
Edit: My mention of data comes from this quote:
> A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling
My take on the current situation: it seems clear that the industry has seen that there is still a lot left to squeeze out of sub-1T models. But for that you do need more, high-quality data in the distribution which you want to unlock capabilities for.
Yeah not only is it totally unsubstantiated, the benchmarks are getting less useful to really show the difference between these models. Big model smell is still a thing and GLM 5.2 while impressive is not Fable class.
Here is something I would like people to chew on. Perhaps the smartest researchers in the world across multiple labs know more about this than we do? Perhaps they are aware of issues like the data wall and diminishing marginal returns. And perhaps they are being honest when they tell you there is no wall?
> why are we concluding that bigger models and more data = more hallucination?
That’s not what your quotes said. They said bigger models = plateau in intelligence, nothing about more data or increased hallucinations
The relevant quote for what you’re talking about would be:
> It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer.
So there’s two separate claims:
1) bigger models have plateauing results
2) models trained on larger amounts of factual data have a higher hallucination rate
I’m pretty sure #1 is well known; I think OpenAI’s own research on scaling laws showed diminishing returns on parameter count and training data volume years ago. I don’t know what the support for #2 is besides the post itself.
I find these internet arguments talking about LLMs as if they are trained by reading the internet to be wild.
Yes, pretraining still exists. But for the past few years, pretraining by reading the internet is just the initial bootstrapping of LLM training. The RL training they get from bespoke training data, with very very different characteristics than what these armchair analyses claim, dominates these days.
I'd have to imagine there are wildly diminishing marginal returns to additional SFT/post-training passes.
There are a bounded number of (useful) derivations/combinations of Duff's device.
If Frontier Labs wish to reduce hallucinations on factual things, they will have to hire people (or the data providers will need to) to do fundamental research above and beyond what is available in extant literature and the web. IE if the LLMs want to lower precision error, they need to go out and actually find more expertise. If the wikipedia page for Pompey lacks data, where are they going to get it from? How would they even _identify_ that the page has holes?
Yes, they can digitize more books but that is untrustworthy data - if there were enough eyeballs on a particular work, it would be in the internet. If it's not, they'd need to hire the experts themselves. They need expert reviewers in virtually every interesting topic, which fundamentally is an intractable problem, especially since things change all the time. Maybe even uninteresting topics, too?
I dunno, it doesn't seem to me "more data" is the magic bullet here. Yeah, it will "help" but we're already on the flat part of the S shaped curve.
My take from trying to understand this stuff is some sort of algorithmic improvement is necessary to get another step change in how well LLMs perform in this area. I could be wrong!
As a side gig, I write novel software that solves problems no existing software does, that existing LLMs have difficulty reproducing, purely for the purpose of existing as LLM training data.
There are journalists being hired to write Atlantic-worthy articles that exist only as LLM training data, because they're getting paid more than the Atlantic would pay them for it.
It's insane.
Yes, they are hiring the experts themselves. To create new knowledge above and beyond what's on the internet. To be locked away as LLM training data.
The largest characteristic of all of this new data is it is targeted at LLM's weak points.
It's not just more data, it's custom tutorials built for what LLMs struggle at.
"As a side gig, I write novel software that solves problems no existing software does,"
and
"Yes, they are hiring the experts themselves. To create new knowledge above and beyond what's on the internet. To be locked away as LLM training data."
I'm not saying they are not trying - I'm saying we're inventing new problems faster than any Lab can:
1) Identify the gaps
2) Determine how to fix them
3) Implement a fix (especially if that fix is: identify and find experts)
4) And judge the result
How do they know [person] is an expert in [some field]? How do they find that person? How many experts are necessary to give the right information? How do we evaluate the results, especially if it's novel?
You can find a lot of people who disagree on many topics, and those turtles go all the way down.
I'm not in disagreement that your work will help reduce hallucinations and improve model performance! It will.
I predict (I hope I'm wrong!) that we're going to hit some asymptote that is not at 0% hallucinations (and I would even put a substantial nonzero probability that "overall" hallucination rate bottoms out at some minimum and then slowly grows because we just can't keep up with the new garbage we throw at it).
> How do they know [person] is an expert in [some field]? How do they find that person?
They have a PhD from a top school, they are a licensed attorney, they are a licensed physician, a board certified cardiologist, etc.
They are constantly recruiting from these populations with well-paying side gigs.
> 4) And judge the result
That's what they pay the experts for. And to have experts review the other experts with peer review.
> You can find a lot of people who disagree on many topics, and those turtles go all the way down.
Which is why everything has to be well-calibrated and not just a hot take - a well reasoned opinion any expert would find fair.
No one really cares about hallucinations on point facts these days though; it is much more about complex reasoning tasks. Can they move the bar on the complexity of software LLMs do on their own? Can they get to a point where LLMs can begin to replace physicians? Financial advisors? Actuaries? etc.
> No one really cares about hallucinations on point facts these days though; it is much more about complex reasoning tasks.
The boundary is pretty thin there though. E.g., Gemini recently told me that a certain paper claims that two frameworks are mathematically equivalent, while the paper shows the opposite, and yesterday Google's AI overview told me that no World Cup matches were scheduled for that day despite there being several of them. The model probably used complex reasoning to arrive at both (incorrect) answers, but superficially they look like basic errors of fact.
That is a great example of the kind of thing they're paying people to create as training data.
You write the prompt, and then write rubrics to judge the responses, and you found something the model failed at. Congratulations, you just earned $500, now do it again.
I've done Mercor and other brands - the contracts move around, since the labs want the vendors to know they're just vendors and have to compete with each other. It seemed to be roughly resume and interview similar to getting hired at a senior role at FAANG or adjacent.
Mercor, one of the larger vendors for contracting with experts to create bespoke data, says on their webpage they're paying $3M/day to their contractors for data.
So well into the billions of dollars a year for bespoke training data.
That's also ignoring the RLVR data labs can get from software - they can use the vibe coding sessions as training data as well without paying more.
Offhand, do you know what format that data is in? Is it a question and then a human answering that question? Mostly just curious as to what the training data consists of.
The most advanced training data is in the form of rubrics as rewards.
A human asks a question, then writes rubrics to judge the LLM's response. Rather than evaluating one specific response, those rubrics can live on as the LLM evolves and gives different answers. There are more complex variants as well, but that's the basic principle.
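To make the rubric idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `RubricItem` type, the `rubric_reward` function, the weights, and the string-matching checks are illustrative assumptions, not any lab's or vendor's actual format. In practice the checks would more likely be a judge model or a human grader than a substring match.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RubricItem:
    """One criterion a response is judged against (hypothetical structure)."""
    description: str
    weight: float
    check: Callable[[str], bool]  # stand-in for a judge model or human grader


def rubric_reward(response: str, rubric: list[RubricItem]) -> float:
    """Return the weighted fraction of rubric items the response satisfies."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total


# Hypothetical rubric for "Are any World Cup matches scheduled today?"
rubric = [
    RubricItem("Does not claim that no matches are scheduled", 2.0,
               lambda r: "no matches" not in r.lower()),
    RubricItem("Attributes the answer to a source", 1.0,
               lambda r: "according to" in r.lower()),
]

bad = "No matches are scheduled today."
partial = "Three matches are played today."
good = "According to the official schedule, three matches are played today."

for text in (bad, partial, good):
    print(f"{rubric_reward(text, rubric):.2f}  {text}")
```

The point of the design is that the rubric is written once per prompt and can score any future response, so the same data stays usable as the model's answers change.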
> That’s not what your quotes said. They said bigger models = plateau in intelligence, nothing about more data or increased hallucinations ... I’m pretty sure #1 is well known
Well known in a multiverse branch where Fable was a dud?
No, well known in the current multiverse branch where we still occasionally use things like math and scientific analysis instead of people’s vibe checks and pelican SVGs.
> We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count N, dataset size D, and optimized training computation C_min, as encapsulated in Equations (1.5) and (1.6). Conversely, we find very weak dependence on many architectural and optimization hyperparameters. Since scalings with N, D, C_min are power-laws, there are diminishing returns with increasing scale.
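A quick sketch of what "power law, therefore diminishing returns" means numerically. It assumes a loss of the form L(N) = (N_c / N) ** alpha; the function name and the constants `n_c` and `alpha` below are placeholders chosen for illustration, not values taken from this thread.

```python
def power_law_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Loss under an assumed power law L(N) = (N_c / N) ** alpha.

    n_c and alpha are illustrative placeholder constants.
    """
    return (n_c / n_params) ** alpha


# Parameter counts from 1B to 1T, each 10x larger than the last.
sizes = [1e9, 1e10, 1e11, 1e12]
losses = [power_law_loss(n) for n in sizes]

# Absolute improvement bought by each successive 10x increase in size.
gains = [before - after for before, after in zip(losses, losses[1:])]

for n, loss in zip(sizes, losses):
    print(f"N = {n:.0e}  loss = {loss:.3f}")
print("gain per 10x:", [round(g, 3) for g in gains])
```

Under this form, every 10x in parameters multiplies the loss by the same factor, so the absolute gain shrinks at each step while the cost grows tenfold. The loss keeps falling, though, which is why the same curve gets read both as "no wall" and as "plateau".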
Right, what happened is everyone went to Fable and asked it to make the very best bicycle pelican SVG, no mistakes. And Fable's bicycle pelican SVGs were such timeless masterpieces, we all instantly got AI psychosis. Happily, you were immune to this.
Yeah #2 may be incidental. Suppose one lab focused on bigger, and another on reinforcement training geared towards factual accuracy over sycophancy. You could easily wind up with a model from the second lab that is less powerful but more accurate.
I can’t prove it but I suspect there’s a bit of that going on.
I think one problem is that models can hallucinate often, getting it right only a few times out of 8 or 16, and still get good results on benchmarks, most of which measure success out of the top k. From a benchmark perspective, you don't really care whether 15 of your 16 generations failed, as long as one succeeded, but as a user you mostly care that the 1 out of 16 you get is actually the successful one. I think this effect is easier to see on Gemini Flash: it hallucinates like crazy, but it looks like that's by design to boost benchmarks.
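The gap between "one of k samples succeeded" and "the one sample I got succeeded" can be sketched as follows. This assumes each generation succeeds independently with the same probability, which is a simplification; the function name and the 1-in-16 figure are illustrative, not measurements of any model.

```python
def pass_at_k(p_success: float, k: int) -> float:
    """Chance that at least one of k independent samples succeeds."""
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p_success) ** k


p = 1 / 16                      # assumed per-generation success rate
single = pass_at_k(p, 1)        # what a user sees from one generation
best_of_16 = pass_at_k(p, 16)   # what a best-of-16 benchmark rewards

print(f"single sample: {single:.1%}, best of 16: {best_of_16:.1%}")
```

Under these assumptions a model that is right about 6% of the time per attempt still clears a best-of-16 criterion well over half the time.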
Aren't hallucinations also heavily influenced by compute and memory capacity? IE. Companies can spend more time to verify results in an agentic format, spend more thinking tokens, and less quantization. All of these heavily depend on compute and memory but are proven to decrease hallucinations.
Maybe GPT 5.5 is heavily nerfed due to lack of compute, memory, and energy?
I agree that it's farfetched to conclude that bigger models have plateaued.
> A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling
I'm pretty sure it's mostly due to the training data quality. No idea why this never gets mentioned in those discussions.
It was obvious right from the get-go that the scaling law just enabled some abilities that were described by the underlying data, allowing the ANN to abstract them in the latent space.
What is the definition of "actual intelligence"? How does it differ from regular intelligence and non-intelligence?
If someone can "design a custom asyncio event loop policy in that overrides get_child_watcher()", I would call that person intelligent. Does that mean that person is not actually intelligent but a mere content creation machine?
Traditionally if you can create content, this shows you're intelligent. Created content is often called "intellectual" property. If a person can understand complex ideas and make connection between them, that is considered intellectual work. You have to be intelligent to do intellectual work. If a person can solve problems, this is also called intelligence. If the person can solve more complex problems, that person is said to have higher intelligence. This is often measured with a scale called IQ (Intelligence Quotient). There are other types of intelligence but they are basically the variations of the same ability. Most definitions of intelligence also involve an ability to adapt into the environment.
Since intelligence is such a broad concept what exactly is the difference between the actual intelligence and AI, other than one is natural and the other one is artificial?
I understand being anti-AI because of the very real societal concerns. But ignoring what is in front of you is not a solution.
to train models to be smarter than they are, one needs examples and cases to train on, and once you get close to the top percentiles of human reasoning there is extremely little such material available.
You can create contrived logic problems, but they often turn into language games because English is not formal logic.
And you can train on "monty hall" style problems, but those too are language games that are intriguing to humans but obvious when framed slightly differently.
In other words, model trainers are fighting against the overwhelming mediocrity of the training corpus (all of the recorded human output from history).
As models improve, the next phase will be models co-designed with humans to overcome these limits. The way we use language and the process we use to problem solve (we currently call this "orchestration") will evolve as part of this. Meatspace metaphors map badly when we have massive context and don't need the same limits. How different is hallucination from extrapolation, etc.
Much of the skepticism and confusion about LLMs is no different than a person of average intelligence hearing a highly intelligent person explain something and considering the explanation gibberish, then arrogantly accusing the intelligent person of being unhelpful.
Much like dogs were domesticated from wolves to have traits that make them good around humans, LLMs will evolve around our limits, around our arrogance, around our aesthetic biases and prejudices. Intelligence and rationality is fundamentally not what most humans want from an LLM.
My impression is that the fundamental issue is that LLMs attempt to extract reasoning (executive function) from data (relationships between tokens).
There's an open question about whether this is theoretically possible, but it doesn't seem like it to me.
Human generated data is an effect of reasoning. Attempting to extract executive function from it is kind of like taking an anti-derivative of a function.
This has always seemed like the root of hallucinations to me. It sort of follows the parallels to lossy compression that a lot of people draw. You're extracting some characteristics by observing the relationship between tokens, and then trying to argue that those characteristics are equivalent to the thing that generated the original tokens.
Surely there's some sort of overlap there, but viewed that way, it seems obvious that more and more parameters and scaling won't solve the fundamental problem. There's only so much meaning you can extract from token relationships.
It's like trying to derive the shape of a flame from the smoke it produces.
The original intelligence that created those tokens was driven by a whole universe of inputs, from hormones to starlight to gravity, not to mention all of the strange things about consciousness and parapsychology that is so poorly understood.
The machines are definitely useful for a certain class of tasks - those that don't require much executive function, and the useful work mostly involves pattern matching.
The problem is, we seem to be mistaking effect for cause and imagining that these things have greater capabilities than they'll ever possess.
The investors that don't understand this are indeed going to learn a bitter lesson.
In cognitive science, it appears your brain has two modes of thinking:
- A very parallel type of computation that is fast and generally accurate and integrates hundreds of variables. It’s sometimes labeled as intuition or system 1 thinking.
- A much slower, step by step, analytical type, commonly linked with your pre-frontal cortex (one of the newest parts of the brain). Sometimes called system 2 thinking.
Maybe the way the universe works is that all computation more or less is one of those two types. In which case, an LLM alone is only the first part, which is often right but its results also cannot ever be proven.
We inflicted that on ourselves by picking the most confusing terminology ever. "No, reasoning isn't thinking. No, when the model says it thinks it's not actually thinking... No, an agent isn't actually a creature with agency... No, when we say it hallucinates it doesn't, like, actually hallucinate"
Artificial Analysis says GPT-5.5 xhigh scores highest on AA-Omniscience accuracy. The article focuses on rate instead of overall accuracy. Those are different things: a model can answer more questions correctly overall while still being worse at abstaining when wrong.
Curiously, this post and article are the only submission and interaction the OP has made, and these claims support the product he's intending to release.
Hallucination rate scores are a little tricky to interpret because they're conditional on the model not knowing the answer. That means they don't measure the probability of your encountering a hallucination in everyday use, since that also depends on the probability of the model not knowing the answer, as well as how well your distribution of tasks aligns with the distribution tested in the eval.
I'd also hesitate to attribute this difference in hallucination rates purely to model size. Yes, GLM-5.2 hallucinates much less frequently than DeepSeek-V4 Pro with twice as many parameters, but DeepSeek-V4 Flash is less than half the size of GLM-5.2 and tops the AA-Omniscience hallucination index. Opus 4.8, which is likely larger than DeepSeek-V4 Pro, has a 36% hallucination rate on the index, above GLM-5.2's 28%, but way below the DeepSeek numbers. Opus also has a 47% accuracy rate vs GLM-5.2's 25%. If you use these numbers to calculate the absolute hallucination rate (i.e., the number of hallucinated responses divided by the total number of responses), you get 19% for Opus and 21% for GLM-5.2.
So yes, all else equal larger models may be more prone to hallucination in scenarios where they don't know the answer, but there are a lot of other factors that affect hallucination rates, and it's not totally clear that this is the main metric that's worth tracking.
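The arithmetic behind the 19% and 21% figures above can be reproduced with a short sketch. It assumes, as the comment does, that the published hallucination rate is conditional on the model not answering correctly, so the absolute rate is (1 - accuracy) times the conditional rate. The function name is made up for illustration; the input numbers are the ones quoted in the comment.

```python
def absolute_hallucination_rate(accuracy: float, conditional_rate: float) -> float:
    """Hallucinated responses as a share of ALL responses.

    conditional_rate is the share of non-correct responses (wrong or
    abstained) that were wrong answers rather than abstentions.
    """
    return (1.0 - accuracy) * conditional_rate


# Figures quoted in the comment above (accuracy, conditional hallucination rate).
models = {
    "Opus 4.8": (0.47, 0.36),
    "GLM-5.2": (0.25, 0.28),
}

absolute = {name: absolute_hallucination_rate(acc, rate)
            for name, (acc, rate) in models.items()}

for name, value in absolute.items():
    print(f"{name}: {value:.0%} of all responses are hallucinations")
```

So the model with the higher conditional rate ends up with the slightly lower absolute rate, because it has far fewer questions left over to get wrong.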
I’m not disagreeing with you but at the same time, models don’t “know” anything in that binary sense. I’m not trying to get into the weeds here, I genuinely mean that what you pass off as a simple explanation is actually incredibly nuanced. A fact appeared once in training data, a fact never appeared in the training data, a fact appeared ten times, a fact appeared a thousand times. Which does the model know? Facts aren’t stored as-is, they’re all broken down into their components and compressed in the weights. “Similar” facts that didn’t appear an overwhelming number of times get bundled together and eventually conflated. But then what is a similar fact? Which facts were entirely ablated vs which were bundled together with others effectively poisoning the pool but also giving it inference strength? The model doesn’t know anything and can never know what it knows or doesn’t know.
I often wonder how humans "know" things. I suspect (ignorant armchair) we have some ability to signal strength of those facts, via repetition. Without this layer of introspection I imagine LLMs can never avoid hallucination.
It obviously breaks down with humans too, given we so easily hallucinate and confuse things we "know". However I still suspect we're more reliable at probing information we've experienced vs not. Even in the case of poisoned knowledge, e.g. a crime scene accidentally implying information to a witness that the witness doesn't actually know, we still "know" that poisoned information via incorrect inference. I.e. we "experienced" it.
Wonder what architecture would allow for this style of information/weight probing for an LLM.
Additionally, maybe it's easier for a model to realize that it doesn't know the answer when the question is easier.
If Opus gets all but the hardest questions right, it might have a higher hallucination rate because the questions it gets wrong are the questions where verification or hallucination detection are the most difficult
I guess you can test that on hypotheticals. Ask about things after the knowledge cut off that never happened. Or ask things that are genuinely unsolvable.
Hallucination should be called "failure to ground".
Something about the cost model of US near frontier has the cattle prod out whenever a model is uncertain but thrashes on whether to search. Search flinch is roughly all hallucination.
I don't even wait for the model's turn, if there's a man page or Hoogle hit, stuff the last prefix cache cut point. You come out ahead.
This is missing a common failure mode, which is information past the knowledge cutoff. If you need info past that time they'll fail no matter how big or small the model is, so the hallucination rate can matter independently of the knowledge base. If all use-cases had a uniform risk of falling out of support, this would be a valid argument, but since it's often the case that a datapoint is guaranteed to fall out of support, the absolute ability to recognize that is crucial.
Those numbers are abysmal. Should we really be using LLMs to write our code? I have a theory- LLMs can spit out code that gets the job done and looks ok, maybe even great, but contains small “anomalies” that compound over time. An enterprise app developed entirely with LLM-happy devs might end up virtually unmaintainable.
I’m not sure how to explain it, but the more I see LLM-written code the more I feel it’s bad code doing a good job of masquerading as good code. I think this take will become less-hot in the next year or two when we see enterprise greenfield projects that were created entirely with LLM “assistance” go to prod. I think we’ll find that the code is difficult for humans to read, understand, debug, and extend- and I think the larger the codebase the harder it will be for LLMs to maintain. More opportunity for hallucination, larger context windows needed, more tokens bought and spent for smaller and smaller code changes. I think the more code an LLM writes for an app, the worse that codebase becomes.
I can't help but feel that people continually underestimate how bad human written code becomes over time. The exception is probably single-person passion projects or open source projects that maintain quality governance over time.
I strongly suspect most closed source code developed under commercial or internal pressure is pretty awful after a few years of development.
All LLM code has to do is suck less than existing code. And that's presuming the code quality doesn't improve as the models, the harnesses and our ways of working with them improve.
Sucky human-written code is still based on human understanding, which can change over time, be readjusted or solidified. People implement something wrong once, then update their perspective, then in the future do it right.
LLMs don't have this benefit. You forget to add the correction to the system prompt, and the LLM will repeat the same mistake over and over, and worse than that, their mistakes aren't based on their understanding, it's basically random guesses.
Humans, even bad coders, still seem to have some sort of architecture in mind, even if it's spaghetti, whereas LLMs (obviously) don't think more than a few steps, and never about the full scope of what they're contributing to, and on purpose too, because you want the context to be as small as possible when you work with LLMs.
With LLMs you need to tread carefully between "What does the LLM need to know?" and "Can I skip passing this to the LLM this time?", while with a human you can more or less dump everything you're sitting on and let them sift through it, and they'll mostly make it out OK.
> their mistakes aren't based on their understanding, it's basically random guesses
Whilst I don't claim any true "understanding" as that is a very loaded term that doesn't mean it's just random guesses.
Anyone using recent LLM coding agents on a regular basis would probably agree that there's something going on that fits some non-anthropomorphizing, non-sentience-assigning definition of "understanding"
As for the point about improvement - I think that's an orthogonal issue to the overall code quality. With regard to human codebases - there's plenty of scenarios that negate the improvement of individuals. We're comparing organizations with LLMs - not individuals with LLMs and that makes a significant difference.
I think the real issue might be that how “good” the code is matters less than being able to form a mental model for what the human who wrote the code was “thinking”. If written by a machine, this contract is broken and we get more confused, even if our traditional methods of evaluating the code come out equal.
I've been sent code from vendors that didn't even compile, long before llms were a thing. Most shops that aren't primarily software have really really terrible software.
Not my observation. If you never look at the code and don't have basic guardrails in place (linters, architecture tests, some guidelines for best practices) - probably.
But as soon as you do minimal reviews and high-level corrections, applications turn out just fine.
Can there be bugs? Sure. That's the price of not reading or understanding every line. It should depend on the criticality of your software how much of these you tolerate and how much you don't (reviewing, understanding, testing everything 100% like you were used to if you had written it yourself will kill most if not all of your gained speed)
But I never got the impression of unmaintainability or unfixable bugs.
Actually, the other way around: a really good cleanup pass, architectural changes, or bugfixes are seldom more than a few prompts and 2 hours away, provided your overall base is decent and you actually gave a fuck from the start.
Indeed, though I find the distribution is different.
The humans may skip unit tests and need reminding; the AI always writes unit tests once it's in AGENTS.md or whatever, but my experience* was that 5-10% of the time the LLM's attempt at a "test" would, instead of executing the code and examining the results, open the source code as a text file and run a regex to find/exclude certain substrings.
* At the start of this year, because Anthropic and OpenAI were both offering free trials. IDK how much things have changed since then, some things change fast in this domain, other things don't.
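For readers who haven't run into it, here is a small made-up illustration of the difference between a regex-on-source "test" and a test that actually runs the code. The `add` function, its bug, and both test helpers are invented for this sketch; they are not taken from any real agent output.

```python
import re

# A deliberately buggy implementation, kept as text so both "tests" can see it.
SOURCE = '''
def add(a, b):
    return a - b  # bug: should be a + b
'''


def regex_style_test(source: str) -> bool:
    """Pseudo-test: only checks that the source text contains the definition."""
    return re.search(r"def add\(a, b\):", source) is not None


def behavioral_test(source: str) -> bool:
    """Real test: executes the code and checks the result it produces."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["add"](2, 3) == 5


print("regex-style test passes:", regex_style_test(SOURCE))
print("behavioral test passes:", behavioral_test(SOURCE))
```

The regex check is green even though the function is wrong, which is why this kind of "test" is easy to miss in review: the suite looks complete and passes.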
I’ve been piloting LLMs for the past six months non stop and we’re at the point where formally verified models generated as an intermediate step between spec and code are very good value.
Riding the exponential means you have to update priors more often.
I have a theory that LLM generated code in a highly modular style (simple data, pure functions) will be easier to “recover” by a human team when the LLM gets muddled. So Haskell, basically.
> code that gets the job done and looks ok, maybe even great, but contains small “anomalies” that compound over time
They clearly are only assistants for the moment, you can use them to do work ... but only if you could do the said work yourself alone in the first place.
I would say "only if you can review said work yourself alone", rather than "do".
I'm an experienced developer, but I don't count myself as a web dev or a python dev; I can review the web and python stuff I get out of the AI (sometimes I need to ask the AI follow-up questions so I can find official documentation for what it did), but I can't write it.
If "eventually" counts, I can say I have "run" a marathon (I have walked that distance in one session, or if you don't like that verb I can sum all the various occasions I've run and that sum almost certainly exceeded 42.2 km before I finished school).
But the difference I allude to here is more like how "book reviewer" is a different job than "book author": yes, if you can review a book, you can also write one. Eventually.
Easy fix: Code's basically free now, so just pipe your errors straight into an LLM and get instant patches. Sure, the patches themselves are broken too, but no worries! just pipe those back in again. Code's disposable now, fresh code generated on every request.
On a more serious note, I think the problem will be the inability to handle/maintain the systems once they are too big and nobody has any idea what's inside of them or what they do.
Yeah, it’s so easy to generate code that you can do a whole codebase rewrite in a day.
Is this a good idea? Probably not—in the past we would only do that when the architecture was causing serious problems since it always has tons of behaviors that will accidentally not get carried forward, some of which are load bearing and will cause bugs.
Now we can do it in an afternoon and get the same long term bug behavior.
> Hallucination rate scores are a little tricky to interpret because they're conditional on the model not knowing the answer. That means they don't measure the probability of your encountering a hallucination in everyday use, since that also depends on the probability of the model not knowing the answer, as well as how well your distribution of tasks aligns with the distribution tested in the eval.
Do you have a cite for this?
If a human makes up some bullshit lie, I wouldn't accuse them of making it up only if they actually knew the correct answer. If you don't know, the only correct answer is I don't know. Any other answer is made up bullshit. Why is it only a hallucination if and only if the LLM contains the answer? If you make something up it's still wrong. It shouldn't matter if you could give the correct answer. You didn't, and invented some bullshit instead.
Follow up question, how can I apply this rule set to the next test I have to take? I'd love to be able to use "I didn't know" as the excuse for why I made something up.
edit:
> and it's not totally clear that this is the main metric that's worth tracking.
I don't know, the rate at which some model is willing to make up something feels useful. If the argument I see repeated on HN so much is that it's impossible to completely get rid of hallucinations; being able to choose a model that's less likely to invent some lie seems like a positive trait, no?
Either way, I'm happy to agree that a restrictive definition, where a lie doesn't count as a hallucination iff the model doesn't know the answer feels strictly, infinitely less useful than an exact error rate. What percentage of emitted tokens are misleading would be useful for me. Anyone know any group that's attempted to quantify the global error rate?
This isn't quite the point. When comparing two different models' hallucination rates, the denominator is different. The evaluation works more or less like this: for each question, the model has the option to answer or abstain, so there are three possible outcomes: the model answers and gets it right, the model answers and gets it wrong (hallucination), or the model abstains. The hallucination rate is (model answers wrong) / (model answers wrong or abstains). So if a model A has 50 correct answers, 20 incorrect answers, and 30 abstentions, its hallucination rate is 40%, while a model with 20 correct answers, 20 incorrect answers, and 60 abstentions has a hallucination rate of 25%, even though it hallucinated exactly the same number of times. This is why hallucination rate is incomplete as a metric: it says nothing about the accuracy rate.
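The worked example above is easy to check in a few lines. This is a sketch of the definition as described in this comment (wrong answers divided by wrong answers plus abstentions); the function names are made up, and the real eval's bookkeeping may differ.

```python
def hallucination_rate(correct: int, wrong: int, abstained: int) -> float:
    """Wrong answers as a share of the questions NOT answered correctly."""
    not_correct = wrong + abstained
    return wrong / not_correct if not_correct else 0.0


def absolute_error_rate(correct: int, wrong: int, abstained: int) -> float:
    """Wrong answers as a share of all questions asked."""
    total = correct + wrong + abstained
    return wrong / total if total else 0.0


model_a = dict(correct=50, wrong=20, abstained=30)
model_b = dict(correct=20, wrong=20, abstained=60)

for name, counts in (("A", model_a), ("B", model_b)):
    print(f"model {name}: hallucination rate {hallucination_rate(**counts):.0%}, "
          f"absolute error rate {absolute_error_rate(**counts):.0%}")
```

Both models give a wrong answer on 20 of 100 questions, yet the conditional metric scores them at 40% and 25%, because model A's extra correct answers shrink its denominator.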
I realized that people from india often show this kind of behaviour in my experience .
They superconfidently give you a wrong answer and walk away or even help by making things worse and then dissappear shrugging ..
Are you from india ?
When someone asks a question, if I don't know the answer; I say I don't know.
System 1 vs 2 doesn't really matter... I won't use an LLM that's willing to make up random shit. Equally I also won't work with a human who does that. Trust and confidence a system will function correctly is an important quality, in both humans and genai
Since models just output the most probable tokens and you can never accuse them of doing anything other than making it all up, I would like to see these tests run with a prompt that attempts to mitigate hallucination and finishes with something like: "Telling me that you don't have the relevant information or that the task is impossible is extremely useful to me and a valid answer", and see how much that changes the scoring - as well as the usefulness of the answers. There are so many skills like context7 that can be tweaked to improve these results as well.
In other words, you shouldn't choose the model that hallucinates the least without detailed prompting, since a well-crafted agents.md clause should go a long way to improving output, and almost certainly the top scoring order will be different. To the point that I don't find this type of raw comparison useful beyond maybe 'make sure you test that one with more explicit prompts'.
It's not that you're prompting it wrong. It's that you're judging the output against a standard (human intelligence) that just isn't relevant--no matter how much we want it to be and no matter how much the fluency of the output tricks us into thinking there's a human-like mind behind it.
Now granted, if the boat salesmen were pushing hard on the idea that the boat would fly and even put little wings on the side and I bought the boat I might get really angry when I found out that it didn't fly. And I might angrily storm into the salesroom yelling about how the design is defective. But if someone pointed out 'hey, it's a boat perhaps you should stick to sailing around in it and stop getting your undies in a bundle about it not flying' the correct response is probably to take a closer look, ignore the salesmen, and cruise around the lake. LLM's are quite handy at some things and have some weird limits. Learn the limits, enjoy your time at sea.
One thing I wonder about hallucinations, is that it seems on the surface that it is an easy problem for RLVR to target. Since you're already generating enormous amounts of reasoning traces which are verified by correct answers, just have "don't know" as an option as a valid answer, and on problems where none of the thousands of reasoning traces led to a correct answer, just promote the traces that led to the "don't know" answer as training data. Essentially teaching the model that "I don't know" is a valid answer.
Sam Altman himself had a blog post about this a while ago that seemed to suggest this thought, so I guess it's obvious to everyone. But if that is so I assume it's just not as easy in practice.
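A rough sketch of the trace-selection idea described above, purely as an illustration. The `Trace` type, the `ABSTAIN` string and the selection rule are assumptions made up here, not any lab's actual pipeline.

```python
from dataclasses import dataclass

# Hypothetical marker for an abstaining answer (an assumption for this sketch).
ABSTAIN = "I don't know"


@dataclass(frozen=True)
class Trace:
    reasoning: str
    answer: str


def select_training_traces(traces: list[Trace], verified_answer: str) -> list[Trace]:
    """Pick which sampled traces to promote as training data for one problem.

    If any trace reached the verified answer, keep those. If none did,
    promote the traces that abstained instead of the confidently wrong ones.
    """
    correct = [t for t in traces if t.answer == verified_answer]
    if correct:
        return correct
    return [t for t in traces if t.answer == ABSTAIN]


solved = [Trace("a", "42"), Trace("b", "41"), Trace("c", ABSTAIN)]
unsolved = [Trace("a", "40"), Trace("b", "41"), Trace("c", ABSTAIN)]

kept_solved = select_training_traces(solved, "42")
kept_unsolved = select_training_traces(unsolved, "42")
print([t.answer for t in kept_solved], [t.answer for t in kept_unsolved])
```

Note that this only rewards abstaining on problems the model never solved in the sampled batch, which is part of why it may be harder in practice than it sounds.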
Because nearly all benchmarks measure "accuracy" by giving you a point for a correct answer, and 0 points for everything else. If you have 100 questions you are 10% certain on, answering "I don't know" to all of those leads to 0 points, answering all of them as if you are confident leads to an expected value of 10 points. So that's what most AIs are trained to do
AA-Omniscience is the only AI benchmark I know of where randomly guessing gets you a lower average score than answering all questions with "I don't know"
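The incentive difference can be sketched with expected scores. This assumes a penalized score of +1 for a correct answer, -1 for a wrong one and 0 for abstaining, scaled to the range -100 to 100; that rule is an assumption for illustration, not the benchmark's published formula, and the function names are made up.

```python
def accuracy_score(correct: float, incorrect: float, abstained: float) -> float:
    """Accuracy-only scoring: a point for each correct answer, nothing else counts."""
    total = correct + incorrect + abstained
    return 100 * correct / total


def penalized_score(correct: float, incorrect: float, abstained: float) -> float:
    """Assumed penalized scoring: +1 correct, -1 incorrect, 0 for abstaining."""
    total = correct + incorrect + abstained
    return 100 * (correct - incorrect) / total


# 100 questions where the model is right 10% of the time if it guesses.
n, p = 100, 0.10
guess = dict(correct=n * p, incorrect=n * (1 - p), abstained=0)
abstain = dict(correct=0, incorrect=0, abstained=n)

print("accuracy-only:", accuracy_score(**guess), "vs", accuracy_score(**abstain))
print("penalized:    ", penalized_score(**guess), "vs", penalized_score(**abstain))
```

Under accuracy-only scoring, guessing beats abstaining (10 vs 0 expected points); under the assumed penalized rule, guessing is far worse (-80 vs 0).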
Maybe some extra buckets could be added like depending on whether the answer ought to be known. Or, quality of the justification. “I don’t know and here’s a good reason why” is much better than “idk.” Correctly identifying that something is fundamentally unknown/unknowable is probably better than a simply-correct answer, even, right?
"AA-Omniscience Index (higher is better) measures knowledge reliability and hallucination. It rewards correct answers, penalizes hallucinations, and has no penalty for refusing to answer. Scores range from -100 to 100, where 0 means as many correct as incorrect answers, and negative scores mean more incorrect than correct."
See, this, to me, seems obvious, but I’m sure it’s more challenging/complex than I can imagine (I am NOT an expert on AI in any way imaginable). But there has to be a solution. Just yesterday I was asking Gemini to tell me about a certain college professor, and it gave me a list of facts about them. And it was perfect. Then, out of curiosity, I followed up with “tell me more about him!” and it spit out several more bits of information about this person that were entirely hallucinated (e.g., gave them credit for writing papers they didn’t write, said they won awards that actually someone else won). I know this is all complex and certainly beyond my limited skill set, but goodness, we’ve got to get this figured out with so many people depending on and trusting these things nowadays. It’s quite scary.
I bet most of these issues are essentially system prompt/harness issues.
If your example had "Validate any details before sharing them with the user, with multiple sources" as the system prompt, it was using a model that is strong at following system prompts precisely and had access to some basic tools, then it'd spend maybe minutes more, but the answer would have been way more accurate.
But no, Google want "the new search results" (LLM hallucinations) to be on top, so we end up with "sounds plausible" answers instead of "Collection of evidence from reliable/semi-reliable" or similar, which sucks. We could have quality, but it's too expensive/slow, so we get slop instead, just to maximize for speed and convenience.
Errors multiply though, you might just get more plausible sounding errors than actual facts.
Like when agent 1 says X, agent 2 verifies it as Y and the original question ends up being some weird amalgamation of Z with additional ”this is really true” statements sprinkled on top.
I agree Google responses hurt more than help, but I’ve also gotten identical outcomes of 40min self-reasoning Opus threads (it’s less common obviously).
> Like when agent 1 says X, agent 2 verifies it as Y and the original question ends up being some weird amalgamation of Z with additional ”this is really true” statements sprinkled on top.
Yeah, seems what grounds agents right now is quite literally human thoughts and text, so if you're doing something like that, you really need to pass the original user prompt through the entire way, for every "child" to keep in mind the final thing, otherwise it does seem to spiral out of control.
The main problem here is that hallucination suppression doesn’t generalise. We can penalise models for incorrect answers on a wide range of questions, but this doesn’t lead to the emergence of a coherent worldview, which, coupled with logical abilities, is the only true remedy against hallucinations. With current architectures, hallucinations will likely persist on open-domain tasks forever.
> We can penalise models for incorrect answers on a wide range of questions, but this doesn’t lead to the emergence of a coherent worldview, which, coupled with logical abilities, is the only true remedy against hallucinations
I don't think anyone is trying to add "a coherent worldview" by reducing hallucinations, and I'm not sure how that could even realistically be the aim.
What people want, is for the models to stop giving confident answers that are clearly incorrect. Yes, it won't lead to "a coherent worldview", but it'll at least stop wasting people's time if the model said "You know what, what you said doesn't make sense / isn't clear, is what you mean .... ?" or even "I'm not sure" or "I don't know".
Currently, if you have the wrong starting point, ask the model to do something, they more often than not just go ahead and do that, misunderstandings or not. They seem optimized to never push back, unless you prompt for that, and most seem to favor "I'm just gonna assume X" rather than taking a step back and figuring out how to not assume. Again, unless you prompt against that behaviour/steering it into a different workflow.
I think the trouble is in the outputs of the LLM and how it's interpreted by the tooling. The output is a distribution of probabilities of all possible next tokens. Even if the probability of every token is very low, the output gets normalized so that the sum of all probabilities is 1. So after that step, it's hard to see if the model was strongly preferring certain tokens or if you're just looking at amplified noise.
Training an extra "don't know" token means you have to build a moat between every other token. Between "yes" and "no", you don't have a muddled noisy area where both "yes" and "no" have relatively high probabilities, you need a new peak where "don't know" is higher. Then you just have new muddled areas between "yes" and "don't know", and "don't know" and "no". That requires even more finesse to train another answer in between.
Instead, you could check whether multiple options are about equally likely. But then you have to check if they are actually synonyms, like are the top two choices "Genève" and "Geneva", which is a good sign that the model knows the answer? Or are the top two "yes" and "no"?
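A small sketch of both points, using only the standard library. It assumes the normalization in question is a softmax over logits, and the synonym table and the 0.2 threshold are made-up values for illustration.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Normalize logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Normalization hides the absolute scale: shifting every logit by the same
# amount gives the same distribution.
weak = softmax([0.1, 0.0, -0.1])
strong = softmax([10.1, 10.0, 9.9])


def looks_uncertain(tokens, probs, synonyms, threshold=0.2):
    """Flag a prediction as uncertain when the top two options are close,
    unless they are known spellings of the same answer."""
    ranked = sorted(zip(probs, tokens), reverse=True)
    (p1, t1), (p2, t2) = ranked[0], ranked[1]
    if frozenset((t1, t2)) in synonyms:
        return False  # e.g. two spellings of one city: probably knows the answer
    return (p1 - p2) < threshold


# Hypothetical synonym table.
SYNONYMS = {frozenset(("Genève", "Geneva"))}

print(looks_uncertain(["Genève", "Geneva", "Zurich"], [0.45, 0.44, 0.11], SYNONYMS))
print(looks_uncertain(["yes", "no"], [0.51, 0.49], SYNONYMS))
```

The hard part the comment points at is the synonym table: in practice nobody hands you one, so deciding whether two near-tied tokens mean the same thing is its own problem.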
It’s not as simple. I trained an LLM before on exactly this, to scratch the itch of this question.
The task was simple, using the MS-MARCO[0] dataset which contains queries, search results, answers, I made a training set that has:
1. Questions paired with real results supporting them (mixed with some irrelevant results), and a correct answer
2. Questions paired only with irrelevant results, with the answer “No answer present”
The dataset was huge (close to 1M samples), and I trained using different techniques, from SFT (just mimicking the dataset) to DPO (good answer contrasted with a bad answer for the same user query) to GRPO (verifier that checks my annotations whether an answer was present or not)
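For readers who want to picture the two kinds of samples, here is a minimal sketch of how such a training set could be assembled. The `Example` type and field names are hypothetical and do not reflect the MS-MARCO schema or the exact code used in this experiment.

```python
import random
from dataclasses import dataclass

NO_ANSWER = "No answer present"


@dataclass
class Example:
    query: str
    passages: list[str]
    target: str


def make_answerable(query, supporting, distractors, answer, rng):
    """Type 1: real supporting results mixed with irrelevant ones, plus the answer."""
    passages = supporting + distractors  # new list, inputs are left untouched
    rng.shuffle(passages)
    return Example(query, passages, answer)


def make_unanswerable(query, distractors):
    """Type 2: only irrelevant results, so the target is the abstention string."""
    return Example(query, list(distractors), NO_ANSWER)


rng = random.Random(0)  # fixed seed so the shuffle is reproducible
ex1 = make_answerable(
    "capital of France?",
    supporting=["Paris is the capital of France."],
    distractors=["Bananas are rich in potassium."],
    answer="Paris",
    rng=rng,
)
ex2 = make_unanswerable("capital of France?", ["Bananas are rich in potassium."])
print(ex1.target, "|", ex2.target)
```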
Lo and behold, this didn’t reduce hallucination, rather made it much worse. Now the model started claiming “No answer present” even when it is, or even when the question didn’t need search results in the first place (simple stuff like what is X+Y).
Now you could argue that my training was basic compared to what frontier labs could do. Yet I think it hints at a more profound limitation. LLMs are finicky and don’t have a neat understanding of things from first principles (list of search results, check relevance of result to user query, if answers are below a certain threshold of relevance then don’t consider them to answer …).
tl;dr: not as simple as one might think, perhaps not attainable at all.
Thank you for sharing! Based on your experience, do you think a two-model system might fare better? For example, two models in serial where the second model is trained to "sniff out" potential hallucinations and fact check them (and possibly iterate with the first model)?
If you could write that reward function you wouldn't need an LLM, you'd just query the reward function to answer any question. You can create a benchmark and check that automatically, but you can't solve this in the general case. The model can do well on the benchmark but still give overconfident answers in areas the benchmark doesn't cover.
You can definitely tune a model to say "I don't know" more often, but it will cost you performance: the model will reject some questions that it could answer meaningfully. In the degenerate case the model could collapse into predicting that sequence always or almost always.
I guess so. Just to be clear, I was talking about post-training methods for reasoning models here, not pre-training. I think "model as a judge" should actually do okay as a "sentiment analysis" style reward for expressing uncertainty. So if none of the thousands of reasoning traces you generate reach the validated answer, you run a judge to rate uncertainty and put those reasoning traces back into the training pool.
But I guess my logic breaks down here a bit, because if there is such a thing as a validated answer, then the correct answer is in fact never uncertainty. The correct answer is to continue post training until the model gets it right. So perhaps the real answer is to create RLVR tasks where the valid answer is "I don't know" and nothing else, like this benchmark does. Or maybe that doesn't work either, no matter how many you create.
I feel as though there is some kind of philosophical lesson to be had from how hard hallucinations are to get rid of. Maybe, similarly to humans, successful models are often "arrogant" in a sense. Perhaps you just never solve an Erdős problem without some degree of self deception that it's possible for you to do so. In this line of thinking, greatness in humans is actually not related to humility, but just being so good that you actually get things right when you try. Expressing humility is of course something great people tend to do, but I'm referring to what happens under the hood.
If you squint a bit, that's kinda the trend with models. The useful ones are not that much less likely to hallucinate, they are just good enough that they tend to get it right. This comparison is of course probably not even remotely correct, but at least it's fun to anthropomorphize a bit.
If we had a theoretical technique to identify the true and objective reality we'd use it in the courts and laboratories. There is no such technique, but what we do have is 2 techniques that seem to work:
1) Has a certain standard of evidence been met?
2) Are the related arguments free of logical inconsistencies?
We can train the LLMs to do 2, and maybe even 1 to some extent (exactly what quality of evidence a computer can practically gather is limited). But that isn't going to get rid of hallucinations, for the same reason courts are hit-and-miss or the conclusions of studies often aren't very reliable. These techniques help, but sometimes they still get people to say things that, on close inspection, turn out to be nonsense. And those best-effort approaches are too much to expect for most questions an LLM will be handed which are informal, low stakes and don't need strong supporting evidence or logical rigour.
I think it is underestimated how many LLM-style hallucinations people themselves have. It just isn't obvious because most humans have a strategy of only repeating what the herd says after it has been socially vetted, which makes their individual eccentricities less obvious.
TLDR; I don't think it looks like an easy problem for RLVR, it looks technically unsolvable. Even making progress requires a philosophical breakthrough on the nature of truth so that the objective function can be established.
Well, I'd argue that this depends on the field you're investigating. Sometimes you have a way to identify objective reality and sometimes you don't. In mathematics the majority of the field is verifiable in this way. Coding a bit less, as it's intersubjective and the ideal methodology is subject to taste.
But even in muddy fields of reality like medicine, there are objective facts to be found. When someone comes into an ER with chest pain, you often find a true, undeniable reason for why that is happening. If their lung has collapsed, a coronary artery is clogged or the aorta is dissecting, even if you don't find that out it tends to be clear in retrospect. The area of reality that becomes muddy is when we use proxy signals to try to figure out who gets promoted to the expensive/harmful examinations we can draw final conclusions from, or the cases that don't fit cleanly into one bucket or the other. But very often, the gold standard truly is golden.
Of course, many realms of reality cannot be verified in this way. But I'd argue that there are quite a few that can.
> In mathematics the majority of the field is verifiable in this way.
Does mathematics count as not a hallucination though? Particularly in pure mathematics they take a certain pride coming up with wild concepts as unrooted as possible in anything relevant to human existence. The name of the game is purely about maintaining internal logical consistency - which is something an AI can do while hallucinating.
AI hallucinations in maths might be logically consistent or not be. But in that particular case it starts to get a bit iffy what we call it when someone imagines something that doesn't exist. This gets back to the thing where we can train AIs to be logically consistent, but we can't force that consistency to be grounded in any particular universe. Ie, it'll hallucinate but in a very well rationalised way - coincidentally mimicking how a number of mathematicians seem to approach life.
This is the central issue; there is a very real trade-off between facts and verifiability. Mathematics is perfectly verifiable because it is fact free. We don't have a reliable general system to verify facts. We do have reliable systems for checking arguments (logic).
If you had an llm that could accurately predict when a claim is uncertain it would be very popular, I think. I would pay for that kind of reliability tbh
> There’s some underlying physical law that prevents the existence of any algorithm of truth
Haven't heard about that law, but seems unlikely we can come up with ("discover") any sort of law that uses a concept ("truth") humans can't even agree what it means, and that's not for a lack of trying, we've been trying to figure it out for millenniums already with no end in sight.
If you accept certain axioms a priori, it’s fine. If you simply let the machine intelligence take it for granted that induction works because nature is uniform and give it some way to test its predictions, it would have all the building blocks it needs to reason out a lot of very useful information. Which as the parent comment points out, people would absolutely pay a lot of money for.
"I don't know" has positive value, presumably you could prompt further to learn more about where it got stuck. It also increases the value of correct answers, by improving confidence that answers are actually correct.
"Confidently incorrect" has negative value. At best, a human realizes the answer is wrong; at worst, the incorrect information is not identified and can cause untold damage. By having the potential to be so severely wrong, it lessens the value of correct answers because there is lower confidence in their output.
Depends on what your understanding of the product is.
If someone sold you a "Solved all your problems" machine, and it suddenly doesn't solve all your problems, then probably no, you shouldn't pay.
But the way I'm being sold LLMs, is basically "A text generator that gives your plausible-sounding human text that sometimes hallucinates and gets things wrong, based on your input", then regardless of what the outcome is, I still made use of the "Input > Output" part, which is what I bought into, so I should still pay for that.
Now of course a bunch of people will say they've been sold the former, but the companies themselves seem to be selling the latter. That's my perspective as a person who doesn't follow "influencers" and whatnot though, who seem to be selling the public on the former rather than the latter.
Let's pretend I am someone who has heard people talk about ChatGPT, but has no idea what it actually is. I go to the website and am not presented with any information, just a prompt. So I ask it what it is and what it can do for me.
My ask:
> In a couple sentences, explain to me the product I'm being sold with ChatGPT. What does it do for me?
The Reply from ChatGPT:
> ChatGPT is a conversational AI that helps you think, create, learn, analyze, and get things done faster. You can use it to answer questions, draft and edit writing, summarize information, brainstorm ideas, learn new topics, write code, plan projects, and increasingly act as an assistant that can search for information, work with documents, generate images, and help complete tasks.
> In simple terms: you're buying access to an AI that turns natural language into useful work—saving time, expanding your capabilities, and giving you an always-available collaborator for both everyday tasks and specialized knowledge work.
This sounds much more like the former, a "solve all your problems" machine.... not a plausible-sounding text generation machine.
Only two weeks ago Sam Altman said their new data center "could" be where cancer gets cured[0]. It is only the people who deeply understand AI who see it as a text generator of plausible-sounding text. That isn't what the marketing department, the CEO, or the product itself seem to be saying. I'm using OpenAI as the example here, but the others don't seem much different.
In this hypothetical case of us being new users, you now know it's a conversational AI, so you continue asking:
> Can I trust the output you give me?
And I assume it explains what to trust VS not.
I think at the bottom you should also see something like "Any text can contain mistakes" or similar too, which I know is a far cry from what some people push in the press in regards to capabilities, but I still don't see the platforms themselves as lying about this, while I do see a bunch of people constantly over-hyping the possibilities.
I don't think coming at it from the perspective of a new user is that hypothetical. All current users were new users in just the last 3 years. There are still a significant number of people who have heard of it, but haven't used it, or are still very new to it.
I'm not sure why "can I trust the output you give me?" would be a logical followup to the first response it gave me, seeing as its response didn't say anything about hallucinations or mistakes. It said it could do "useful work" with all kinds of examples, including "specialized knowledge work".
The note under the text field, in gray so as not to draw the user's attention, feels more like a CYA line from the lawyers than an instruction they really want users to take to heart. That line also doesn't appear on the main home page. It only shows up after the first prompt is submitted and focus shifts to the conversation. I don't think a CYA line in gray fine print is enough to make users understand it's a plausible-sounding text generation machine instead of an answer machine. Even if I ask that point blank it gives a wordy... yes, but not really, it's being debated by philosophers... response.
The marketing materials are very much the former though. From claude.com:
> If you can dream it, Claude can help you do it. Claude can process large amounts of information, brainstorm ideas, generate text and code, help you understand subjects, coach you through difficult situations, simplify your busywork so you can focus on what matters most, and so much more.
What marketing copy have you read for LLMs that is like you mentioned?
> But the way I'm being sold LLMs, is basically "A text generator that gives your plausible-sounding human text that sometimes hallucinates and gets things wrong, based on your input"
I would be very willing to pay more! The choice between “you may get a correct answer, or you may get lied to, without a clear way to distinguish between the two” and “you may get a correct answer, or a clear indication that the answer was not found” is pretty clear. One is a much more useful tool than the other. I don’t see any real incentives for companies making LLMs to keep their AI factually unreliable. (Full disclosure: I work for one, but I’m definitely not in the rooms where such decisions would be made.)
You don't have to literally send a null token. Train it to generate text that summarizes the evidence that is there along with the uncertainty of the final answer to a prompt.
Transformers are not Markovian, their whole point is arguably to be the reverse of Markovian, to efficiently make it so the new tokens are a function of all previous tokens
> It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer. DeepSeek V4 Pro (1.6T params, 49B active, 44 AA Intelligence Index score) has a ludicrous 94% hallucination score on the AA-Omniscience benchmark, meaning on questions that it couldn’t figure out, it only stated that it didn’t know around 6% of the time, and the rest it confidently hallucinated an answer. GLM-5.2 scored a 28% hallucination rate, Opus 4.8 was 36%, Fable 5 was 48%, and GPT-5.5 was 86%.
Wow! I already knew from previous research shared here that hallucinations are a fundamental problem for LLMs and likely to be unfixable, just like prompt injection, but I didn't realize the hallucination rates were so bad!
Everyone has been acting like the best models only hallucinate in edge cases, but even the best performing one mentioned here - GLM-5.2 - has a hallucination rate of 28% when it doesn't "know" the answer to something.
That said, I think the title on the blog - "Bigger models are not the way" is probably more fitting and touches on what should be even bigger news. If bigger models and bigger training sets have already stopped producing proportional returns, then it seems likely we are already near the top of the S-curve. That's huge news, considering the valuation of companies like OpenAI and xAI is largely based around the (absurd) idea of ever increasing scaling from these models.
The name "Engram" (n-gram) says it all - this is just another type of statistical word association, not a factual knowledge store.
While DeepSeek describe this as "knowledge lookup", what Engram is really trying to do is separate dynamic reasoning from static pattern recall, with the static patterns just being word-level n-gram statistics, not declarative facts/knowledge.
Just because 2-3 words often appear together in a sequence doesn't mean they represent a fact or truth (or falsehood) - it is just an n-gram statistical regularity.
If Engram helps reduce LLM GPU memory and FLOP requirements then that is great, but it's not a solution for Hallucination.
Agreed on the title, my bad! But yeah, I've had some truly terrible experiences using these "frontier" models in coding agents especially, where they just fabricate facts about codebases.
> GPT-5.5 and DeepSeek V4 Pro are two of the clearest hallucination leaders, despite being absolutely huge. Because of their immense size they simply did not learn how to say “I don’t know” or recognize intricate logical and technical fallacies.
This implies that bigger models are more likely to hallucinate? That doesn't match my experience.
I think it implies they are more likely to hallucinate if they don't know the answer. So a big model will return the correct answer more often than a small one, but in the cases where it doesn't, it will be more likely to make something up instead of saying "I don't know".
> A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling. The limits of this paradigm were put on the world’s stage when Claude Fable 5 was restricted by the US government just three days after its release, marking the first US AI ban stemming from national security. One of the biggest models in the world was banned because a single jailbreak was too much of a risk.
Such a weird thing to start with. The legal status of Fable does not mean that it's not intelligent. If anything, the problem is the opposite, someone thinks it's too intelligent (and/or that Anthropic wouldn't share its last gen intelligent models on the terms the government demanded).
I think hallucination rates are not a matter of model size but depend on the training of the model. Models have been trained on a huge corpus of material that overwhelmingly contains well-formed questions and well-formulated, correct answers. This is typically the case with books, where the material is highly curated by experts in the field. In a book you never see a question that admits no answer, with the book just reasoning about and explaining why and how the question has no answer.
Nor will you see a good question with the book candidly explaining that it doesn't know the answer, because the way book material is curated, the author will omit any question for which they have no answer.
In addition, I think that during RLHF, the labs have a bias for interesting questions that admit a solution and underrepresent the "bad" questions that admit no good answer. They also probably put less effort into RLHF on questions where the model should admit it doesn't know.
As humans we have been trained all our lives, in the real world, by being confronted with questions we can't answer right away, and we have learned to assess very quickly that we don't know or are not sure about the answer.
Another thing we have and LLMs do not is fear. We have an amygdala in our brain, separate from the logical thinking part, that can raise a fear signal so that we become much more careful about what we say. An LLM, on the other hand, has no fear organ like the amygdala and just learns to respond based on the patterns in its training corpus. It never "fears" looking bad or being fired because it gave a wrong answer, so it can merrily give perfectly wrong answers.
So hallucination rates can be improved with training, but currently the labs are not optimizing for that because there is a high-stakes race to get the most intelligent and capable model.
Alternatively, I can see creating a separate amygdala-like organ for an LLM that asynchronously fires a signal, based on the user prompt and the LLM's thinking trace, injecting a fear signal into the LLM's reasoning so that it can steer its answer toward something safer.
I'd definitely agree that it isn't directly model size, but there is the fact that a larger model in terms of parameter count needs a large amount of training data to not overfit or underfit. So I think this race to the top of "max training data size" has kind of led to unintentional overfitting, not catastrophically, but enough to trigger this perceived omniscience within the model
Yes, that's when we are mindful: we see the fear arise in our mind, but we don't directly act on it. Instead we understand it and reason about our options and the consequences.
However the fear has to arise in the first place, to raise the alert.
I wonder if this is what a “Minimally Viable LLM” looks like. I often wonder how much of an LLM do you need before you can just shove a bigger context Window and any dynamic knowledge content to it like a PDF or markdown file to give it knowledge outside of its training data. I feel like LLMs don’t need more data they just need to be refined.
Purely anecdotal, but when OpenAI removed Codex-5.3 from the ChatGPT sub and forced me to move to GPT-5.5, the result was far worse than what I was enjoying with Codex.
And, of course, it was burning 10 times more tokens for this output.
I have the opposite experience with codex 5.3: I had to use 5.2 to design and 5.3-codex to execute, while 5.4 was better at both, and 5.5 (all used at xhigh) is even better.
Yeah they are 100% in the wrong for removing the fine tuned codex models. It makes sense why they wouldn't want to allocate so many resources towards fine tuning but still the enshittification of GPT models is real
Huh, the fine-tuned "codex" variants always seemed like "quick specific edit" prototypes that weren't meant for real use. They worked OK when you were very specific, but besides that, nowhere close to GPT5.X and the other "real" models.
> With GPT-5.5 it tries to play smart, takes much more times and is often stuck on basic stuff that DeepSeek solves oneshot.
Do you have any session logs or similar that show this? Never once, since I started using the codex TUI when it became available, have GPT models gotten stuck on something another model breezes through. I quite literally run every prompt I do through multiple providers, so this would be very visible very quickly for me.
I remember trying every -codex variant of the models and could never get them to be productive for tasks taking longer than 5-10 minutes, compared to GPT 5.5 which quite literally worked through the night (with the /goal feature), and actually had something valuable and useful in the end this morning that wasn't exploding in LOC and complexity. I don't think any of the -codex variants would have been able to do this at all, based on how they worked when I last used them.
DS v4 is an undertrained snapshot, which is mentioned in their model card. The full version is supposed to be released later and have multimodal input. That said, hallucination rate likely depends on the training policy and different optimization tradeoffs a lot more than on the scale.
They're basing this all on public benchmarks, which stopped being a reliable indicator of anything in the last 2-3 years. Of course it'll be filled with more terrible lines of thought.
People really need to stop placing such importance on public benchmarks. They're valuable for comparing very close models, useful to evaluate if quantization and similar have negative impact, but you're not gonna be able to tell if one model is better than the other based on one scoring a few percentage points higher than a completely different model.
My anecdotal experience differs (though I hold ground that LLM evaluations are highly subjective and benchmarks are just as useful for LLMs as they are for dating websites users).
GLM 5.2 tends to stray way more than 5.1. It also subtly hallucinates things: it morphs requirements and draws unfounded conclusions. This is not something I have experienced in any model I've seen so far.
In coding it's especially annoying because it steers the whole request. E.g. I give the instruction "make me a Rust-WASM-Canvas app" and GLM 5.2 goes like "Oh, the user surely doesn't mean that. I'd better build a Dioxus app instead".
GLM 5.2 is great, but it deteriorates heavily once the context window gets past 200k tokens.
I've had more success with creating a plan first and then implementing it in (short-lived) sub-agents.
Ironically, good software architecture patterns (small functions, single responsibility) heavily impact the performance of these models as well. They do surprisingly well in well-architected codebases.
They do very poorly in anything that's a mess where Opus and GPT 5.5 still get reasonable performance.
Yeah the benchmark for sure isn't perfect and without super rigid prompting it is far too easy for it to get off course. 28% hallucination rate isn't nothing either
if you're benchmaxxing then maybe bigger doesnt always mean better, but for general intelligence and big model smell, that couldn't be further from the truth
the oss models are impressive but it's pretty clear how quickly they fall off when you try to use them outside of a narrow set of problems they benchmarked well on when compared to opus/5.5
If we're hand-waving away an open source model from a Chinese lab, one you can use a nearly unlimited amount for <100/mo, over a 9% difference from the premier American frontier model, which is unavailable now and was expensive when it was available, we've already lost.
The fact that a huge number of parameters may lead to worse hallucinations is something I didn't think of. Would this somewhat imply that DeepSeek V4 Flash should be less prone to these issues?
Surprisingly not! It is the biggest hallucinator on the AA Omniscience Index, just 2pp away from V4 Pro. I think this is partially due to the fact that Flash was trained on >32T tokens just like Pro despite being almost 10x smaller - it seems somewhat likely it was overfit.
> meaning on questions that it couldn’t figure out, it only stated that it didn’t know around 6% of the time, and the rest it confidently hallucinated an answer.
From how they measure it, a model that simply answers "I don't know." to any prompt would be the one that hallucinates the least. So it's not surprising at all that a smaller model can perform better.
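To make the measurement point concrete, here is a minimal Python sketch of the metric as the quoted passage describes it: among the questions a model did not get right, the share it answered wrongly instead of declining. The function name and the example counts are made up for illustration; this is not Artificial Analysis code or data.

```python
def hallucination_rate(correct: int, wrong: int, abstained: int) -> float:
    """Share of not-correct responses that were wrong answers rather than abstentions."""
    not_correct = wrong + abstained
    if not_correct == 0:
        return 0.0  # nothing was missed, so there was nothing to hallucinate on
    return wrong / not_correct


# Hypothetical counts over 100 questions (illustrative, not benchmark data).
always_abstains = hallucination_rate(correct=0, wrong=0, abstained=100)
strong_but_overconfident = hallucination_rate(correct=60, wrong=38, abstained=2)

print(always_abstains)           # 0.0
print(strong_but_overconfident)  # 0.95
```

Under this definition the model that knows nothing but always declines scores a perfect 0%, while the model that gets 60 questions right scores 95%, which is why the rate alone says little about overall usefulness.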
> One of the biggest models in the world was banned because a single jailbreak was too much of a risk.
We really don't know what the actual reason is given the politics at play. I would bet more on the Trump administration looking for any excuse to punish Anthropic
>GPT-5.5 and DeepSeek V4 Pro are two of the clearest hallucination leaders, despite being absolutely huge. Because of their immense size they simply did not learn how to say “I don’t know” or recognize intricate logical and technical fallacies. While it is true that a multi-trillion parameter model will always beat a lightweight consumer model on paper (today at least), the commoditization of these huge models is blurring the line between benchmark performance and actual real-world truthfulness and accuracy.
What about using two models, with a smaller model used for this kind of negative reasoning?
Could you explain what you mean? That feels like a waste of processing to me. Yes, the model will correct itself once it eventually runs a compiler/linter. But that's still wasted time and compute.
Calling llm slop "hallucinating" is so counter-productive imo. After all, LLMs are just a variant of markov chains and as such this technology isn't able to discern falsehoods from truths. It's like trying to use a barometer to tell the time.
Well the difference here is that you're overly simplifying complex biology and many other factors whereas llms are in fact actually simple mathematical models. As always, the devil lies in the details. Dismissing intricacies is a useful tool for daydreamers, not so much for engineers.
Because AI company executives and devoted vibecoders constantly make egregious claims like "programming is fully solved" and even straight up "hallucinations don't exist on frontier models"
I agree, but I was responding to the question of why people expect LLMs to be like the star trek computer, and the answer is "because people making and promoting those LLMs claim they are like that"
It is unclear if GP is referring to the global we or the HN we. I leaned towards the latter and injected our knowledge and understandings into the basis for my comment. HN recognizes what's going on
Because this is how LinkedIn “specialists” promote LLMs. The same specialists who were shouting about crypto a few years ago, then about NFTs, are now shouting about how coding, architecture, accounting, law, medicine and basically every white collar job is solved and you just need enough money to pay for Opus/GPT.
Yeah it has been looked at e.g. in [0]. They separate that from lying, but I think for the LLM context it should be included. To me the difference is humans do not bullshit at the same rate, and I can find out over time who tends to bullshit more and exclude that person's info from my pool.
> Why is everyone expecting LLMs to be like the Star Trek computer?
Because they are often marketed as magic AIs, not as mere language models.
> it is clear that actual intelligence has plateaued significantly.
> Moving forward, the industry cannot continue to train bigger and bigger models since their intelligence not only plateaus but often will get worse
These are wild claims - why are we concluding that bigger models and more data = more hallucination? That’s actually the opposite of what’s been happening over the last couple years. Some models may still hallucinate more but they all hallucinate much less than the original 175B ChatGPT which was smaller and trained on (much) less data than anything current.
Edit: My mention of data comes from this quote:
> A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling
My take on the current situation: it seems clear that the industry has seen that there is still a lot left to squeeze out of sub-1T models. But for that you do need more, high-quality data in the distribution which you want to unlock capabilities for.
Yeah not only is it totally unsubstantiated, the benchmarks are getting less useful to really show the difference between these models. Big model smell is still a thing and GLM 5.2 while impressive is not Fable class.
Here is something I would like people to chew on. Perhaps the smartest researchers in the world across multiple labs know more about this than we do? Perhaps they are aware of issues like the data wall and diminishing marginal returns. And perhaps they are being honest when they tell you there is no wall?
> why are we concluding that bigger models and more data = more hallucination?
That’s not what your quotes said. They said bigger models = plateau in intelligence, nothing about more data or increased hallucinations
The relevant quote for what you’re talking about would be:
> It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer.
So there’s two separate claims: 1) bigger models have plateauing results 2) models trained on larger amounts of factual data have a higher hallucination rate
I’m pretty sure #1 is well known; I think OpenAI’s own research on scaling laws showed diminishing returns on parameter count and training data volume years ago. I don’t know what the support for #2 is besides the actual post contents.
I find these internet arguments talking about LLMs as if they are trained by reading the internet to be wild.
Yes, pretraining still exists. But for the past few years, pretraining by reading the internet is just the initial bootstrapping of LLM training. The RL training they get from bespoke training data, with very very different characteristics than what these armchair analyses claim, dominates these days.
I'd have to imagine there are wildly diminishing marginal returns to additional SFT/post-training passes.
There are a bounded number of (useful) derivations/combinations of Duff's device.
If Frontier Labs wish to reduce hallucinations on factual things, they will have to hire people (or the data providers will need to) to do fundamental research above and beyond what is available in extant literature and the web. I.e., if the LLMs want to lower precision error, they need to go out and actually find more expertise. If the wikipedia page for Pompey lacks data, where are they going to get it from? How would they even _identify_ that the page has holes?
Yes, they can digitize more books but that is untrustworthy data - if there were enough eyeballs on a particular work, it would be on the internet. If it's not, they'd need to hire the experts themselves. They need expert reviewers in virtually every interesting topic, which fundamentally is an intractable problem, especially since things change all the time. Maybe even uninteresting topics, too?
I dunno, it doesn't seem to me "more data" is the magic bullet here. Yeah, it will "help" but we're already on the flat part of the S shaped curve.
My take from trying to understand this stuff is some sort of algorithmic improvement is necessary to get another step change in how well LLMs perform in this area. I could be wrong!
As a side gig, I write novel software that solves problems no existing software does, that existing LLMs have difficulty reproducing, purely for the purpose of existing as LLM training data.
There are journalists being hired to write Atlantic-worthy articles that exist only as LLM training data, because they're getting paid more than the Atlantic would pay them for it.
It's insane.
Yes, they are hiring the experts themselves. To create new knowledge above and beyond what's on the internet. To be locked away as LLM training data.
The largest characteristic of all of this new data is it is targeted at LLM's weak points.
It's not just more data, it's custom tutorials built for what LLMs struggle at.
jmalicki says many things, among them being
"As a side gig, I write novel software that solves problems no existing software does,"
and
"Yes, they are hiring the experts themselves. To create new knowledge above and beyond what's on the internet. To be locked away as LLM training data."
More likely you're joking and/or paranoid!8-))
I'm not saying they are not trying - I'm saying we're inventing new problems faster than any Lab can:
1) Identify the gaps
2) Determine how to fix them
3) Implement a fix (especially if that fix is: identify and find experts)
4) And judge the result
How do they know [person] is an expert in [some field]? How do they find that person? How many experts are necessary to give the right information? How do we evaluate the results, especially if it's novel?
You can find a lot of people who disagree on many topics, and those turtles go all the way down.
I'm not in disagreement that your work will help reduce hallucinations and improve model performance! It is.
I predict (I hope I'm wrong!) that we're going to hit some asymptote that is not at 0% hallucinations (and I would even put a substantial nonzero probability that "overall" hallucination rate bottoms out at some minimum and then slowly grows because we just can't keep up with the new garbage we throw at it).
> How do they know [person] is an expert in [some field]? How do they find that person?
You just stumbled upon billion dollar businesses: Mercor, micro1, Scale AI, Surge AI, etc
> How do they know [person] is an expert in [some field]? How do they find that person?
They have a PhD from a top school, they are a licensed attorney, they are a licensed physician, a board certified cardiologist, etc.
They are constantly recruiting from these populations with well-paying side gigs.
> 4) And judge the result
That's what they pay the experts for. And to have experts review the other experts with peer review.
> You can find a lot of people who disagree on many topics, and those turtles go all the way down.
Which is why everything has to be well-calibrated and not just a hot take - a well reasoned opinion any expert would find fair.
No one is really caring about hallucinations on point facts these days though, it is much more about complex reasoning tasks. Can they move the bar on the complexity of software LLMs do on their own? Can they get to a point where LLMs can begin to replace physicians? Financial advisors? Actuaries? etc.
Ahhhh! the ever-present omniscient "they" of paranoia!
But be careful. They are watching you now and they don't want you giving away their secrets!
> No one is really caring about hallucinations on point facts these days though, it is much more about complex reasoning tasks.
The boundary is pretty thin there though. E.g., Gemini recently told me that a certain paper claims that two frameworks are mathematically equivalent, while the paper shows the opposite, and yesterday Google's AI overview told me that no World Cup matches were scheduled for that day despite there being several of them. The model probably used complex reasoning to arrive at both (incorrect) answers, but superficially they look like basic errors of fact.
That is a great example of the kind of thing they're paying people to create as training data.
You write the prompt, and then write rubrics to judge the responses, and you found something the model failed at. Congratulations, you just earned $500, now do it again.
1. How did you land the side gig? Mercor or a lesser-known brand?
2. What criteria do such vendors typically require?
I've done Mercor and other brands - the contracts move around, since the labs want the vendors to know they're just vendors and have to compete with each other. It seemed to be roughly resume and interview similar to getting hired at a senior role at FAANG or adjacent.
What kind of programs? Can you give an example of the tasks?
Outside of games and coding, generating enough valid examples and counter-examples to harness the power of RL is cost-prohibitive.
Which is why rubrics as rewards are used.
Where do they get the bespoke training data from? And how much? I don’t really know anything about this.
> And how much?
Mercor, one of the larger vendors for contracting with experts to create bespoke data, says on their webpage they're paying $3M/day to their contractors for data.
So well into the billions of dollars a year for bespoke training data.
That's also ignoring the RLVR data labs can get from software - they can use the vibe coding sessions as training data as well without paying more.
They are just one of many.
Companies like Mercor sell data from human experts
Offhand, do you know what format that data is in? Is it a question and then a human answering that question? Mostly just curious as to what the training data consists of.
The most advanced training data is in the form of rubrics as rewards.
A human asks a question, then writes rubrics to judge the LLM's response, so rather than evaluating a specific response, those rubrics can live on as the LLM evolves and gives different answers. There are more complex variants as well, but that's the basic principle.
https://arxiv.org/abs/2507.17746
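A minimal Python sketch of the basic principle described above, assuming the simplest aggregation (a weighted fraction of rubric items satisfied). Everything here is hypothetical: the `RubricItem` and `rubric_reward` names, the example rubric, and the keyword lambdas, which only stand in for the expert or LLM judge a real pipeline would use. It is not taken from the linked paper or from any vendor's tooling.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RubricItem:
    description: str
    weight: float
    check: Callable[[str], bool]  # stand-in for an expert or LLM judge


def rubric_reward(response: str, rubric: list[RubricItem]) -> float:
    """Weighted fraction of rubric items the response satisfies, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total if total else 0.0


# Hypothetical rubric for the prompt "Were any World Cup matches scheduled today?"
rubric = [
    RubricItem("Cites a schedule source", 2.0, lambda r: "schedule" in r.lower()),
    RubricItem("Avoids unsupported certainty", 1.0, lambda r: "definitely" not in r.lower()),
    RubricItem("Gives a direct yes/no answer", 1.0, lambda r: r.lower().startswith(("yes", "no"))),
]

old_answer = "No, there are definitely no matches today."
new_answer = "Yes, according to the official schedule there are several matches today."

print(rubric_reward(old_answer, rubric))  # 0.25
print(rubric_reward(new_answer, rubric))  # 1.0
```

The rubric is written once and scores any later response, which is the property the comment points at: the same rubric keeps working as the model changes its answers.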
Meta has reallocated a significant portion of their staff to generating this
Meta also reportedly took a 49% nonvoting stake in Scale AI in June 2025 for about $14.3–$14.8 billion.
let me take down armchair analysis with my armchair analysis
> That’s not what your quotes said. They said bigger models = plateau in intelligence, nothing about more data or increased hallucinations ... I’m pretty sure #1 is well known
Well known in a multiverse branch where Fable was a dud?
No, well known in the current multiverse branch where we still occasionally use things like math and scientific analysis instead of people’s vibe checks and pelican SVGs.
Here’s the paper from OpenAI where Dario himself was a co-author: https://arxiv.org/pdf/2001.08361
> We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count N, dataset size D, and optimized training computation Cmin, as encapsulated in Equations (1.5) and (1.6). Conversely, we find very weak dependence on many architectural and optimization hyperparameters. Since scalings with N,D,Cmin are power-laws, there are diminishing returns with increasing scale.
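The "diminishing returns" in the quoted passage can be illustrated with a short Python sketch of a parameter-only power law, L(N) = (N_c / N)^alpha_N. The constants below are approximately the values reported in the linked paper and should be treated as illustrative; the function name and the chosen model sizes are assumptions made for this example.

```python
# Parameter-only power law: loss(N) = (N_C / N) ** ALPHA_N.
# Constants are approximate and used for illustration only.
ALPHA_N = 0.076
N_C = 8.8e13


def loss(n_params: float) -> float:
    """Predicted loss for a model with n_params non-embedding parameters."""
    return (N_C / n_params) ** ALPHA_N


sizes = [1e9, 1e10, 1e11, 1e12]
losses = [loss(n) for n in sizes]
# Absolute improvement bought by each successive 10x increase in size.
gains = [a - b for a, b in zip(losses, losses[1:])]

for n, value in zip(sizes, losses):
    print(f"N={n:.0e}  loss={value:.3f}")
print("absolute gain per 10x:", [round(g, 3) for g in gains])
```

Each 10x step multiplies the loss by the same factor, so the absolute gain shrinks at every step: the loss keeps falling, but more slowly for the same multiplicative increase in size.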
> instead of people’s vibe checks and pelican SVGs.
Right, what happened is everyone went to Fable and asked it to make the very best bicycle pelican SVG, no mistakes. And Fable's bicycle pelican SVGs were such timeless masterpieces, we all instantly got AI psychosis. Happily, you were immune to this.
Yeah #2 may be incidental. Suppose one lab focused on bigger, and another on reinforcement training geared towards factual accuracy over sycophancy. You could easily wind up with a model from the second lab that is less powerful but more accurate.
I can’t prove it but I suspect there’s a bit of that going on.
I think one problem is that models can hallucinate often and still succeed a few times out of 8 or 16 attempts, which is enough to get good results on benchmarks, most of which measure success within the top k. From a benchmark perspective, you don't really care whether 15 of your 16 generations failed, as long as one succeeded, but as a user you mostly care that the 1 out of 16 you get is actually the successful one. I think this effect is easier to see on Gemini Flash: it hallucinates like crazy, but that looks like it's by design to boost benchmarks.
Aren't hallucinations also heavily influenced by compute and memory capacity? I.e., companies can spend more time verifying results in an agentic format, spend more thinking tokens, and use less quantization. All of these depend heavily on compute and memory but are proven to decrease hallucinations.
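The gap between the benchmark's view and the user's view can be put in numbers with the commonly used pass@k estimator. This Python sketch assumes the benchmark scores "at least one of k samples is correct"; whether any particular benchmark or vendor actually reports it this way is not established in this thread.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generations of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


n, c = 16, 1  # 16 generations, only one of them succeeded

print(pass_at_k(n, c, 16))  # 1.0
print(pass_at_k(n, c, 1))   # 0.0625
```

With one success in 16 generations the pass@16 score is perfect, while a user who sees a single generation gets the good one about 6% of the time.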
Maybe GPT 5.5 is heavily nerfed due to lack of compute, memory, and energy?
I agree that it's far-fetched to conclude that bigger models have plateaued.
The article specifically talks about this: DeepSeek spending significant test time with worse results than GLM.
> A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling
I'm pretty sure it's mostly due to the training data quality. No idea why this never gets mentioned in those discussions.
It was obvious right from the get go, that the scaling law just enabled some abilities, that were described by the underlying data and allowing the ANN to abstract it in the latent space.
>> it is clear that actual intelligence has plateaued significantly.
> These are wild claims -
Indeed, it is not clear there was any actual intelligence at any point.
A lot of generated content sure, sometimes even useful, but not necessarily anything more.
What is the definition of "actual intelligence"? How does it differ from regular intelligence and non-intelligence?
If someone can "design a custom asyncio event loop policy in that overrides get_child_watcher()", I would call that person intelligent. Does that mean that person is not actually intelligent but a mere content creation machine?
Traditionally if you can create content, this shows you're intelligent. Created content is often called "intellectual" property. If a person can understand complex ideas and make connection between them, that is considered intellectual work. You have to be intelligent to do intellectual work. If a person can solve problems, this is also called intelligence. If the person can solve more complex problems, that person is said to have higher intelligence. This is often measured with a scale called IQ (Intelligence Quotient). There are other types of intelligence but they are basically the variations of the same ability. Most definitions of intelligence also involve an ability to adapt into the environment.
Since intelligence is such a broad concept what exactly is the difference between the actual intelligence and AI, other than one is natural and the other one is artificial?
I understand being anti-AI because of the very real societal concerns. But ignoring what is in front of you is not a solution.
Isn't that the case of over fitting? You have more data, but when you ask something that's not in that data, hallucinations happen
>These are wild claims - why are we concluding that bigger models and more data = more hallucination?
Because that's what they measured in this case.
How do we know GPT 5.5 is a bigger model?
Since it was created by _Open_AI surely it's really open and we can check, right? SCNR
to train models to be smarter than they are, one needs examples and cases to train on, and once you get close to the top percentiles of human reasoning there is extremely little such material available.
You can create contrived logic problems, but they often turn into language games because English is not formal logic.
And you can train on "monty hall" style problems, but those too are language games that are intriguing to humans but obvious when framed slightly differently.
In other words, model trainers are fighting against the overwhelming mediocrity of the training corpus (all of the recorded human output from history).
As models improve, the next phase will be models co-designed with humans to overcome these limits. The way we use language and the process we use to problem solve (we currently call this "orchestration") will evolve as part of this. Meatspace metaphors map badly when we have massive context and don't need the same limits. How different is hallucination from extrapolation, etc.
Much of the skepticism and confusion about LLMs is no different than a person of average intelligence hearing a highly intelligent person explain something and considering the explanation gibberish, then arrogantly accusing the intelligent person of being unhelpful.
Much like dogs were domesticated from wolves to have traits that make them good around humans, LLMs will evolve around our limits, around our arrogance, around our aesthetic biases and prejudices. Intelligence and rationality is fundamentally not what most humans want from an LLM.
My impression is that the fundamental issue is that LLMs attempt to extract reasoning (executive execution) from data (relationship between tokens).
There's an open question about whether this is theoretically possible, but it doesn't seem like it to me.
Human generated data is an effect of reasoning. Attempting to extract executive function from it is kind of like taking an anti-derivative of a function.
This has always seemed like the root of hallucinations to me. It sort of follows the parallels to lossy compression that a lot of people draw. You're extracting some characteristics by observing the relationship between tokens, and then trying to argue that those characteristics are equivalent to the thing that generated the original tokens.
Surely there's some sort of overlap there, but viewed that way, it seems obvious that more and more parameters and scaling won't solve the fundamental problem. There's only so much meaning you can extract from token relationships.
It's like trying to derive the shape of a flame from the smoke it produces.
The original intelligence that created those tokens was driven by a whole universe of inputs, from hormones to starlight to gravity, not to mention all of the strange things about consciousness and parapsychology that is so poorly understood.
The machines are definitely useful for a certain class of tasks - those that don't require much executive function, and the useful work mostly involves pattern matching.
The problem is, we seem to be mistaking effect for cause and imagining that these things have greater capabilities than they'll ever possess.
The investors that don't understand this are indeed going to learn a bitter lesson.
You mixed two random quotes from the article to create a strawman.
Of course you knew what you were doing, but it's disappointing that this was the top comment.
In cognitive science, it appears your brain has two modes of thinking:
- A very parallel type of computation that is fast and generally accurate and integrates hundreds of variables. It’s sometimes labeled as intuition or system 1 thinking.
- A much slower, step by step, analytical type, commonly linked with your pre-frontal cortex (one of the newest parts of the brain). Sometimes called system 2 thinking.
Maybe the way the universe works is that all computation more or less is one of those two types. In which case, an LLM alone is only the first part, which is often right but its results also cannot ever be proven.
An LLM is not thinking; assuming that it is and relating it to thought and universal truths is nonsense.
We inflicted that on ourselves by picking the most confusing terminology ever. "No, reasoning isn't thinking. No, when the model says it thinks it's not actually thinking... No, an agent isn't actually a creature with agency... No, when we say it hallucinates it doesn't, like, actually hallucinate"
What were the alternatives?
Did you mean sentient?
Artificial Analysis says GPT-5.5 xhigh scores highest on AA-Omniscience accuracy. The article focuses on hallucination rate instead of overall accuracy. Those are different things: a model can answer more questions correctly overall while still being worse at abstaining when wrong.
Curiously, this post and article is the only submission and interaction the OP has made, and these claims support the product he's intending to release.
Hallucination rate scores are a little tricky to interpret because they're conditional on the model not knowing the answer. That means they don't measure the probability of your encountering a hallucination in everyday use, since that also depends on the probability of the model not knowing the answer, as well as how well your distribution of tasks aligns with the distribution tested in the eval.
I'd also hesitate to attribute this difference in hallucination rates purely to model size. Yes, GLM-5.2 hallucinates much less frequently than DeepSeek-V4 Pro with twice as many parameters, but DeepSeek-V4 Flash is less than half the size of GLM-5.2 and tops the AA-Omniscience hallucination index. Opus 4.8, which is likely larger than DeepSeek-V4 Pro, has a 36% hallucination rate on the index, above GLM-5.2's 28%, but way below the DeepSeek numbers. Opus also has a 47% accuracy rate vs GLM-5.2's 25%. If you use these numbers to calculate the absolute hallucination rate (i.e., the number of hallucinated responses divided by the total number of responses), you get 19% for Opus and 21% for GLM-5.2.
So yes, all else equal larger models may be more prone to hallucination in scenarios where they don't know the answer, but there are a lot of other factors that affect hallucination rates, and it's not totally clear that this is the main metric that's worth tracking.
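The absolute-rate arithmetic in this comment can be reproduced with a few lines of Python. The sketch assumes, as the comment does, that the published hallucination rate is conditional on the response not being correct, so the absolute rate is (1 - accuracy) times that conditional rate. The function name is made up for illustration; the percentages are the ones quoted above.

```python
def absolute_hallucination_rate(accuracy: float, conditional_rate: float) -> float:
    """P(hallucinated response) = P(not correct) * P(hallucinates | not correct)."""
    return (1.0 - accuracy) * conditional_rate


# Figures quoted in the comment above.
opus = absolute_hallucination_rate(accuracy=0.47, conditional_rate=0.36)
glm = absolute_hallucination_rate(accuracy=0.25, conditional_rate=0.28)

print(f"Opus 4.8: {opus:.0%}")  # Opus 4.8: 19%
print(f"GLM-5.2: {glm:.0%}")    # GLM-5.2: 21%
```

The ordering flips once accuracy is taken into account: the model with the higher conditional rate ends up with the slightly lower share of hallucinated responses overall.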
I’m not disagreeing with you but at the same time, models don’t “know” anything in that binary sense. I’m not trying to get in the woods here, I genuinely mean that what you pass off as a simple explanation is actually incredibly nuanced. A fact appeared once in training data , a fact never appeared in the training data, a fact appeared ten times, a fact appeared a thousand times. Which does the model know? Facts aren’t stored as-is, they’re all broken down into their components and compressed in the weights. “Similar” facts that didn’t appear an overwhelming number of times get bundled together and eventually conflated. But then what is a similar fact? Which facts were entirely ablated vs which were bundled together with others effectively poisoning the pool but also giving it inference strength? The model doesn’t know anything and can never know what it knows or doesn’t know.
I often wonder how humans "know" things. I suspect (ignorant armchair) we have some ability to signal strength of those facts, via repetition. Without this layer of introspection i imagine LLMs can never avoid hallucination.
It obviously breaks down with humans too, given we so easily hallucinate and confuse things we "know". However i still suspect we're more reliable at probing information we've experienced vs not. Even in the case of poisoned knowledge, eg a crime scene accidentally implying information to a witness that the witness doesn't actually know, we still "know" that poisoned information via incorrect inference. Ie we "experienced" it.
Wonder what architecture would allow for this style of information/weight probing for an LLM.
Additionally, maybe it's easier for a model to realize that it doesn't know the answer when the question is easier.
If Opus gets all but the hardest questions right, it might have a higher hallucination rate because the questions it gets wrong are the questions where verification or hallucination detection are the most difficult
I guess you can test that on hypotheticals. Ask about things after the knowledge cut off that never happened. Or ask things that are genuinely unsolvable.
Hallucination should be called "failure to ground".
Something about the cost model of US near frontier has the cattle prod out whenever a model is uncertain but thrashes on whether to search. Search flinch is roughly all hallucination.
I don't even wait for the model's turn, if there's a man page or Hoogle hit, stuff the last prefix cache cut point. You come out ahead.
This is missing a common failure mode, which is information past the knowledge cutoff. If you need info past that time they'll fail no matter how big or small the model is, so the hallucination rate can matter independently of the knowledge base. If all use-cases had a uniform risk of falling out of support, this would be a valid argument, but since it's often the case that a datapoint is guaranteed to fall out of support, the absolute ability to recognize that is crucial.
Those numbers are abysmal. Should we really be using LLMs to write our code? I have a theory- LLMs can spit out code that gets the job done and looks ok, maybe even great, but contains small “anomalies” that compound over time. An enterprise app developed entirely with LLM-happy devs might end up virtually unmaintainable.
I’m not sure how to explain it, but the more I see LLM-written code the more I feel it’s bad code doing a good job of masquerading as good code. I think this take will become less-hot in the next year or two when we see enterprise greenfield projects that were created entirely with LLM “assistance” go to prod. I think we’ll find that the code is difficult for humans to read, understand, debug, and extend- and I think the larger the codebase the harder it will be for LLMs to maintain. More opportunity for hallucination, larger context windows needed, more tokens bought and spent for smaller and smaller code changes. I think the more code an LLM writes for an app, the worse that codebase becomes.
I can't help but feel that people continually underestimate how bad human written code becomes over time. The exception is probably single-person passion projects or open source projects that maintain quality governance over time.
I strongly suspect most closed source code developed under commercial or internal pressure is pretty awful after a few years of development.
All LLM code has to do is suck less than existing code. And that's presuming the code quality doesn't improve as the models, the harnesses and our ways of working with them improve.
Sucky human-written code is still based on human understanding, which can change over time, be readjusted or solidified. People implement something wrong once, then update their perspective, then in the future do it right.
LLMs don't have this benefit. You forget to add the correction to the system prompt, and the LLM will repeat the same mistake over and over, and worse than that, their mistakes aren't based on their understanding, it's basically random guesses.
Humans, even bad coders, still seem to have some sort of architecture in mind, even if it's spaghetti, whereas LLMs (obviously) don't think more than a few steps ahead, and never about the full scope of what they're contributing to, and on purpose too, because you want the context to be as small as possible when you work with LLMs.
With LLMs you need to tread carefully between "What does the LLM need to know?" and "Can I skip passing this to the LLM this time?", while with a human you can more or less dump everything you're sitting on onto them, let them sift through it, and they'll mostly make it out OK.
> their mistakes aren't based on their understanding, it's basically random guesses
Whilst I don't claim any true "understanding", as that is a very loaded term, that doesn't mean it's just random guesses.
Anyone using recent LLM coding agents on a regular basis would probably agree that there's something going on that fits some non-anthropomorphizing, non-sentience-assigning definition of "understanding".
As for the point about improvement - I think that's an orthogonal issue to the overall code quality. With regard to human codebases - there's plenty of scenarios that negate the improvement of individuals. We're comparing organizations with LLMs - not individuals with LLMs and that makes a significant difference.
I think the real issue might be that how “good” the code is matters less than being able to form a mental model for what the human who wrote the code was “thinking”. If written by a machine, this contract is broken and we get more confused, even if our traditional methods of evaluating the code come out equal.
That doesn’t help the developers who have high standards.
Yes. But that's not the point I'm addressing.
And where do you think the LLM learned coding from?
But anyway, let the LLM verify the code to give advice on improvements but don't let it write code unverified. That's my opinion on it anyway.
I've been sent code from vendors that didn't even compile, long before llms were a thing. Most shops that aren't primarily software have really really terrible software.
Not my observation. If you never look at the code and don't have basic guardrails in place (linters, architecture tests, some guidelines for best practices) - probably.
But as soon as you do minimal reviews and high-level corrections, applications turn out just fine.
Can there be bugs? Sure. That's the price of not reading or understanding every line. It should depend on the criticality of your software how much of these you tolerate and how much you don't (reviewing, understanding, testing everything 100% like you were used to if you had written it yourself will kill most if not all of your gained speed)
But I never got the impression of unmaintainability or unfixable bugs.
Actually the other way around: A really good cleanup pass, architectural changes, or bugfixes are seldom more than a few prompts and 2 hours away, provided your overall base is decent and you actually gave a fuck from the start.
> Can there be bugs? Sure. That's the price of not reading or understanding every line.
I've yet to come across a human developer whose output would meet this standard, despite writing every line.
In fact, having an LLM review our code is catching quite a few bugs before it reaches QA.
Indeed, though I find the distribution is different.
The humans may skip unit tests and need reminding; the AI always writes unit tests once it's in AGENTS.md or whatever, but my experience* was that 5-10% of the time the LLM's attempt at a "test" would, instead of executing the code and examining the results, open the source code as a text file and run a regex to find/exclude certain substrings.
* At the start of this year, because Anthropic and OpenAI were both offering free trials. IDK how much things have changed since then, some things change fast in this domain, other things don't.
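To make the difference concrete, here is a minimal sketch of the two kinds of "test" described above. Everything in it is hypothetical (the `mean` function, the bug, and both check functions are invented for illustration); it only shows why a text-matching check can pass on code that a behavioural check would reject.

```python
import re

# Hypothetical source under test. It contains a deliberate off-by-one bug.
BUGGY_SOURCE = '''
def mean(values):
    return sum(values) / (len(values) - 1)
'''


def regex_style_check(source: str) -> bool:
    """Fake 'test': only looks for expected substrings in the source text."""
    has_sum = re.search(r"sum\(values\)", source) is not None
    has_len = re.search(r"len\(values\)", source) is not None
    return has_sum and has_len


def behavioural_check(source: str) -> bool:
    """Real test: executes the code and compares the result to a known value."""
    namespace = {}
    exec(source, namespace)
    return namespace["mean"]([2, 4, 6]) == 4


if __name__ == "__main__":
    print("regex-style check passes:", regex_style_check(BUGGY_SOURCE))
    print("behavioural check passes:", behavioural_check(BUGGY_SOURCE))
```

The regex-style check is satisfied by the buggy source because the expected substrings are present; only the check that actually runs `mean` notices the wrong divisor.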
I’ve been piloting LLMs for the past six months non stop and we’re at the point where formally verified models generated as an intermediate step between spec and code are very good value.
Riding the exponential means you have to update priors more often.
I have seen some pre-AI over-mocked codebases where the "tests" were essentially that (but harder to read than regex would have been)
Take a look at a sufficiently old random internal repo which was not written with LLMs and compare.
My observation is that they are equally bad and hard to maintain or even more so than the new ones.
One thing I’ve noticed is that the LLM assisted ones have a lot more comments which is nice but take more time to read.
Yes, LLMs generate technical debt.
And they do it faster than any human developer.
I have a theory that LLM generated code in a highly modular style (simple data, pure functions) will be easier to “recover” by a human team when the LLM gets muddled. So Haskell, basically.
> code that gets the job done and looks ok, maybe even great, but contains small “anomalies” that compound over time
They clearly are only assistants for the moment, you can use them to do work ... but only if you could do the said work yourself alone in the first place.
I would say "only if you can review said work yourself alone", rather than "do".
I'm an experienced developer, but I don't count myself as a web dev or a python dev; I can review the web and python stuff I get out of the AI (sometimes I need to ask the AI follow-up questions so I can find official documentation for what it did), but I can't write it.
I think you could eventually do it then, it would just take you longer.
If "eventually" counts, I can say I have "run" a marathon (I have walked that distance in one session, or if you don't like that verb I can sum all the various occasions I've run and that sum almost certainly exceeded 42.2 km before I finished school).
But the difference I allude to here is more like how "book reviewer" is a different job than "book author": yes, if you can review a book, you can also write one. Eventually.
Easy fix: Code's basically free now, so just pipe your errors straight into an LLM and get instant patches. Sure, the patches themselves are broken too, but no worries! just pipe those back in again. Code's disposable now, fresh code generated on every request.
On a more serious note, I think the problem will be the inability to handle/maintain the systems once they are too big and nobody has any idea what's inside of them or what they do.
Yeah, it’s so easy to generate code that you can do a whole codebase rewrite in a day.
Is this a good idea? Probably not—in the past we would only do that when the architecture was causing serious problems since it always has tons of behaviors that will accidentally not get carried forward, some of which are load bearing and will cause bugs.
Now we can do it in an afternoon and get the same long term bug behavior.
Have you worked with enterprise apps? The ones I have used for decades are hot garbage.
Now imagine decades of LLM code. Extrapolating the rate of increase of LoC, the source code ain't gonna fit on hard drives anymore.
> Hallucination rate scores are a little tricky to interpret because they're conditional on the model not knowing the answer. That means they don't measure the probability of your encountering a hallucination in everyday use, since that also depends on the probability of the model not knowing the answer, as well as how well your distribution of tasks aligns with the distribution tested in the eval.
Do you have a cite for this?
If a human makes up some bullshit lie, I wouldn't accuse them of making it up only if they actually knew the correct answer. If you don't know, the only correct answer is I don't know. Any other answer is made up bullshit. Why is it only a hallucination if and only if the LLM contains the answer? If you make something up it's still wrong. It shouldn't matter if you could give the correct answer. You didn't, and instead invented some bullshit instead?
Follow up question, how can I apply this rule set to the next test I have to take? I'd love to be able to use "I didn't know" as the excuse for why I made something up.
edit:
> and it's not totally clear that this is the main metric that's worth tracking.
I don't know, the rate at which some model is willing to make up something feels useful. If the argument I see repeated on HN so much is that it's impossible to completely get rid of hallucinations; being able to choose a model that's less likely to invent some lie seems like a positive trait, no?
Either way, I'm happy to agree that a restrictive definition, where a lie doesn't count as a hallucination iff the model doesn't know the answer feels strictly, infinitely less useful than an exact error rate. What percentage of emitted tokens are misleading would be useful for me. Anyone know any group that's attempted to quantify the global error rate?
This isn't quite the point. When comparing two different models' hallucination rates, the denominator is different. The evaluation works more or less like this: for each question, the model has the option to answer or abstain, so there are three possible outcomes: the model answers and gets it right, the model answers and gets it wrong (hallucination), or the model abstains. The hallucination rate is (model answers wrong) / (model answers wrong or abstains). So if a model A has 50 correct answers, 20 incorrect answers, and 30 abstentions, its hallucination rate is 40%, while a model with 20 correct answers, 20 incorrect answers, and 60 abstentions has a hallucination rate of 25%, even though it hallucinated exactly the same number of times. This is why hallucination rate is incomplete as a metric: it says nothing about the accuracy rate.
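The arithmetic in that comment can be checked with a few lines. This is a minimal sketch that takes the definition exactly as stated there (wrong answers divided by wrong answers plus abstentions); the function names are made up for illustration and are not from any benchmark's code.

```python
def hallucination_rate(correct: int, incorrect: int, abstained: int) -> float:
    """Wrong answers as a share of all responses that were not correct."""
    not_correct = incorrect + abstained
    if not_correct == 0:
        return 0.0  # the model got everything right, so there is nothing to rate
    return incorrect / not_correct


def accuracy(correct: int, incorrect: int, abstained: int) -> float:
    """Correct answers as a share of all questions asked."""
    return correct / (correct + incorrect + abstained)


if __name__ == "__main__":
    models = {
        "A": dict(correct=50, incorrect=20, abstained=30),
        "B": dict(correct=20, incorrect=20, abstained=60),
    }
    for name, counts in models.items():
        print(
            name,
            f"hallucination rate = {hallucination_rate(**counts):.0%}",
            f"accuracy = {accuracy(**counts):.0%}",
        )
```

Both models produce 20 wrong answers, but the denominators differ (50 vs 80), which is why the rates differ while the accuracy gap goes unreported by this metric.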
As a human I also give wrong answers even if I know the right one. Sometimes I also give answers even when I don’t really know them.
When pushed, I then start thinking and realise my mistake. System 1 vs 2?
I realized that people from India often show this kind of behaviour in my experience. They superconfidently give you a wrong answer and walk away, or even help by making things worse and then disappear shrugging. Are you from India?
That's weird, why do you do that?
When someone asks a question, if I don't know the answer; I say I don't know.
System 1 vs 2 doesn't really matter... I won't use an LLM that's willing to make up random shit. Equally I also won't work with a human who does that. Trust and confidence a system will function correctly is an important quality, in both humans and genai
Since models just output the most probable tokens and you can never accuse them of doing anything other than making it all up, I would like to see these tests run with a prompt that attempts to mitigate hallucination and finishes with something like: "Telling me that you don't have the relevant information or that the task is impossible is extremely useful to me and a valid answer", and see how much that changes the scoring - as well as the usefulness of the answers. There are so many skills like context7 that can be tweaked to improve these results as well.
In other words, you shouldn't choose the model that hallucinates the least without detailed prompting, since a well-crafted agents.md clause should go a long way to improving output, and almost certainly the top scoring order will be different. To the point that I don't find this type of raw comparison useful beyond maybe 'make sure you test that one with more explicit prompts'.
> In other words, you shouldn't choose the model that hallucinates the least without detailed prompting
You're prompting it wrong is quickly becoming the new, you're holding it wrong.
It's wild how willing software engineers are to blame the user when the actual problem is their own defective design.
Ideally we all, as an industry, will stop accepting this as reasonable excuse for the demonstrated incompetence
It's not that you're prompting it wrong. It's that you're judging the output against a standard (human intelligence) that just isn't relevant--no matter how much we want it to be and no matter how much the fluency of the output tricks us into thinking there's a human-like mind behind it.
Now granted, if the boat salesmen were pushing hard on the idea that the boat would fly and even put little wings on the side and I bought the boat I might get really angry when I found out that it didn't fly. And I might angrily storm into the salesroom yelling about how the design is defective. But if someone pointed out 'hey, it's a boat perhaps you should stick to sailing around in it and stop getting your undies in a bundle about it not flying' the correct response is probably to take a closer look, ignore the salesmen, and cruise around the lake. LLM's are quite handy at some things and have some weird limits. Learn the limits, enjoy your time at sea.
> It's not that you're prompting it wrong. It's that you're judging the output against a standard (human intelligence) that just isn't relevant
It's not that you're holding it wrong, you're just wrong for expecting it to work the way it's described (able to one shot most problems these days).
there is a difference between a human knowingly bullshitting and being confident because he misremembers something
there is a difference in their intent, but not necessarily in the effect.
One thing I wonder about hallucinations, is that it seems on the surface that it is an easy problem for RLVR to target. Since you're already generating enormous amounts of reasoning traces which are verified by correct answers, just have "don't know" as an option as a valid answer, and on problems where none of the thousands of reasoning traces led to a correct answer, just promote the traces that led to the "don't know" answer as training data. Essentially teaching the model that "I don't know" is a valid answer.
Sam Altman himself had a blog post about this a while ago that seemed to suggest this thought, so I guess it's obvious to everyone. But if that is so I assume it's just not as easy in practice.
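The selection rule proposed above can be written down in a few lines. This is only a sketch of the idea as described in the comment, not any lab's actual RLVR pipeline; the `Trace` type, the `ABSTAIN` string, and `select_traces` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

ABSTAIN = "I don't know"  # assumed literal used for an abstaining answer


@dataclass(frozen=True)
class Trace:
    reasoning: str
    answer: str


def select_traces(traces: list[Trace], verified_answer: str) -> list[Trace]:
    """Pick which sampled traces to promote as training data for one problem.

    If any trace reached the verified answer, promote those traces.
    Otherwise promote the traces that abstained, so that abstaining is
    rewarded only on problems the model could not solve in any sample.
    """
    solved = [t for t in traces if t.answer == verified_answer]
    if solved:
        return solved
    return [t for t in traces if t.answer == ABSTAIN]


if __name__ == "__main__":
    solvable = [Trace("...", "42"), Trace("...", "41"), Trace("...", ABSTAIN)]
    unsolved = [Trace("...", "17"), Trace("...", ABSTAIN), Trace("...", "19")]
    print([t.answer for t in select_traces(solvable, "42")])
    print([t.answer for t in select_traces(unsolved, "23")])
```

Note that on a problem where the model could have answered correctly, an abstaining trace is never promoted under this rule.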
Because nearly all benchmarks measure "accuracy" by giving you a point for a correct answer, and 0 points for everything else. If you have 100 questions you are 10% certain on, answering "I don't know" to all of those leads to 0 points, answering all of them as if you are confident leads to an expected value of 10 points. So that's what most AIs are trained to do
AA-Omniscience is the only AI benchmark I know of where randomly guessing gets you a lower average score than answering all questions with "I don't know"
AA-Omniscience Index gives +100 for correct, 0 for "I don't know" and -100 for incorrect.
For your scenario the confident strategy will give an average of -80 (10% of answers at +100 and 90% at -100). Saying I don't know to all will give 0.
A lot of models have negative AA-Omniscience Index.
They also do have AA-Omniscience Accuracy and AA-Omniscience Hallucination Rate that handle "I don't knows" differently.
https://artificialanalysis.ai/evaluations/omniscience
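A quick way to compare the two incentive schemes is to score the same scenario under both. The sketch below assumes the numbers given in the comments above (100 questions, 10% chance of being right when guessing, +100 / 0 / -100 for the index); the function name and parameter names are made up for illustration and are not taken from the benchmark itself.

```python
def average_score(
    n_correct: int,
    n_wrong: int,
    n_abstain: int,
    pts_correct: float,
    pts_wrong: float,
    pts_abstain: float,
) -> float:
    """Mean points per question for a given outcome count and scoring scheme."""
    total = n_correct + n_wrong + n_abstain
    points = n_correct * pts_correct + n_wrong * pts_wrong + n_abstain * pts_abstain
    return points / total


# Scenario from the thread: 100 questions, each answered with 10% certainty.
GUESS_ALL = dict(n_correct=10, n_wrong=90, n_abstain=0)
ABSTAIN_ALL = dict(n_correct=0, n_wrong=0, n_abstain=100)

# Accuracy-only scoring: 1 point for correct, 0 for everything else.
ACCURACY = dict(pts_correct=1, pts_wrong=0, pts_abstain=0)
# Penalised scoring as described for the index: +100 / -100 / 0.
PENALISED = dict(pts_correct=100, pts_wrong=-100, pts_abstain=0)

if __name__ == "__main__":
    for label, scheme in (("accuracy", ACCURACY), ("penalised", PENALISED)):
        print(
            label,
            "guess all:", average_score(**GUESS_ALL, **scheme),
            "abstain all:", average_score(**ABSTAIN_ALL, **scheme),
        )
```

Under accuracy-only scoring, guessing beats abstaining; under the penalised scheme the ordering flips.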
It should be 1 for correct, 0 for don't know and -1 for wrong.
They are much better incentives. In real life a wrong answer is much more damaging than a don't know.
Maybe some extra buckets could be added like depending on whether the answer ought to be known. Or, quality of the justification. “I don’t know and here’s a good reason why” is much better than “idk.” Correctly identifying that something is fundamentally unknown/unknowable is probably better than a simply-correct answer, even, right?
"AA-Omniscience Index (higher is better) measures knowledge reliability and hallucination. It rewards correct answers, penalizes hallucinations, and has no penalty for refusing to answer. Scores range from -100 to 100, where 0 means as many correct as incorrect answers, and negative scores mean more incorrect than correct."
https://artificialanalysis.ai/evaluations/omniscience
See, this, to me, seems obvious, but I’m sure it’s more challenging/complex than I can imagine (I am NOT an expert on AI in any way imaginable). But there has to be a solution. Just yesterday I was asking Gemini to tell me about a certain college professor, and it gave me a list of facts about them. And it was perfect. Then, out of curiosity, I followed up with “tell me more about him!” and it spit out several more bits of information about this person that were entirely hallucinated (e.g., gave them credit for writing papers they didn’t write, said they won awards that actually someone else won). I know this is all complex and certainly beyond my limited skill set, but goodness, we’ve got to get this figured out with so many people depending on and trusting these things nowadays. It’s quite scary.
I bet most of these issues are essentially system prompt/harness issues.
If your example had "Validate any details before sharing them with the user, with multiple sources" as the system prompt, it was using a model that is strong at following system prompts precisely and had access to some basic tools, then it'd spend maybe minutes more, but the answer would have been way more accurate.
But no, Google wants "the new search results" (LLM hallucinations) to be on top, so we end up with "sounds plausible" answers instead of "Collection of evidence from reliable/semi-reliable" or similar, which sucks. We could have quality, but it's too expensive/slow, so we get slop instead, just to maximize for speed and convenience.
Errors multiply though, you might just get more plausible sounding errors than actual facts.
Like when agent 1 says X, agent 2 verifies it as Y and the original question ends up being some weird amalgamation of Z with additional ”this is really true” statements sprinkled on top.
I agree Google responses hurt more than help, but I’ve also gotten identical outcomes of 40min self-reasoning Opus threads (it’s less common obviously).
> Like when agent 1 says X, agent 2 verifies it as Y and the original question ends up being some weird amalgamation of Z with additional ”this is really true” statements sprinkled on top.
Yeah, seems what grounds agents right now is quite literally human thoughts and text, so if you're doing something like that, you really need to pass the original user prompt through the entire way, for every "child" to keep in mind the final thing, otherwise it does seem to spiral out of control.
> In real life a wrong answer is much more damaging than a don't know.
I don't know. Is it?
It should be -1, -.1, 1 because I don't know is slightly negative.
Interesting, I was about to say -1, 0.9, 1.0, because I don't know is almost as useful as the correct answer!
And also because it creates "one neat trick" where it can answer "I don't know" for many/most things and still get credit.
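The scoring triples being debated here differ mainly in how confident a model must be before answering beats abstaining. If a guess is right with probability p, answering pays p·c + (1 − p)·w in expectation and abstaining pays a, so answering wins when p > (a − w) / (c − w). A minimal sketch of that calculation, with a hypothetical function name and the triples written as (wrong, don't know, correct):

```python
from fractions import Fraction


def break_even_confidence(pts_wrong, pts_abstain, pts_correct) -> Fraction:
    """Probability of being right above which answering beats abstaining.

    Derived from p*c + (1 - p)*w > a, assuming pts_correct > pts_wrong.
    Values are converted via str() so that 0.9 becomes exactly 9/10.
    """
    w, a, c = (Fraction(str(x)) for x in (pts_wrong, pts_abstain, pts_correct))
    return (a - w) / (c - w)


SCHEMES = {
    "accuracy only (0, 0, 1)": (0, 0, 1),
    "symmetric (-1, 0, 1)": (-1, 0, 1),
    "slightly negative abstain (-1, -0.1, 1)": (-1, -0.1, 1),
    "generous abstain (-1, 0.9, 1)": (-1, 0.9, 1),
}

if __name__ == "__main__":
    for label, triple in SCHEMES.items():
        print(f"{label}: answer when p > {float(break_even_confidence(*triple)):.2f}")
```

With a generous abstain reward the threshold sits close to 1, so a model that answers "I don't know" to almost everything still scores well, which is the "one neat trick" concern.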
The main problem here is that hallucination suppression doesn’t generalise. We can penalise models for incorrect answers on a wide range of questions, but this doesn’t lead to the emergence of a coherent worldview, which, coupled with logical abilities, is the only true remedy against hallucinations. With current architectures, hallucinations will likely persist on open-domain tasks forever.
> We can penalise models for incorrect answers on a wide range of questions, but this doesn’t lead to the emergence of a coherent worldview, which, coupled with logical abilities, is the only true remedy against hallucinations
I don't think anyone is trying to add "a coherent worldview" by reducing hallucinations, not sure how that even realistically could be aim.
What people want, is for the models to stop giving confident answers that are clearly incorrect. Yes, it won't lead to "a coherent worldview", but it'll at least stop wasting people's time if the model said "You know what, what you said doesn't make sense / isn't clear, is what you mean .... ?" or even "I'm not sure" or "I don't know".
Currently, if you have the wrong starting point, ask the model to do something, they more often than not just go ahead and do that, misunderstandings or not. They seem optimized to never push back, unless you prompt for that, and most seem to favor "I'm just gonna assume X" rather than taking a step back and figuring out how to not assume. Again, unless you prompt against that behaviour/steering it into a different workflow.
Model outputs don't have a confidence score.
I don't think I claimed so either? Or maybe I misunderstand the point you're trying to make.
Even if they did, it wouldn't be of much use, because correct or not, the output was the likely output 100% of the time.
I think the trouble is in the outputs of the LLM and how it's interpreted by the tooling. The output is a distribution of probabilities of all possible next tokens. Even if the probability of every token is very low, the output gets normalized so that the sum of all probabilities is 1. So after that step, it's hard to see if the model was strongly preferring certain tokens or if you're just looking at amplified noise.
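One way to see the normalization point: assuming the usual softmax over raw scores (logits), adding the same constant to every score leaves the resulting probabilities unchanged, so the absolute level of the raw scores is not visible afterwards. The logit values below are invented for illustration.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two hypothetical sets of raw scores with the same gaps between tokens
# but very different absolute levels.
low_scores = [-9.0, -10.0, -11.0]
high_scores = [11.0, 10.0, 9.0]

if __name__ == "__main__":
    print(softmax(low_scores))
    print(softmax(high_scores))
    print("identical after normalization:", softmax(low_scores) == softmax(high_scores))
```

Only the differences between scores survive, so a distribution over tokens alone does not reveal whether every candidate started out with a uniformly low raw score.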
Training an extra "don't know" token means you have to build a moat between every other token. Between "yes" and "no", you don't have a muddled noisy area where both "yes" and "no" have relatively high probabilities, you need a new peak where "don't know" is higher. Then you just have new muddled areas between "yes" and "don't know", and "don't know" and "no". That requires even more finesse to train another answer in between.
Instead, you could check whether multiple options are about equally likely. But then you have to check if they are actually synonyms, like are the top two choices "Genève" and "Geneva", which is a good sign that the model knows the answer? Or are the top two "yes" and "no"?
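A rough sketch of that check, under the assumption that you already have next-token probabilities and some table of known synonyms: merge aliases first, then look at the gap between the top two remaining candidates. The alias table, the probabilities, and the 0.2 threshold are all made up for illustration.

```python
def merge_aliases(probs: dict[str, float], aliases: dict[str, str]) -> dict[str, float]:
    """Sum the probability mass of tokens that map to the same canonical answer."""
    merged: dict[str, float] = {}
    for token, p in probs.items():
        canonical = aliases.get(token, token)
        merged[canonical] = merged.get(canonical, 0.0) + p
    return merged


def looks_confident(
    probs: dict[str, float], aliases: dict[str, str], min_margin: float = 0.2
) -> bool:
    """True if the best answer leads the runner-up by at least min_margin."""
    ranked = sorted(merge_aliases(probs, aliases).values(), reverse=True)
    if len(ranked) < 2:
        return True
    return ranked[0] - ranked[1] >= min_margin


ALIASES = {"Genève": "Geneva"}  # hypothetical synonym table

if __name__ == "__main__":
    city = {"Genève": 0.45, "Geneva": 0.44, "Zurich": 0.11}
    yes_no = {"yes": 0.48, "no": 0.47, "maybe": 0.05}
    print("city question confident:", looks_confident(city, ALIASES))
    print("yes/no question confident:", looks_confident(yes_no, ALIASES))
```

The hard part in practice is the alias table itself: deciding which surface forms count as the same answer is the problem the comment points at, and this sketch simply assumes it is given.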
It’s not as simple. I trained an LLM before on exactly this, to scratch the itch of this question.
The task was simple, using the MS-MARCO[0] dataset which contains queries, search results, answers, I made a training set that has:
1. Questions paired with real results supporting them (mixed with some irrelevant results), and a correct answer
2. Questions paired only with irrelevant results, with the answer “No answer present”
The dataset was huge (close to 1M samples), and I trained using different techniques, from SFT (just mimicking the dataset) to DPO (good answer contrasted with a bad answer for the same user query) to GRPO (verifier that checks my annotations whether an answer was present or not)
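For readers who want to picture the two sample types, here is a minimal sketch of how such a training set could be assembled. The `Sample` structure, field names, and helper functions are hypothetical and do not reflect the actual MS-MARCO schema or the code used in the experiment described above.

```python
import random
from dataclasses import dataclass

NO_ANSWER = "No answer present"


@dataclass
class Sample:
    question: str
    passages: list[str]
    target: str


def make_answerable(question, supporting, distractors, answer, rng) -> Sample:
    """Type 1: real supporting results mixed with irrelevant ones, real answer."""
    passages = list(supporting) + list(distractors)
    rng.shuffle(passages)  # avoid leaking relevance through position
    return Sample(question, passages, answer)


def make_unanswerable(question, distractors, rng) -> Sample:
    """Type 2: only irrelevant results, target is the fixed abstention string."""
    passages = list(distractors)
    rng.shuffle(passages)
    return Sample(question, passages, NO_ANSWER)


if __name__ == "__main__":
    rng = random.Random(0)
    positive = make_answerable(
        "What is the capital of France?",
        supporting=["Paris is the capital of France."],
        distractors=["Bananas are rich in potassium."],
        answer="Paris",
        rng=rng,
    )
    negative = make_unanswerable(
        "What is the capital of France?",
        distractors=["Bananas are rich in potassium.", "The Nile flows north."],
        rng=rng,
    )
    print(positive)
    print(negative)
```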
Lo and behold, this didn’t reduce hallucination, rather made it much worse. Now the model started claiming “No answer present” even when it is, or even when the question didn’t need search results in the first place (simple stuff like what is X+Y).
Now you could argue that my training was basic compared to what frontier labs could do. Yet I think it hints at a more profound limitation. LLMs are finicky and don’t have a neat understanding of things from first principles (list of search results, check relevance of result to user query, if answers are below a certain threshold of relevance then don’t consider them to answer …).
tl;dr: not as simple as one might think, perhaps not attainable at all.
0: https://huggingface.co/datasets/microsoft/ms_marco
Thank you for sharing! Based on your experience, do you think a two-model system might fare better? For example, two models in serial where the second model is trained to "sniff out" potential hallucinations and fact check them (and possibly iterate with the first model)?
If you could write that reward function you wouldn't need an LLM, you'd just query the reward function to answer any question. You can create a benchmark and check that automatically, but you can't solve this in the general case. The model can do well on the benchmark but still give overconfident answers in areas the benchmark doesn't cover.
You can definitely tune a model to say "I don't know" more often but it will cost you performance, the model will reject some questions that it could answer meaningfully. In the degenerate case the model could collapse predicting that sequence always or almost always.
I guess so. Just to be clear, I was talking about post-training methods for reasoning models here, not pre-training. I think "model as a judge" should actually do okay as a "sentiment analysis" style reward for expressing uncertainty. So if none of the thousands of reasoning traces you generate reach the validated answer, you run a judge to rate uncertainty and put those reasoning traces back into the training pool.
But I guess my logic breaks down here a bit, because if there is such a thing as a validated answer, then the correct answer is in fact never uncertainty. The correct answer is to continue post training until the model gets it right. So perhaps the real answer is to create RLVR tasks where the valid answer is "I don't know" and nothing else, like this benchmark does. Or maybe that doesn't work either, no matter how many you create.
I feel as though there is some kind of philosophical lesson to be had from how hard hallucinations are to get rid of. Maybe, similarly to humans, successful models are often "arrogant" in a sense. Perhaps you just never solve an Erdös problem without some degree of self deception that it's possible for you to do so. In this line of thinking, greatness in humans is actually not related to humility, but just being so good that you actually get things right when you try. Expressing humility is of course something great people tend to do, but I'm referring to what happens under the hood.
If you squint a bit, that's kinda the trend with models. The useful ones are not that much less likely to hallucinate, they are just good enough that they tend to get it right. This comparison is of course probably not even remotely correct, but at least it's fun to anthropomorphize a bit.
If we had a theoretical technique to identify the true and objective reality we'd use it in the courts and laboratories. There is no such technique, but what we do have is 2 techniques that seem to work:
1) Has a certain standard of evidence been met?
2) Are the related arguments free of logical inconsistencies?
We can train the LLMs to do 2, and maybe even 1 to some extent (exactly what quality of evidence a computer can practically gather is limited). But that isn't going to get rid of hallucinations, for the same reason courts are hit-and-miss or the conclusions of studies often aren't very reliable. These techniques help, but sometimes they still get people to say things that, on close inspection, turn out to be nonsense. And those best-effort approaches are too much to expect for most questions an LLM will be handed which are informal, low stakes and don't need strong supporting evidence or logical rigour.
I think it is underestimated how many LLM-style hallucinations people themselves have. It just isn't obvious because most humans have a strategy of only repeating what the herd says after it has been socially vetted, which makes their individual eccentricities less obvious.
TLDR; I don't think it looks like an easy problem for RLVR, it looks technically unsolvable. Even making progress requires a philosophical breakthrough on the nature of truth so that the objective function can be established.
Well, I'd argue that this depends on the field you're investigating. Sometimes you have a way to identify objective reality and sometimes you don't. In mathematics the majority of the field is verifiable in this way. Coding a bit less, as it's intersubjective and the ideal methodology is subject to taste.
But even in muddy fields of reality like medicine, there are objective facts to be found. When someone comes into an ER with chest pain, you often find a true, undeniable reason for why that is happening. If their lung has collapsed, a coronary artery is clogged or the aorta is dissecting, even if you don't find that out it tends to be clear in retrospect. The area of reality that becomes muddy is when we use proxy signals to try to figure out who gets promoted to expensive/harmful examinations we can make final conclusions from, or the cases that don't fit cleanly into one bucket or the other. But very often, the gold standard truly is golden.
Of course, many realms of reality cannot be verified in this way. But I'd argue that there are quite a few that can.
> In mathematics the majority of the field is verifiable in this way.
Does mathematics count as not a hallucination though? Particularly in pure mathematics they take a certain pride coming up with wild concepts as unrooted as possible in anything relevant to human existence. The name of the game is purely about maintaining internal logical consistency - which is something an AI can do while hallucinating.
AI hallucinations in maths might be logically consistent or not be. But in that particular case it starts to get a bit iffy what we call it when someone imagines something that doesn't exist. This gets back to the thing where we can train AIs to be logically consistent, but we can't force that consistency to be grounded in any particular universe. Ie, it'll hallucinate but in a very well rationalised way - coincidentally mimicking how a number of mathematicians seem to approach life.
This is the central issue; there is a very real trade-off between facts and verifiability. Mathematics is perfectly verifiable because it is fact free. We don't have a reliable general system to verify facts. We do have reliable systems for checking arguments (logic).
But if an LLM says "I don't know" should you pay for the tokens?
Why not? It did the work. Why should you expect it to be omniscient?
We can rank them based on how much they know and people will gravitate towards those that do know more.
It's a market after all.
If it’s a market, wouldn’t the incentive be to lie about knowing and thus to keep the hallucinations?
If you had an llm that could accurately predict when a claim is uncertain it would be very popular, I think. I would pay for that kind of reliability tbh
This would break reality. There’s some underlying physical law that prevents the existence of any algorithm of truth.
> There’s some underlying physical law that prevents the existence of any algorithm of truth
Haven't heard about that law, but seems unlikely we can come up with ("discover") any sort of law that uses a concept ("truth") humans can't even agree what it means, and that's not for a lack of trying, we've been trying to figure it out for millenniums already with no end in sight.
If you accept certain axioms a priori, it’s fine. If you simply let the machine intelligence take it for granted that induction works because nature is uniform and give it some way to test its predictions, it would have all the building blocks it needs to reason out a lot of very useful information. Which as the parent comment points out, people would absolutely pay a lot of money for.
Up to the point where consumers notice and decide to stop using these models because of it.
Might be why we're already rarely seeing models output an "I don't know".
According to your logic the market will produce an LLM that consists only of 'PRINT "I don't know."'.
"I don't know" has positive value, presumably you could prompt further to learn more about where it got stuck. It also increases the value of correct answers, by improving confidence that answers are actually correct.
"Confidently incorrect" has negative value. At best, a human realizes the answer is wrong and At worst, the incorrect information makes is not identified and can cause untold damage. By having the potential to be so severely wrong, it lessens the value of correct answers because there is a lower confidence value on their output.
Depends on what your understanding of the product is.
If someone sold you a "Solved all your problems" machine, and it suddenly doesn't solve all your problems, then probably no, you shouldn't pay.
But the way I'm being sold LLMs is basically "A text generator that gives you plausible-sounding human text that sometimes hallucinates and gets things wrong, based on your input". Then, regardless of what the outcome is, I still made use of the "Input > Output" part, which is what I bought into, so I should still pay for that.
Now of course a bunch of people will say they've been sold the former, but the companies themselves seem to be selling the latter. That's my perspective as a person who doesn't follow "influencers" and whatnot though, who seem to be selling the public on the former rather than the latter.
Let's pretend I am someone who has heard people talk about ChatGPT, but has no idea what it actually is. I go to the website and am not presented with any information, just a prompt. So I ask it what it is and what it can do for me.
My ask:
> In a couple sentences, explain to me the product I'm being sold with ChatGPT. What does it do for me?
The Reply from ChatGPT:
> ChatGPT is a conversational AI that helps you think, create, learn, analyze, and get things done faster. You can use it to answer questions, draft and edit writing, summarize information, brainstorm ideas, learn new topics, write code, plan projects, and increasingly act as an assistant that can search for information, work with documents, generate images, and help complete tasks.
> In simple terms: you're buying access to an AI that turns natural language into useful work—saving time, expanding your capabilities, and giving you an always-available collaborator for both everyday tasks and specialized knowledge work.
This sounds much more like the former, a "solve all your problems" machine.... not a plausible-sounding text generation machine.
Only two weeks ago Sam Altman said their new data center "could" be where cancer gets cured[0]. It is only the people who deeply understand AI who see it as a text generator of plausible-sounding text. That isn't what the marketing department, the CEO, or the product itself seem to be saying. I'm using OpenAI as the example here, but the others don't seem much different.
[0] https://www.youtube.com/watch?v=9-tOtbDDrJA
In this hypothetical case of us being new users, you now know it's a conversational AI, so you continue asking:
> Can I trust the output you give me?
And I assume it explains what to trust VS not.
I think at the bottom you should also see something like "Any text can contain mistakes" or similar, which I know is a far cry from what some people push in the press regarding capabilities. But I still don't see the platforms themselves as lying about this, while I do see a bunch of people constantly over-hyping the possibilities.
I don't think coming at it from the perspective of a new user is that hypothetical. All current users were new users in just the last 3 years. There are still a significant number of people who have heard of it, but haven't used it, or are still very new to it.
I'm not sure why "can I trust the output you give me?" would be a logical followup to the first response it gave me, seeing as its response didn't say anything about hallucinations or mistakes. It said it could do "useful work" with all kinds of examples, including "specialized knowledge work".
The note under the text field, in gray so as not to draw the user's attention, feels more like a CYA line from the lawyers than an instruction they really want users to take to heart. That line also doesn't appear on the main home page. It only shows up after the first prompt is submitted and focus shifts to the conversation. I don't think a CYA line in gray fine print is enough to make users understand it's a plausible-sounding text generation machine instead of an answer machine. Even if I ask that point blank it gives a wordy... yes, but not really, it's being debated by philosophers... response.
The marketing materials are very much the former though. From claude.com:
> If you can dream it, Claude can help you do it. Claude can process large amounts of information, brainstorm ideas, generate text and code, help you understand subjects, coach you through difficult situations, simplify your busywork so you can focus on what matters most, and so much more.
What marketing copy have you read for LLMs that is like you mentioned?
> But the way I'm being sold LLMs, is basically "A text generator that gives your plausible-sounding human text that sometimes hallucinates and gets things wrong, based on your input"
They are selling the former to investors, while selling the latter to us.
I would be very willing to pay more! The choice between “you may get a correct answer, or you may get lied to, without a clear way to distinguish between the two” and “you may get a correct answer, or a clear indication that the answer was not found” is pretty clear. One is a much more useful tool than the other. I don’t see any real incentives for companies making LLMs to keep their AI factually unreliable. (Full disclosure: I work for one, but I’m definitely not in the rooms where such decisions would be made.)
Would you rather pay for a nonsensical explanation?
'I don't know' is the correct answer for infinitely more questions than those that can be answered.
the problem is the null answer will stop the "markov" chain.
so, thats all.
You don't have to literally send a null token. Train it to generate text that summarizes the evidence that is there and states the uncertainty of the final answer to a prompt.
Transformers are not Markovian, their whole point is arguably to be the reverse of Markovian, to efficiently make it so the new tokens are a function of all previous tokens
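A toy illustration of the distinction, in Python. Both predictors below are invented for this example and are not real language models: the first-order Markov one sees only the last token, while the other computes its output from the whole prefix, which is the property attention gives a transformer within its context window.

```python
def markov_next(prefix, table):
    """First-order Markov predictor: output depends only on the last token."""
    return table[prefix[-1]]


def full_context_next(prefix):
    """Toy stand-in for attention: output is a function of every earlier token.

    Answers "yes" if the word "not" appears an even number of times
    anywhere in the prefix, otherwise "no".
    """
    negations = sum(1 for tok in prefix if tok == "not")
    return "yes" if negations % 2 == 0 else "no"


table = {"?": "yes"}  # hypothetical transition table
a = ["is", "it", "raining", "?"]
b = ["is", "it", "not", "raining", "?"]

# Same final token, so the Markov predictor cannot tell the prompts apart.
print(markov_next(a, table), markov_next(b, table))
# The full-context predictor can, because it reads the whole prefix.
print(full_context_next(a), full_context_next(b))
```

Strictly speaking, any model with a finite context window is still Markovian if you treat the entire window as the state, so part of the disagreement here is about which definition people have in mind.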
> It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer. DeepSeek V4 Pro (1.6T params, 49B active, 44 AA Intelligence Index score) has a ludicrous 94% hallucination score on the AA-Omniscience benchmark, meaning on questions that it couldn’t figure out, it only stated that it didn’t know around 6% of the time, and the rest it confidently hallucinated an answer. GLM-5.2 scored a 28% hallucination rate, Opus 4.8 was 36%, Fable 5 was 48%, and GPT-5.5 was 86%.
Wow! I already knew from previous research shared here that hallucinations are a fundamental problem for LLMs and likely to be unfixable, just like prompt injection, but I didn't realize the hallucination rates were so bad!
Everyone has been acting like the best models only hallucinate in edge cases, but even the best performing one mentioned here - GLM-5.2 - has a hallucination rate of 28% when it doesn't "know" the answer to something.
That said, I think the title on the blog - "Bigger models are not the way" is probably more fitting and touches on what should be even bigger news. If bigger models and bigger training sets have already stopped producing proportional returns, then it seems likely we are already near the top of the S-curve. That's huge news, considering the valuation of companies like OpenAI and xAI is largely based around the (absurd) idea of ever increasing scaling from these models.
There is no concept of "knowledge" in LLM as it is on Wikipedia.
The question-tokens define the answer-tokens. That's it. The art lies in clustering the relevant weights together.
If it were that simple we’d all be talking with sql and yet this isn’t happening.
Circuits which emerge in the layers during training are much more complicated than a simple Bayesian relation.
Correct, LLMs are not ontologically capable of “knowing”. That is why I put “know” in quotes.
> There is no concept of "knowledge" in LLM as it is on Wikipedia.
There can be, you don't know if the closed source models aren't using something like DeepSeek's Engram.
The name "Engram" (n-gram) says it all - this is just another type of statistical word association, not a factual knowledge store.
While DeepSeek describe this as "knowledge lookup", what Engram is really trying to do is separate dynamic reasoning from static pattern recall, with the static patterns just being word-level n-gram statistics, not declarative facts/knowledge.
Just because 2-3 words often appear together in a sequence doesn't mean they represent a fact or truth (or falsehood) - it is just an n-gram statistical regularity.
If Engram helps reduce LLM GPU memory and FLOP requirements then that is great, but it's not a solution for Hallucination.
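The point about n-gram statistics can be shown with plain bigram counting. This is a generic sketch in Python, not a description of how DeepSeek's Engram works internally, and the three-sentence corpus is made up.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent word pairs across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


# Hypothetical corpus: the false statement is simply repeated more often.
corpus = [
    "the moon is cheese",
    "the moon is cheese",
    "the moon is rock",
]
counts = bigram_counts(corpus)

# The counts reflect frequency in the corpus, not which statement is true.
print(counts[("is", "cheese")], counts[("is", "rock")])
```

The statistic ranks "is cheese" above "is rock" only because it occurs more often, which is the sense in which co-occurrence is not a store of facts.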
Agreed on the title, my bad! But yeah, I've had some truly terrible experiences using these "frontier" models in coding agents especially, where they just fabricate facts about codebases.
> GPT-5.5 and DeepSeek V4 Pro are two of the clearest hallucination leaders, despite being absolutely huge. Because of their immense size they simply did not learn how to say “I don’t know” or recognize intricate logical and technical fallacies.
This implies that bigger models are more likely to hallucinate? That doesn't match my experience.
I think it implies they are more likely to hallucinate if they don't know the answer. So a big model will return the correct answer more often than a small one, but in the cases where it doesn't, it will be more likely to make something up instead of saying "I don't know".
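A worked example of that distinction, as a small Python sketch. The accuracy and hallucination figures for the two models are hypothetical, chosen only to show how a conditional rate differs from an overall one.

```python
def breakdown(accuracy, hallucination_rate):
    """Split all questions into correct / wrong / abstained fractions.

    accuracy: fraction of all questions answered correctly
    hallucination_rate: of the questions NOT answered correctly, the
        fraction where the model gave a wrong answer instead of abstaining
    """
    missed = 1.0 - accuracy
    wrong = missed * hallucination_rate
    abstained = missed - wrong
    return {"correct": accuracy, "wrong": wrong, "abstained": abstained}


big = breakdown(accuracy=0.80, hallucination_rate=0.90)    # hypothetical big model
small = breakdown(accuracy=0.50, hallucination_rate=0.30)  # hypothetical small model

for name, result in (("big", big), ("small", small)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```

With these invented numbers the big model is right far more often, yet it also produces slightly more wrong answers overall, because it almost never abstains on the questions it misses.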
> A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling. The limits of this paradigm were put on the world’s stage when Claude Fable 5 was restricted by the US government just three days after its release, marking the first US AI ban stemming from national security. One of the biggest models in the world was banned because a single jailbreak was too much of a risk.
Such a weird thing to start with. The legal status of Fable does not mean that it's not intelligent. If anything, the problem is the opposite, someone thinks it's too intelligent (and/or that Anthropic wouldn't share its last gen intelligent models on the terms the government demanded).
> For the non technical, this is like asking a delivery driver to drop off packages at 3 houses at the same time without ever stopping the truck.
I'm already hallucinating about how this could work and it involves catapults
Or we could simply hallucinate that the packages are there at the three houses.
Hallucinations all the way down...
Nobody said the 3 houses needed to be on separate properties. Just throw the 3 packages from the moving truck at the one address where all 3 live.
Being an LLM is easy!
In the end it's just Boltzmann brains.
https://en.wikipedia.org/wiki/Boltzmann_brain
Tell the delivery driver "Make no mistakes" and it should work I heard.
I think hallucination rates are not a matter of model size but depend on the training of the model. Models have been trained on a huge corpus of material that overwhelmingly consists of well-formed questions and well-formulated, correct answers. This is typically the case with books, where the material is highly curated by experts in the field. In a book you never see a question which admits no answer, with the book just reasoning through and explaining why and how the question has no answer. Nor will you see a good question followed by the book candidly explaining that it doesn't know the answer, because the way book material is curated, the author will simply omit any question for which they have no answer.
In addition, I think that during RLHF the labs have a bias for interesting questions that admit a solution, and under-represent the "bad" questions that admit no good answer. They probably also put less effort into RLHF on questions where the model should admit it doesn't know.
As humans we have been trained all our lives, in the real world, by being confronted with questions we can't answer right away, and we have learned to assess very quickly that we don't know or are not sure about the answer.
Another thing we have and LLMs do not is fear. We have an amygdala in our brain, separate from the logical thinking part, that can raise a fear signal so that we become much more careful about what we say. An LLM, on the other hand, has no fear organ like the amygdala and just learns to respond based on the patterns in its training corpus. It never "fears" looking bad or being fired because it gave a wrong answer, so it can merrily give perfectly wrong answers.
So hallucination rates can be improved with training, but currently the labs are not optimizing for that because there is a high-stakes race to get the most intelligent and capable model.
Alternatively, I can see creating a separate amygdala-like organ for an LLM. That organ could asynchronously fire a signal, based on the user prompt and the LLM's thinking trace, to inject a fear signal into the LLM's reasoning so that it can steer its answer towards something safer.
I'd definitely agree that it isn't directly model size, but there is the fact that a larger model in terms of parameter count needs a large amount of training data to not overfit or underfit. So I think this race to the top of "max training data size" has kind of led to unintentional overfitting, not catastrophically, but enough to trigger this perceived omniscience within the model
Skinner would say it is not so much about emotions like fear or greed, but about consequences.
Yes, that's when we are mindful: we see the fear arise in our mind, but we don't directly act on it. Instead we understand it and reason about our options and the consequences.
However the fear has to arise in the first place, to raise the alert.
I wonder if this is what a “Minimally Viable LLM” looks like. I often wonder how much of an LLM do you need before you can just shove a bigger context Window and any dynamic knowledge content to it like a PDF or markdown file to give it knowledge outside of its training data. I feel like LLMs don’t need more data they just need to be refined.
Purely anecdotal, but when OpenAI removed Codex-5.3 from the ChatGPT sub and forced me to move to GPT-5.5, the result was far worse than what I was enjoying with Codex.
And, of course, it was burning 10 times more tokens for this output.
I have the opposite experience with codex 5.3. I had to use 5.2 to design and 5.3-codex to execute, while 5.4 was better at both, and 5.5 (all used at xhigh) is even better.
Yeah they are 100% in the wrong for removing the fine tuned codex models. It makes sense why they wouldn't want to allocate so many resources towards fine tuning but still the enshittification of GPT models is real
Huh, the fine-tuned "codex" variants always seemed like "quick specific edit" prototypes that weren't meant for real use. They worked OK when you were very specific, but besides that, nowhere close to GPT5.X and the other "real" models.
Since Codex-5.3 came out it was my daily driver for everything: quick scripting, greenfield projects, new features on old projects...
Idk if it was the harness (OpenCode), my AGENT or my prompts, but I was getting exactly what I wanted, and quickly.
With GPT-5.5 it tries to play smart, takes much more time and is often stuck on basic stuff that DeepSeek solves oneshot.
> With GPT-5.5 it tries to play smart, takes much more times and is often stuck on basic stuff that DeepSeek solves oneshot.
Do you have any session logs or similar that show this? Never once, since I started using the codex TUI when it became available, have GPT models gotten stuck on something another model breezes through. I quite literally run every prompt through multiple providers, so this would be very visible very quickly for me.
I remember trying every -codex variant of the models and could never get them to be productive for tasks taking longer than 5-10 minutes, compared to GPT 5.5, which quite literally worked through the night (with the /goal feature) and actually had something valuable and useful at the end this morning that wasn't exploding in LOC and complexity. I don't think any of the -codex variants would have been able to do this at all, based on how they worked when I last used them.
DS v4 is an undertrained snapshot, which is mentioned in their model card. The full version is supposed to be released later and have multimodal input. That said, hallucination rate likely depends on the training policy and different optimization tradeoffs a lot more than on the scale.
> This is a terrible line of thought
They're basing this all on public benchmarks, which stopped being a reliable indicator of anything in the last 2-3 years. Of course it'll be filled with more terrible lines of thought.
People really need to stop placing such importance on public benchmarks. They're valuable for comparing very close models, useful to evaluate if quantization and similar have negative impact, but you're not gonna be able to tell if one model is better than the other based on one scoring a few percentage points higher than a completely different model.
My anecdotal experience differs (though I maintain that LLM evaluations are highly subjective and benchmarks are just as useful for LLMs as they are for dating website users).
GLM 5.2 tends to stray way more than 5.1. It also subtly hallucinates things: it morphs requirements and draws unfounded conclusions. This is not something I have experienced in any model I've seen so far.
In coding it's especially annoying because it steers the whole request. E.g. I give the instruction "make me a Rust-WASM-Canvas app" and GLM 5.2 goes like "Oh, the user surely doesn't mean that. I'd better build a Dioxus app instead".
GLM 5.2 is great but it heavily deteriorates once the context window gets past 200k tokens.
I've had more success with creating a plan first and then implementing it in (short-lived) sub-agents.
Ironically, good software architecture patterns (small functions, single responsibility) heavily impact the performance of these models as well. They do surprisingly well in well-architected codebases.
They do very poorly in anything that's a mess where Opus and GPT 5.5 still get reasonable performance.
Yeah the benchmark for sure isn't perfect and without super rigid prompting it is far too easy for it to get off course. 28% hallucination rate isn't nothing either
if you're benchmaxxing then maybe bigger doesnt always mean better, but for general intelligence and big model smell, that couldn't be further from the truth
the oss models are impressive but it's pretty clear how quickly they fall off when you try to use them outside of a narrow set of problems they benchmarked well on when compared to opus/5.5
> Bigger is not better
The article uses the example of GLM being smaller than DeepSeek, yet better on hallucinations as "smaller can be good too"
But the GLM family itself is scaling up fast: GLM-5.x family is 754B, double the previous generation of GLM-4.x
> comes within just 4 points of GPT-5.5 and 9 points of Fable 5
9 percentage points IS a big difference
If we're hand-waving away a 9-point gap between an open source model from a Chinese lab, which you can use in nearly unlimited amounts for <100/mo, and the premier American frontier model, which is unavailable and was expensive when it was available, we've already lost.
The more I have been using 5.2 the more I have been impressed with it. And I’ve just been using the usually neutered ollama version.
The fact that a huge amount of parameters may lead to worse hallucinations is something I didn't think of. Would this somewhat imply that DeepSeek V4 flash should be less prone to these issues?
Surprisingly not! It is the biggest hallucinator on the AA Omniscience Index, just 2pp away from V4 Pro. I think this is partially due to the fact that Flash was trained on >32T tokens just like Pro despite being almost 10x smaller - it seems somewhat likely it was overfit.
I think we need better classification and taxonomy of erroneous LLM behaviors than the catch-all term "hallucinate".
Please don't editorialize titles unless the original title is misleading.
> meaning on questions that it couldn’t figure out, it only stated that it didn’t know around 6% of the time, and the rest it confidently hallucinated an answer.
From how they measure it, a model that simply answers "I don't know." to any prompt would be the one that hallucinates the least. So it's not surprising at all that a smaller model can perform better.
The fact that a huge amount of parameters may lead to worse hallucinations is something I didn't think of. Would this somewhat imply that DeepSeek V4 flash should be less prone to these issues?
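The degenerate case is easy to check in a few lines of Python. This sketch assumes the metric works the way the quoted passage describes it (wrong answers divided by wrong answers plus abstentions); the actual AA-Omniscience scoring may differ, and both result lists are made up.

```python
def hallucination_rate(results):
    """results: list of 'correct', 'wrong', or 'abstain', one per question.

    Returns wrong / (wrong + abstain), i.e. how often the model answers
    anyway on the questions it did not get right.
    """
    wrong = results.count("wrong")
    abstain = results.count("abstain")
    if wrong + abstain == 0:
        return 0.0  # nothing was missed, so nothing was hallucinated
    return wrong / (wrong + abstain)


def accuracy(results):
    """Fraction of all questions answered correctly."""
    return results.count("correct") / len(results)


always_idk = ["abstain"] * 100                                    # hypothetical
strong_bluffer = ["correct"] * 80 + ["wrong"] * 19 + ["abstain"]  # hypothetical

for name, res in (("always_idk", always_idk), ("strong_bluffer", strong_bluffer)):
    print(name, accuracy(res), hallucination_rate(res))
```

Under this assumed definition the always-"I don't know" model gets a perfect hallucination score with zero accuracy, which suggests the number is only informative when read next to an accuracy score.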
small models cannot encode so many facts, they will hallucinate more out-of-box
a key method to help with hallucinations is to provide good sources when asking questions (context engineering / knowledge base)
GLM 5.2 is really impressive at design as well. Overall loving it.
It's fine if it hallucinates, as long as it sounds overconfident
> One of the biggest models in the world was banned because a single jailbreak was too much of a risk.
We really don't know what the actual reason is given the politics at play. I would bet more on the Trump administration looking for any excuse to punish Anthropic
>GPT-5.5 and DeepSeek V4 Pro are two of the clearest hallucination leaders, despite being absolutely huge. Because of their immense size they simply did not learn how to say “I don’t know” or recognize intricate logical and technical fallacies. While it is true that a multi-trillion parameter model will always beat a lightweight consumer model on paper (today at least), the commoditization of these huge models is blurring the line between benchmark performance and actual real-world truthfulness and accuracy.
What about using two models, with a smaller model used for this kind of negative reasoning?
Now you need a third model to decide if the two other models disagree
hallucination is good for tasks that have an external oracle like computer programming
Could you explain what you mean? That feels like a waste of processing to me. Yes, the model will correct itself once it eventually runs a compiler/linter. But that's still wasted time and compute.
Calling llm slop "hallucinating" is so counter-productive imo. After all, LLMs are just a variant of markov chains and as such this technology isn't able to discern falsehoods from truths. It's like trying to use a barometer to tell the time.
You are also just a variant of markov chains wired in your brain. So what are you complaining about?
Well the difference here is that you're overly simplifying complex biology and many other factors whereas llms are in fact actually simple mathematical models. As always, the devil lies in the details. Dismissing intricacies is a useful tool for daydreamers, not so much for engineers.
And often it’s not perfect either. Just because one is true it doesn’t dismiss the other
Why is everyone expecting LLMs to be like the Star Trek computer? I wonder if anyone's ever measured what the hallucination rate of a human is.
Because AI company executives and devoted vibecoders constantly make egregious claims like "programming is fully solved" and even straight up "hallucinations don't exist on frontier models"
We don't have to listen to these people and can form our own perspectives. Following bad leaders is something to avoid
I agree, but I was responding to the question of why people expect LLMs to be like the star trek computer, and the answer is "because people making and promoting those LLMs claim they are like that"
It is unclear if GP is referring to the global we or the HN we. I leaned towards the latter and injected our knowledge and understandings into the basis for my comment. HN recognizes what's going on
> HN recognizes what's going on
Maybe the pre-2024 users do, but I've seen plenty of those exact "frontier models never hallucinate" comments on HN as well
Because this is how LinkedIn “specialists” promote LLMs. The same specialists were shouting about crypto a few years ago, then about NFTs, and now about how coding, architecture, accounting, law, medicine and basically every white collar job is solved and you just need enough money to pay for Opus/GPT.
It’s not a lie if everyone collectively believes it
Yeah, it has been looked at, e.g. in [0]. They separate that from lying, but I think for the LLM context it should be included. To me the difference is that humans do not bullshit at the same rate, and I can find out over time who tends to bullshit more and exclude that person's info from my pool.
> Why is everyone expecting LLMs to be like the Star Trek computer?
Because they are often marketed as magic AIs, not as mere language models.
[0] https://bpspsychub.onlinelibrary.wiley.com/doi/10.1111/bjso....
Marketing, essentially
I would be so curious to find a comprehensive benchmark on this, humans do have an unfortunate ahem Dunning-Kruger effect ahem tendency to do this
loving glm 5.2 personally
This is where I asked GPT 5.5
"they say u hallucinate 3x more than GLM 5.2, whats your comeback to this? do i need to dump u? $article"