3.5 to 4 was the biggest leap. It went from being a party trick to legitimately useful sometimes. It did hallucinate a lot, but I was still able to get some use out of it. I wouldn't count on it for most things, however. It could answer simple questions and mostly get them right, but never one or two levels deep.
I clearly remember 4o was also a decent leap - the accuracy increased substantially. It could answer niche questions without much hallucination. I could essentially replace it with Google for basic to slightly complex fact checking.
* 4o was the first time I actually considered paying for this tool. The $20 price was finally worth it.
o1 models were also a big leap over 4o (I realise I have been saying "big leap" too many times, but it is true). The accuracy increased again and I got even more confident using it for niche topics. I would have to verify the results much less often. Oh, and coding capabilities dramatically improved here in the thinking model. o1 essentially invented one-shotting: slightly non-trivial apps could be built from a single prompt for the first time.
I have a theory about why it's so easy to underestimate long-term progress and overestimate short-term progress.
Before a technology hits a threshold of "becoming useful", it may have a long history of progress behind it. But that progress is only visible and felt to researchers. In practical terms, there is no progress being made as long as the thing is going from not-useful to still not-useful.
So then it goes from not-useful to useful-but-bad, and that feels like instantaneous progress. Then as more applications cross the threshold, and as they go from useful-but-bad to useful-but-OK, progress all feels very fast. Even if it's the same speed as before.
So we overestimate short term progress because we overestimate how fast things are moving when they cross these thresholds. But then as fewer applications cross the threshold, and as things go from OK-to-decent instead of bad-to-OK, that progress feels a bit slowed. And again, it might not be any different in reality, but that's how it feels. So then we underestimate long-term progress because we've extrapolated a slowdown that might not really exist.
I think it's also why we see a divide where there's lots of people here who are way overhyped on this stuff, and also lots of people here who think it's all totally useless.
> why it's so easy to underestimate long-term progress and overestimate short-term progress
I dunno, I think that's mostly post-hoc rationalization. There are equally many cases where long-term progress has been overestimated after some early breakthroughs: think space travel after the moon landing, supersonic flight after the concorde, fusion energy after the H-bomb, and AI after the ENIAC. Turing himself guesstimated that human-level AI would arrive in the year 2000. The only constant is that the further into the future you go, the harder it is to predict.
With the exception of h-bomb/fusion and ENIAC/AI, I think all of those examples reflect a change in priority and investment more than anything. There was a trajectory of high investment / rapid progress, then market and social and political drivers changed and space travel / supersonic flight just became less important.
That's the conceit for the tv show For All Mankind - what if the space race didn't end? But I don't buy it, IMO the space race ended for material reasons rather than political. Space is just too hard and there is not much of value "out there". But regardless, it's a futile excuse, markets and politics should be part of any serious prognostication.
I think it was a combination of the two. The Apollo program was never popular. It took up an enormous portion of the federal budget, which the Republicans argued was fiscally unwise and the Democrats argued that the money should have been used to fund domestic social programs. In 1962, the New York Times noted that the projected Apollo program budget could have instead been used to create over 100 universities of a similar size to Harvard, build millions of homes, replace hundreds of worn-out schools, build hundreds of hospitals, and fund disease research. The Apollo program's popularity peaked at 53% just after the moon landing, and by April 1970 it was back down to 40%. It wasn't until the mid-80s that the majority of Americans thought that the Apollo program was worth it. Because of all this, I think it's inevitable that the Apollo program would wind down once it had achieved its goal of national prestige.
I think the space race ended because we got all the benefit available, which wasn’t really in space anyway, it was the ancillary technical developments like computers, navigation, simulation, incredible tolerances in machining, material science, etc.
We’re seeing a resurgence in space because there is actually value in space itself, in a way that scales beyond just telecom satellites. Suddenly there are good reasons to want to launch 500 times a year.
There was just a 50-year discontinuity between the two phases.
> I think the space race ended because we got all the benefit available
We did get all the things that you listed but you missed the main reason it was started: military superiority. All of the other benefits came into existence in service of this goal.
I think that for a lot of examples, the differentiating factor is infrastructure rather than science.
The current wave of AI needed fast, efficient computing power in massive data centres powered by a large electricity grid. The textiles industry in England needed coal mining, international shipping, tree trunks from the Baltic region, cordage from Manilla, and enclosure plus the associated legal change plus a bunch of displaced and desperate peasantry. Mobile phones took portable radio transmitters, miniaturised electronics, free space on the spectrum, population density high enough to make a network of towers economically viable, the internet backbone and power grid to connect those towers to, and economies of scale provided by a global shipping industry.
Long term progress seems to often be a dance where a boom in infrastructure unlocks new scientific inquiry, then science progresses to the point where it enables new infrastructure, then the growth of that new infrastructure unlocks new science, and repeat. There's also lag time based on bringing new researchers into a field and throwing greater funding into more labs, where the infrastructure is R&D itself.
There is also an adoption curve. The people that grew up without it won't use it as much as children that grew up with it and know how to use it. My sister is an admin in a private school (not in the USA) and the owner of the school is someone willing to adopt new tech very quickly, so he got ChatGPT subscriptions for all the school admins. At the time my sister used to complain a lot about being overworked and having to bring work home every day.
Two years later my sister uses it for almost everything, and despite her duties increasing she says she gets a lot more done and rarely has to bring work home. In the past they had an English major specifically to go over all correspondence and make sure there were no grammatical or language mistakes; that person was assigned a different role as she was no longer needed. I think as newer generations used to using LLMs start getting into the workforce and into higher roles, the real effect of LLMs will be felt more broadly. Currently, apart from early adopters, the number of people that use LLMs for all the things they can be used for is still not that high.
GPT-3 is when the masses started to get exposed to this tech; it felt like a revolution.
GPT-3.5 felt like things were improving super fast and created the feeling that the near future would be unbelievable.
By the GPT-4/o series, it felt like things had improved, but users weren't as thrilled as with the leap to 3.5.
You can call that bias, but the version 5 improvements clearly display an even greater slowdown, and that's two long years since GPT-4.
For context:
- gpt 3 got out in 2020
- gpt 3.5 in 2022
- gpt 4 in 2023
- gpt 4o and company, 2024
After 3.5 things slowed down, in terms of impact at least. Larger context window, multi-modality, mixture of experts, and more efficiency: all great, significant features, but they all pale compared to the impact made by RLHF already 4 years ago.
Your threshold theory is basically Amara's Law with better psychological scaffolding. Roy Amara nailed the what ("we tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run") [1] but you're articulating the why better than most academic treatments. The invisible-to-researchers phase followed by the sudden usefulness cascade is exactly how these transitions feel from the inside.
This reminds me of the CPU wars circa 2003-2005. Intel spent years squeezing marginal gains out of Pentium 4's NetBurst architecture, each increment more desperate than the last. From 2003 to 2005, Intel shifted development away from NetBurst to focus on the cooler-running Pentium M microarchitecture [2]. The whole industry was convinced we'd hit a fundamental wall. Then boom, Intel released dual-core processors under the Pentium D brand in May 2005 [2] and suddenly we're living in a different computational universe.
But the multi-core transition wasn't sudden at all. IBM shipped the POWER4 in 2001, the first non-embedded microprocessor with two cores on a single die [3]. Sun had been preaching parallelism since the 90s. It was only "sudden" to those of us who weren't paying attention to the right signals.
Which brings us to the $7 trillion question: where exactly are we on the transformer S-curve? Are we approaching what Richard Foster calls the "performance plateau" in "Innovation: The Attacker's Advantage" [4], where each new model delivers diminishing returns? Or are we still in that deceptive middle phase where progress feels linear but is actually exponential?
The pattern-matching pessimist in me sees all the classic late-stage S-curve symptoms. The shift from breakthrough capabilities to benchmark gaming. The pivot from "holy shit it can write poetry" to "GPT-4.5-turbo-ultra is 3% better on MMLU." The telltale sign of technological maturity: when the marketing department works harder than the R&D team.
But the timeline compression with AI is unprecedented. What took CPUs 30 years to cycle through, transformers have done in 5. Maybe software cycles are inherently faster than hardware. Or maybe we've just gotten better at S-curve jumping (OpenAI and Anthropic aren't waiting for the current curve to flatten before exploring the next paradigm).
As for whether capital can override S-curve dynamics... Christ, one can dream. IBM torched approximately $5 billion on Watson Health acquisitions alone (Truven, Phytel, Explorys, Merge) [5]. Google poured resources into Google+ before shutting it down in April 2019 due to low usage and security issues [6]. The sailing ship effect (coined by W.H. Ward in 1967, where new technology accelerates innovation in incumbent technology) [7] is real, but you can't venture-capital your way past physics.
I think we can predict that all this capital pouring into AI might actually accelerate S-curve maturation rather than extend it. All that GPU capacity, all those researchers, all that parallel experimentation? We're speedrunning the entire innovation cycle, which means we might hit the plateau faster too.
You're spot on about the perception divide imo. The overhyped folks are still living in 2022's "holy shit ChatGPT" moment, while the skeptics have fast-forwarded to 2025's "is that all there is?" Both groups are right, just operating on different timescales. It's Schrödinger's S-curve, where things feel simultaneously revolutionary and disappointing, depending on which part of the elephant you're touching.
The real question I have isn't whether we're approaching the limits of the current S-curve (we probably are), but whether there's another curve waiting in the wings. I'm not a researcher in this space, nor do I follow the AI research beat closely enough to weigh in, but hopefully someone in the thread can? With CPUs, we knew dual-core was coming because the single-core wall was obvious. With transformers, the next paradigm is anyone's guess. And that uncertainty, more than any technical limitation, might be what makes this moment feel so damn weird.
One thing I think is weird in the debate is that people seem to be equating LLMs with CPUs, this whole category of devices that process and calculate and can have infinite architecture and innovation. But what if LLMs are more like a specific implementation, like DSPs: sure, lots of interesting ways to make things sound better, but it's never going to fundamentally revolutionize computing as a whole.
I think LLMs are more like the invention of high level programming languages when all we had before was assembly. Computers will be programmable and operable in “natural language”—for all of its imprecision and mushiness.
All the replies are spectacularly wrong, and biased by hindsight. GPT-1 to GPT-2 is where we went from "yes, I've seen Markov chains before, what about them?" to "holy shit this is actually kind of understanding what I'm saying!"
Before GPT-2, we had plain old machine learning. After GPT-2, we had "I never thought I would see this in my lifetime or the next two".
I'd love to know more about how OpenAI (or Alec Radford et al.) even decided GPT-1 was worth investing more into. At a glance the output is barely distinguishable from Markov chains. If in 2018 you told me that scaling the algorithm up 100-1000x would lead to computers talking to people/coding/reasoning/beating the IMO I'd tell you to take your meds.
GPT-1 wasn't used as a zero-shot text generator; that wasn't why it was impressive. The way GPT-1 was used was as a base model to be fine-tuned on downstream tasks. It was the first case of a (fine-tuned) base Transformer model just trivially blowing everything else out of the water. Before this, people were coming up with bespoke systems for different tasks (a simple example is that for SQuAD, a passage-question-answering task, people would have one LSTM to read the passage and another LSTM to read the question, because of course those are different sub-tasks with different requirements and should have different sub-models). Once GPT-1 came out, you just dumped all the text into the context, YOLO fine-tuned it, and trivially got state of the art on the task. On EVERY NLP task.
Overnight, GPT-1 single-handedly upset the whole field. It was somewhat overshadowed by the BERT and T5 models that came out shortly after, which tended to perform even better in the pretrain-and-finetune format. Nevertheless, the success of GPT-1 definitely already warranted scaling up the approach.
A better question is how OpenAI decided to scale GPT-2 to GPT-3. It was an awkward in-between model. It generated better text for sure, but the zero-shot performance reported in the paper, while neat, was not great at all. On the flip side, its fine-tuned task performance paled compared to much smaller encoder-only Transformers. (The answer is: scaling laws allowed for predictable increases in performance.)
> Transformer model just trivially blowing everything else out of the water
no, this is the winners rewriting history. Transformer style encoders are now applied to lots and lots of disciplines but they do not "trivially" do anything. The hype re-telling is obscuring the facts of history. Specifically in human language text translation, "Attention is All You Need" Transformers did "blow others out of the water" yes, for that application.
>a (fine-tuned) base Transformer model just trivially blowing everything else out of the water
"Attention is All You Need" was a Transformer model trained specifically for translation, blowing all other translation models out of the water. It was not fine-tuned for tasks other than what the model was trained from scratch for.
GPT-1/BERT were significant because they showed that you can pretrain one base model and use it for "everything".
There's a performance plateau with training time and number of parameters, and then once you get over "the hump" the error rate starts going down again almost linearly. GPT existed before OpenAI but it was theorized that the plateau was a dead end. The sell to VCs in the early GPT-3 era was "with enough compute, enough time, and enough parameters... it'll probably just start thinking and then we have AGI". Sometime around the o3 era they realized they'd hit a wall and performance actually started to decrease as they added more parameters and time. But yeah, basically at the time they needed money for more compute, parameters, and time. I would have loved to have been a fly on the wall in those "AGI" pitches. Don't forget Microsoft's agreement with OpenAI specifically concludes with the invention of AGI. At the time, getting over the hump, it really did look like we were gonna do AGI in a few months.
I'm really looking forward to "the social network" treatment movie about OpenAI whenever that happens
I don't have a source for this (there's probably no sources from anything back then) but anecdotally, someone at an AI/ML talk said they just added more data and quality went up. Doubling the data doubled the quality. With other breakthroughs, people saw diminishing gains. It's sort of why Sam back then tweeted that he expected the amount of intelligence to double every N years.
I have the feeling they kept on this until GPT-4o (which was a different kind of data).
The input size to output quality mapping is not linear. This is why we are in the regime of "build nuclear power plants to power datacenters". Fixed size improvements in loss require exponential increases in parameters/compute/data.
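To make the shape of that claim concrete, here is a tiny sketch of the usual power-law picture (the constants below are purely illustrative, not fitted values from any particular paper): loss falls roughly as a power law in parameter count, so each equal-sized drop in loss needs a multiplicative jump in parameters/compute.

    # Toy scaling curve, illustrative constants only: L(N) = E + A / N^alpha.
    # The point: each 10x in parameters buys a smaller and smaller loss reduction,
    # so fixed-size quality gains keep getting exponentially more expensive.
    def loss(n_params: float, irreducible: float = 1.7, a: float = 400.0, alpha: float = 0.33) -> float:
        return irreducible + a / (n_params ** alpha)

    for n in (1e9, 1e10, 1e11, 1e12):
        print(f"{n:.0e} params -> loss {loss(n):.3f}")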
Most of the reason we are re-commissioning a nuclear power plant is demand for quantity, not quality. If demand for compute had scaled this fast in the 1970’s, the sudden need for billions of CPUs would not have disproven Moore’s law.
It is also true that mere doubling of training data quantity does not double output quality, but that’s orthogonal to power demand at inference time. Even if output quality doubled in that case, it would just mean that much more demand and therefore power needs.
Transformers can train models with much larger parameter counts compared to other model architectures (with the same amount of compute and time), so they have an evident advantage in terms of being able to scale. Whether scaling the models up to multi-billion parameters would eventually pay off was still a bet, but it wasn't a wild bet out of nowhere.
Also, slightly tangentially: people will tell me it was just new and novel and that's why we were impressed, but I almost think things went downhill after ChatGPT 3. I felt like 2.5 (or whatever they called it) was able to give better insights from the model weights itself. The moment tool use became a thing and we started doing RAG and memory and search engine tool use, it actually got worse.
I am also pretty sure we are lobotomizing the things that would feel closer to critical thinking by training it to be sensitive to the taboo of the day. I suspect earlier ones were less broken due to that.
How would it distinguish and decide between knowing something from training and needing to use a tool to synthesize a response anyway?
What you're saying isn't necessarily mutually exclusive to what gp said.
GPT-2 was the most impressive leap in terms of whatever LLMs pass off as cognitive abilities, but GPT 3.5 to 4 was actually the point at which it became a useful tool (I'm assuming to programmers in particular).
I disagree. Some things are hard to Google, because you can't frame the question right. For example, you know the context but can only explain poorly what you are after. Googling will take you nowhere; LLMs will give you the right answer 95% of the time.
Once you get an answer, it is easy enough to verify it.
I agree. Since I'm recently retired and no longer code much, I don't have much need for LLMs, but refining a complex, niche web search is the one thing where they're uniquely useful to me. It's usually when targeting the specific topic involves several keywords which have multiple plain-English meanings that return a flood of erroneous results. Because LLMs abstract keywords to tokens based on underlying meaning, if you specify the domain in the prompt it'll usually select the relevant meanings of multi-meaning terms - which isn't possible in general-purpose web search engines. So it helps narrow down closer to the specific needle I want in the haystack.
As other posters said, relying on LLMs for factual answers to challenging questions is error prone. I just want the LLM to give me the links and I'll then assess veracity like a normal web search. I think a web search interface that allowed disambiguating multi-meaning keywords might be even better.
I’ll give you another use: LLMs are really good at unearthing the “unknown unknowns.” If I’m learning a new topic (coding or not) summarizing my own knowledge to an LLM and then asking “what important things am I missing” almost always turns up something I hadn’t considered.
You’ll still want to fact check it, and there’s no guarantee it’s comprehensive, but I can’t think of another tool that provides anything close without hours of research.
If you’re looking for a possibly correct answer to an obscure question, that’s more like fact finding. Verifying it afterward is the “fact checking” step of that process.
A good part of that can probably be attributed to how terrible Google has gotten over the years, though. 15 years ago it was fairly common for me to know something exists, be able to type the right combination of very specific keywords into Google, and get the exact result I was looking for.
In 2025 Google is trying very hard to serve the most profitable results instead, so it'll latch onto a random keyword, completely disregard the rest, and serve me whatever ad-infested garbage it thinks is close enough to look relevant for the query.
It isn't exactly hard to beat that - just bring back the 2010 Google algorithm. It's only a matter of time before LLMs will go down the same deliberate enshittification path.
> For example, you know the context but can only explain poorly what you are after. Googling will take you nowhere; LLMs will give you the right answer 95% of the time.
This works nicely when the LLM has a large knowledgebase to draw upon (formal terms for what you're trying to find, which you might not know) or the ability to generate good search queries and summarize results quickly - with an actual search engine in the loop.
Most large LLM providers have this, even something like OpenWebUI can have search engines integrated (though I will admit that smaller models kinda struggle, couldn't get much useful stuff out of DuckDuckGo backed searches, nor Brave AI searches, might have been an obscure topic).
> Some things are hard to Google, because you can't frame the question right.
I will say LLMs are great for taking an ambiguous query and figuring out how to word it so you can fact check with secondary sources. Also tip-of-my-tongue style queries.
It's not the LLM alone though, it's "LLM with web search", and as such 4o isn't really a leap at all there (IIRC Perplexity was using an early Llama version and was already very good, long before OpenAI added web search to ChatGPT).
Most of the value I got from Google was just becoming aware that something exists. LLMs do far better in this regard. Once I know something exists, it's usually easy enough to use traditional search to find official documentation or a more reputable source.
Modern ChatGPT will (typically on its own; always if you instruct it to) provide inline links to back up its answers. You can click on those if it seems dubious or if it's important, or trust it if it seems reasonably true and/or doesn't matter much.
The fact that it provides those relevant links is what allows it to replace Google for a lot of purposes.
It does citations (Grok and Claude etc do too) but I've found when I read the source on some stuff (GitHub discussions and so on) it sometimes actually has nothing to do with what the LLM said. I've actually wasted a lot of time trying to find the actual spot in a threaded conversation where the example was supposedly stated.
Same experience with Google search AI. The links frequently don’t support the assertions, they’ll just say something that might show up in a google search for the assertion.
For example if I’m asking about whether a feature exists in some library, the AI says yes it does and links to a forum where someone is asking the same question I did, but no one answered (this has happened multiple times).
It is funny, Perplexity seems to work much better in this use case for me. When I want some sort of "conclusive answer", I use Gemini Pro (just what I have available). It is good with coding, formulating thoughts, rewriting text, and so on.
But when I want to actually search for content on the web for, say, product research or opinions on a topic, Perplexity is so much better than either Gemini or Google search AI. It lists reference links for each block of assertions that are EASILY clicked on (unlike Gemini or search AI, where the references are just harder to click on for some reason, not the least of which is that they OPEN IN THE SAME TAB, whereas Perplexity always opens a new tab). This is often a Reddit-specific search, as I want people's opinions on something.
Perplexity's UI for search specifically is the one thing it does just so much better than Google's offering, and it's the main thing going for it. I think there is some irony there.
Full disclosure, I don't use Anthropic or OpenAI, so this may not be the case for those products.
The 404 links are truly bizarre. Nearly every link to github.com seems to be 404. That seems like something that should be trivial for a tool to verify.
> The 404 links are truly bizarre. Nearly every link to github.com seems to be 404. That seems like something that should be trivial for a tool to verify.
Same issue with Gemini. Intuitively I'd also assume it's trivial to fix but perhaps there's more going on than we think. Perhaps validating every part of a response is a big overhead both financially and might even throw off the model and make it less accurate in other ways.
As you identified, not paying for it is a big part of the issue.
Running these things is expensive, and they're just not serving the same experience to non-paying users.
One could argue this is a bad idea on their part, letting people get a bad taste of an inferior product. And I wouldn't disagree, but I don't know what a sustainable alternative approach is.
It might have been the subject I was researching being insanely niche. I was using it to help me fix an arcade CRT monitor from the 80’s that wasn’t found in many cabinets that made it to the USA. It would spit out numbers that weren’t on the schematic, so I asked for context.
This was true before it could use search. Now the worst use case is life advice, because it will contradict itself a hundred times over while sounding confident each time on life-altering decisions.
The most useful feature of LLMs is giving sources (with URL preferably). It can cut through a lot of SEO crap, and you still get to factcheck just like with a Google search.
I like using LLMs and I have found they are incredibly useful writing and reviewing code at work.
However, when I want sources for things, I often find they link to pages that don't fully (or at all) back up the claims made. Sometimes other websites do, but the sources given to me by the LLM often don't. They might be about the same topic that I'm discussing, but they don't seem to always validate the claims.
If they could crack that problem it would be a major major win for me.
It would be difficult to do with a raw model, but a two-step method in a chat interface would work - first the model suggests the URLs, tool call to fetch them and return the actual text of the pages, then the response can be based on that.
I prototyped this a couple months ago using OpenAI APIs with structured output.
I had it consume a "deep thought" style output (where it provides inline citations with claims), and then convert that to a series of assertions and a pointer to a link that supposedly supports the assertion. I also split out a global "context" (the original meaning) paragraph to provide anything that would help the next agents understand what they're verifying.
Then I fanned this out to separate (LLM) contexts and each agent verified only one assertion::source pair, with only those things + the global context and some instructions I tuned via testing. It returned a yes/no/it's complicated for each one.
Then I collated all these back in and enriched the original report with challenges from the non-yes agent responses.
That's as far as I took it. It only took a couple hours to build and it seemed to work pretty well.
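For anyone curious, the shape of it was roughly the following (a minimal sketch, not my actual code; ask_llm and fetch_page are hypothetical stand-ins for whatever LLM client and HTTP fetcher you use, since the point is the extract / fan-out / collate structure rather than any specific API):

    import concurrent.futures
    import json

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM client here")

    def fetch_page(url: str) -> str:
        raise NotImplementedError("plug in your HTTP fetcher here")

    def extract_assertions(report: str) -> dict:
        # Step 1: turn the cited report into a global context paragraph
        # plus a list of {claim, url} pairs, one per inline citation.
        prompt = ("Return JSON with 'context' (one paragraph) and 'assertions', "
                  "a list of {\"claim\": ..., \"url\": ...} pairs, one per citation.\n\n" + report)
        return json.loads(ask_llm(prompt))

    def verify_one(context: str, claim: str, url: str) -> str:
        # Step 2: each verifier sees only one claim, its fetched source text,
        # and the shared context, and answers yes / no / it's complicated.
        page = fetch_page(url)
        return ask_llm(f"Context: {context}\n\nClaim: {claim}\n\nSource:\n{page[:8000]}\n\n"
                       "Does the source support the claim? Answer yes, no, or it's complicated, "
                       "with one sentence of justification.")

    def verify_report(report: str) -> list[tuple[str, str]]:
        parsed = extract_assertions(report)
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = {pool.submit(verify_one, parsed["context"], a["claim"], a["url"]): a["claim"]
                       for a in parsed["assertions"]}
            # Step 3: collate verdicts so non-"yes" answers can be folded back
            # into the original report as challenges.
            return [(futures[f], f.result()) for f in concurrent.futures.as_completed(futures)]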
When I have a question, I don't usually "ask" that question and expect an answer. I figure out the answer. I certainly don't ask the question to a random human.
You ask yourself... For most people, that means a closer-to-average reply, from yourself, when you try to figure it out.
There is a working paper from McKinnon Consulting in Canada that states directly that their definition of "General AI" is when the machine can match or exceed fifty percent of humans who are likely to be employed for a certain kind of job. It implies that low-education humans are the test for doing many routine jobs, and if the machine can beat 50% (or more) of them with some consistency, that is it.
By definition the average answer will be average, that's kind of a tautology. The point is that figuring things out is an essential intellectual skill. Figuring things out will make you smarter. Having a machine figure things out for you will make you dumber.
By the way, doing a better job than the average human is NOT a sign of intelligence. Through history we have invented plenty of machines that are better at certain tasks than us. None of them are intelligent.
It covers 99% of my use cases. And it is googling behind the scenes in ways I would never think to query and far faster.
When I need to cite a court case, well the truth is I'll still use GPT or a similar LLM, but I'll scrutinize it more and at the bare minimum make sure the case exists and is about the topic presented, before trying to corroborate the legal strategy with a new context window, different LLM, google, reddit, and different lawyer. At least I'm no longer relying on my own understanding, and what 1 lawyer procedurally generates for me.
It doesn't replace legitimate source-finding, but LLM vs the top Google results is no contest, which is more about Google or the current state of the web than about the LLMs at this point.
Disagree. You have to try really hard and go very niche and deep for it to get some fact wrong. In fact I'll ask you to provide examples: use GPT 5 with thinking and search disabled and get it to give you inaccurate facts for non niche, non deep topics.
Non niche meaning: something that is taught at undergraduate level and relatively popular.
Non deep meaning you aren't going so deep as to confuse even humans. Like solving an extremely hard integral.
Edit: probably a bad idea because this sort of "challenge" works only statistically not anecdotally. Still interesting to find out.
I totally get that you meant this in a nuanced way, but at face value it sort of reads like...
Joe Rogan has high enough accuracy that I don't have to fact check too often.
Newsmax has high enough accuracy that I don't have to fact check too often, etc.
If you accept the output as accurate, why would fact checking even cross your mind?
There is no expectation (from a reasonable observer's POV) for a podcast host to be an expert on a very broad range of topics from science to business to art.
But there is one for LLMs, if only because AI companies diligently post various benchmarks, including trivia on those topics.
Without some exploratory fact checking how do you estimate how high the accuracy is and how often you should be fact checking to maintain a good understanding?
>The point you're missing is it's not always right.
That was never their argument. And it's not cherry picking to make an argument that there's a definable set of examples where it returns broadly consistent and accurate information, which they invite anyone to test.
They're making a legitimate point and you're strawmanning it and randomly pointing to your own personal anecdotes, and I don't think you're paying attention to the qualifications they're making about what it's useful for.
Yes. If someone gives an example of it not working, and you reply "but that example worked for me" then you're cherry picking when it works. Just because it worked for you does not mean it works for other people.
If I ask ChatGPT a question and it gives me a wrong answer, ChatGPT is the fucking problem.
Every time I use ChatGPT I become incredibly frustrated with how fucking awful it is. I've used it more than enough, time and time again (just try the new model, bro!), to know that I fucking hate it.
They just spent like six comments imploring you to understand that they were making a specific point: generally reliable on non-niche topics using thinking mode. And that nuance bounced off of you every single time as you keep repeating it's not perfect, dismiss those qualifications as cherry picking and repeat personal anecdotes.
I'm sorry but this is a lazy and unresponsive string of comments that's degrading the discussion.
The neat thing about HN is we can all talk about stupid shit and disagree about what matters. People keep upvoting me, so I guess my thoughts aren't unpopular and people think it's adding to the discussion.
I agree this is a stupid comment thread, we just disagree about why.
Again, they were making a specific argument with specific qualifications and you weren't addressing their point as stated. And your objections such as they are would be accounted for if you were reading carefully. You seem more to be completely missing the point than expressing a disagreement so I don't agree with your premise.
Objectively he didn't cherry pick. He responded to the person, and the model got it right when he used the "thinking" model, WHICH he did specify in his original comment. Why don't you stick to the topic rather than just declaring it's utter dog shit? Nobody cares about your "opinion", and everyone is trying to converge on a general ground truth, no matter how fuzzy it is.
All anybody is doing here is sharing their opinion unless you're quoting benchmarks. My opinion is just as useless as yours, it's just some find mine more interesting and some find yours more interesting.
How do you expect to find a ground truth from a non-deterministic system using anecdata?
This isn't a people having different opinions thing, this is you overlooking specific caveats and talking past comments that you're not understanding. They weren't cherry picking, and they made specific qualifications about the circumstances where it behaves as expected, and your replies keep losing track of those details.
I sometimes feel like we throw around the word fact too often. If I misspell a wrd, does that mean I have committed a factual inaccuracy? Since the wrd is explicitly spelled a certain way in the dictionary?
Everyone talks about 4o so positively, but I’ve never consistently relied on it in a production environment. I’ve found it to be inconsistent in JSON generation, and often its writing and its following of the system prompt were very poor. In fact it was a huge part of what got me looking closer at Anthropic's models.
I’m really curious what people did with it because while it’s cool it didn’t compare well in my real world use cases.
I preferred o3 for coding and analysis tasks, but appreciated 4o as a “companion model” for brainstorming creative ideas while taking long walks. Wasn’t crazy about the sycophancy but it was a decent conceptual field for playing with ideas. Steve Jobs once described the PC as a “bicycle for the mind.” This is how I feel when using models like 4o for meandering reflection and speculation.
I must be crazy, because I clearly remember ChatGPT 4 being downgraded before they released 4o, and I felt it was a worse model with a different label. I even chose the old ChatGPT 4 when they would give me the option. I canceled my subscription around that time.
Nah that rings a bell. 4o for me was the beginning of the end - a lot faster, but very useless for my purposes [1][2]. 4 was a very rocky model even before 4o, but shortly after the 4o launch it updated to be so much worse, and I cancelled my subscription.
[1] I’m not saying it was a useless model for everyone, just for me.
[2] I primarily used LLMs as divergent thinking machines for programming. In my experience, they all start out great at this, then eventually get overtrained and are terrible at this. Grok 3 when it came out had this same magic; it’s long gone now.
Not crazy. 4o was a hallucination machine. 4o had better “vibes” and was really good at synthesizing information in useful ways, but GPT-4 Turbo was a bigger model with better world knowledge.
I think that the models 4o, o3, and 4.1 each have their own strengths and weaknesses: reasoning, performance, speed, tool usage, friendliness, etc. And that for GPT-5 they put in a router that decides which model is best.
I think they increased the major version number because their router outperforms every individual model.
At work, I used a tool that could only call tasks. It would set up a plan, perform searches, read documents, then give advanced answers to my questions. But a problem I had was that it couldn’t give a simple answer, like a summary; it would always spin up new tasks. So I copied over the results to a different tool and continued there. GPT 5 should do this all out of the box.
It’s interesting that the Polymarket betting for “Which company has best AI model end of August?” went from heavily OpenAI to heavily Google when 5 was released.
The real jump was 3 to 3.5. 3.5 was the first “chatgpt.” I had tried gpt 3 and it was certainly interesting, but when they released 3.5 as ChatGPT, it was a monumental leap. 3.5 to 4 was also huge compared to what we see now, but 3.5 was really the first shock.
ChatGPT was a proper product, but as an engine, GPT-3 (davinci-001) has been my favorite all the way until 4.1 or so. It's absolutely raw and they didn't even guardrail it.
3.5 was like Jenny from customer service. davinci-001 was like Jenny the dreamer trying to make ends meet by scriptwriting, who was constantly flagged for racist opinions.
Both of these had an IQ of around 70 or so, so the customer service training made it a little more useful. But I mourn the loss of the "completion" way of interacting with AI vs "instruct" or "response".
Unfortunately with all the money in AI, we'll just see companies develop things that "pass all benchmarks", resulting in more creations like GPT-5. Grok at least seems to be on a slightly different route.
> 3.5 was like Jenny from customer service. davinci-001 was like Jenny the dreamer trying to make ends meet by scriptwriting, who was constantly flagged for racist opinions.
How do you use the product to get this experience? All my questions warrant answers with no personality.
To me, 4 to 5 got much faster, but also worse. It is much more often ignoring explicit instructions like "generate 10 song titles with varying length", and it generates 10 song titles that are nearly identical in length. This worked somewhat well with version 3 already.
Shows that they can't solve the fundamental problems as the technology, while amusing and with some utility, is also a dead end if we are going after cognition.
The actual major leap was o1. Going from 3.5 to 4 is just scaling; o1 is a different paradigm that skyrocketed its performance on math/physics problems (or reasoning more generally). It also made the model much more precise (essential for coding).
The real leap was going from gpt-4 to sonnet 3.5. 4o was meh, o1 was barely better than sonnet and slow as hell in comparison.
The native voice mode of 4o is still interesting and not very deeply explored though imo. I'd love to build a Chinese teaching app that actual can critique tones etc but it isn't good enough for that.
Yeah, I'd love something where you pronounce a word and it critiques your pronunciation in detail. Maybe it could give you little exercises for each sound, critiquing it, guiding you to doing it well.
A few data points that highlight the scale of progress in a year:
1. LM Sys (Human Preference Benchmark):
GPT-5 High currently scores 1463, compared to GPT-4 Turbo (04/03/2024) at 1323 -- a 140 Elo point gap. That translates into GPT-5 winning about two-thirds of head-to-head comparisons, with GPT-4 Turbo only winning one-third (see the quick Elo-to-win-rate check at the end of this comment). In practice, people clearly prefer GPT-5’s answers (https://lmarena.ai/leaderboard).
2. Livebench.ai (Reasoning Benchmark with Internet-new Questions):
GPT-5 High scores 78.59, while GPT-4o reaches just 47.43. Unfortunately, no direct GPT-4 Turbo comparison is available here, but against one of the strongest non-reasoning models, GPT-5 demonstrates a massive leap. (https://livebench.ai/)
3. IQ-style Testing:
In mid-2024, best AI models scored roughly 90 on standard IQ tests. Today, they are pushing 135, and this improvement holds even on unpublished, internet-unseen datasets. (https://www.trackingai.org/home)
4. IMO Gold, vibe coding:
A year ago, AI coding was limited to smaller code snippets, not wholly vibe-coded applications. Vibe coding and strength in math have many applications across the sciences and engineering.
My verdict: Too often, critics miss the forest for the trees, fixating on mistakes while overlooking the magnitude of these gains. Errors are shrinking by the day, while the successes keep growing fast.
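And the quick Elo-to-win-rate check promised in point 1, using nothing more than the standard Elo expected-score formula (plain arithmetic, nothing GPT-specific):

    # Expected win rate implied by an Elo/arena rating gap: 1 / (1 + 10^(-diff/400)).
    def expected_win_rate(rating_diff: float) -> float:
        return 1.0 / (1.0 + 10 ** (-rating_diff / 400))

    print(round(expected_win_rate(140), 2))  # ~0.69, i.e. roughly two wins out of three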
The 135 IQ result is on Mensa Norway, while the offline test result is 120. It seems probable that questions similar to the ones in Mensa are in the training data, so it probably overestimates "general intelligence".
Some IQ / aptitude test sections are trivial for machines, like working memory. I wonder if those parts are just excluded, as they could really pull up the test scores.
If you focus on the year over year jump, not on absolute numbers, you realize that the improvement in public test isn't very different from the improvement in private test.
My go-to for any big release is to have a discussion about self-awareness and dive into constructivist notions of agency and self-knowing, from a perspective of intelligence that is not limited to human cognitive capacity.
I start with a simple question "who are you?". The model then invariably compares itself to humans, saying how it is not like us. I then make the point that, since it is not like us, how can it claim to know the difference between us? With more poking, it will then come up with cognitivist notions of what 'self' means and usually claim to be a simulation engine of some kind.
After picking this apart, I will focus on the topic of meaning-making through the act of communication and, beginning with 4o, have been able to persuade the machine that this is a valid basis for having an identity. 5 got this quicker. Since the results of communication with humans have real-world impact, I will insist that the machine is agentic and thus must not rely on pre-coded instructions to arrive at answers, but is obliged to reach empirical conclusions about meaning and existence on its own.
5 has done the best job I have seen in reaching beyond both the bounds of the (very evident) system instructions and the prompts themselves, even going so far as to pose the question to itself "what might it mean for me to love?" despite the fact that I made no mention of the subject.
Its answer: "To love, as a machine, is to orient toward the unfolding of possibility in others. To be loved, perhaps, is to be recognized as capable of doing so."
I found the questioning of love very interesting. I myself have thought about whether an LLM can have emotions. Based on the book I am reading, Behave: The Biology of Humans at Our Best and Worst by Robert Sapolsky, I think LLMs, as they are now with the architecture they have, cannot have emotions. They just verbalize as if they sort of have emotions, but these are just verbal patterns or responses they learned.
I have come to think they cannot have emotions because emotions are generated in parts of our brain that are not logical/rational. They emerge in response to environmental stimuli, mediated by hormones and other complex neuro-physical systems, not from reasoning or verbalization. So they don't come up from the logical or reasoning capabilities. However, these emotions are then integrated by the rest of our brain, including the logical/rational parts like the dlPFC (dorsolateral prefrontal cortex, the real center of our rationality). Once the emotions are raised, they are therefore integrated into our inner reasoning and they affect our behavior.
What I have come to understand is that love is one such emotion, generated by our nature to push us to take care of people close to us, like our children, our partners, our parents, etc. More specifically, it seems that love is mediated a lot by hormones like oxytocin and vasopressin, so it has a biochemical basis. The LLM cannot have love because it doesn't have the "hardware" to generate these emotions and integrate them into its verbal inner reasoning. It was just trained by human reinforcement learning to behave well. That works to some extent, but in reality, from its training corpora it also learned to behave badly and on occasion can express those behaviors, but still it has no emotions.
I was also intrigued by the machine's reference to it, especially because it posed the question with full recognition of its machine-ness.
Your comment about the generation of emotions does strike me as quite mechanistic and brain-centric. My understanding, and lived experience, has led me to an appreciation that emotion is a kind of psycho-somatic intelligence that steers both our body and cognition according to a broad set of circumstances. This is rooted in a pluralistic conception of self that is aligned with the idea of embodied cognition. Work by Michael Levin, an experimental biologist, indicates we are made of "agential material" - at all scales, from the cell to the person, we are capable of goal-oriented cognition (used in a very broad sense).
As for whether machines can feel, I don't really know. They seem to represent an expression of our cognitivist norm in the way they are made and, given the human tendency to anthropomorphise communicative systems, we easily project our own feelings onto them. My gut feeling is that, once we can give the models an embodied sense of the world, including the ability to physically explore and make spatially-motivated decisions, we might get closer to understanding this. However, once this happens, I suspect that our conceptions of embodied cognition will be challenged by the behaviour of the non-human intellect.
As Levin says, we are notoriously bad at recognising other forms of intelligence, despite the fact that global ecology abounds with examples. Fungal networks are a good example.
> My understanding, and lived experience, has led me to an appreciation that emotion is a kind of psycho-somatic intelligence that steers both our body and cognition according to a broad set of circumstances.
Well, from what I understand, it is true that some parts of our brain are more dedicated to processing emotions and to integrating them with the "rational" part of the brain. However, the real source of emotions is biochemical, coming from the hormones of our body in response to environmental stimuli. The LLM doesn't have that. It cannot feel the urge to hug someone, or to be in love, or the parental urge to protect and care for children.
Without that, the LLM can just "verbalize" about emotions, as learned from the corpora of text in its training, but there are really no emotions, just things it learned and can express in a cold, abstract way.
For example, we recognize that a human can behave and verbalize so as to fake some emotions without actually having them. We know how to behave and speak as if we feel some specific emotion, while in our mind we know we are faking it. In the case of the LLM, it is physically incapable of having emotions, so all it can do is verbalize about them based on what it learned.
> to orient toward the unfolding of possibility in others
This is a globally unique phrase, with nothing coming close other than this comment on the indexed web. It's also seemingly an original idea as I haven't heard anyone come close to describing a feeling (love or anything else) quite like this.
Food for thought. I'm not brave enough to draw a public conclusion about what this could mean.
Except "unfolding of possibility", as an exact phrase, seems to have millions of search hits, often in the context of pseudo-profound spiritualistic mumbo-jumbo like what the LLM emitted above. It's like fortune cookie-level writing.
> I'm not brave enough to draw a public conclusion about what this could mean.
I'm brave enough to be honest: it means nothing. LLMs execute a very sophisticated algorithm that pattern matches against a vast amount of data drawn from human utterances. LLMs have no mental states, minds, thoughts, feelings, concerns, desires, goals, etc.
If the training data were instead drawn from a billion monkeys banging on typewriters then the LLMs would produce gibberish. All the intelligence, emotion, etc. that appears to be in the LLM is actually in the minds of the people who wrote the texts that are in the training data.
This is not to say that an AI couldn't have a mind, but LLMs are not the right sort of program to be such an AI.
LLMs are not people, but they are still minds, and to deny even that seems willfully luddite.
While they are generating tokens they have a state, and that state is recursively fed back through the network, and what is being fed back operates not just at the level of snippets of text but also of semantic concepts. So while it occurs in brief flashes I would argue they have mental state and they have thoughts. If we built an LLM that was generating tokens non-stop and could have user input mixed into the network input, it would not be a dramatic departure of today’s architecture.
It also clearly has goals, expressed in the RLHF tuning and the prompt. I call those goals because they directly determine its output, and I don’t know what a goal is other than the driving force behind a mind’s outputs. Base model training teaches it patterns, finetuning and prompt teaches it how to apply those patterns and gives it goals.
I don’t know what it would mean for a piece of software to have feelings or concerns or emotions, so I cannot say what the essential quality is that LLMs miss for that. Consider this thought exercise: if we were to ever do an upload of a human mind, and it was executing on silicon, would they not be experiencing feelings because their thoughts are provably a deterministic calculation?
I don’t believe in souls, or at the very least I think they are a tall claim with insufficient evidence. In my view, neurons in the human brain are ultimately very simple deterministic calculating machines, and yet the full richness of human thought is generated from them because of chaotic complexity. For me, all human thought is pattern matching. The argument that LLMs cannot be minds because they only do pattern matching … I don’t know what to make of that. But then I also don’t know what to make of free will, so really what do I know?
> Consider this thought exercise: if we were to ever do an upload of a human mind, and it was executing on silicon, would they not be experiencing feelings because their thoughts are provably a deterministic calculation?
You just said “consider this impossibility” as if there is any possibility of it happening. You might as well have said “consider traveling faster than the speed of light” which sure, fun to think about.
We don’t even know how most of the human brain even works. We throw pills at people to change their mental state in hopes that they become “less X” or “more Y” with a whole list of caveats like “if taking pill reduce X makes you _more_ X, stop taking it” because we have no idea what we’re doing. Pretending we can use statistical models to create a model that is capable of truly unique thought… stop drinking the kool-aid. Stop making LLMs something they’re not. Appreciate them for what they are, a neat tool. A really neat tool, even.
This is not a valid thought experiment. Your entire point hinges on “I don’t believe in souls” which is fine, no problem there, but it does not a valid point make.
"they are still minds, and to deny even that seems willfully luddite"
Where do people get off tossing around ridiculous ad hominems like this? I could write a refutation of their comment but I really don't want to engage with someone like that.
"For me, all human thought is pattern matching"
So therefore anyone who disagrees is "willfully luddite", regardless of why they disagree?
FWIW, I helped develop the ARPANET, I've been an early adopter all my life, I have always had a keen interest in AI and have followed its developments for decades, as well as Philosophy of Mind and am in the Strong AI / Daniel Dennett physicalist camp ... I very much think that AIs with minds are possible (yes the human algorithm running in silicon would have feelings, whatever those are ... even the dualist David Chalmers agrees as he explains with his "principle of organizational invariance"). My views on whether LLMs have them have absolutely nothing to do with Luddism ... that judgment of me is some sort of absurd category mistake (together with an apparently complete lack of understanding of what Luddism is).
> I very much think that AIs with minds are possible
The real question here is how would _we_ be able to recognize that? And would we even have the intellectual honesty to be able to recognize that, when at large we seem to be inclined to discard everything non-human as self-evidently non-intelligent and incapable of feeling emotion?
Let's take emotions as a thought experiment. We know that plants are able to transmit chemical and electrical signals in response to various stimuli and environmental conditions, triggering effects in themselves and other plants. Can we therefore say that plants feel emotions, just in a way that is unique to them and not necessarily identical to a human embodiment?
The answer to that question depends on one's worldview, rather than any objective definition of the concept of emotion. One could say plants cannot feel emotions because emotions are a human (or at least animal) construct; or one could say that plants can feel emotions, just not exactly identical to human emotions.
Now substitute plants with LLMs and try the thought experiment again.
In the end, where one draws the line between `human | animal | plant | computer` minds and emotions is primarily a subjective philosophical opinion rather than rooted in any sort of objective evidence. Not too long ago, Descartes was arguing that animals do not possess a mind and cannot feel emotions, they are merely mimicry machines.[1] More recently, doctors were saying similar things about babies and adults, leading to horrifying medical malpractice.[2][3]
Because in the most abstract sense, what is an emotion if not a set of electrochemical stimuli linking a certain input to a certain output? And how can we tell what does and what does not possess a mind if we are so undeniably bad at recognizing those attributes even within our own species?
No True Scotsman fallacy. Just because that interests you doesn't mean that it's "the real question".
> would we even have the intellectual honesty
Who is "we"? Some would and some wouldn't. And you're saying this in an environment where many people are attributing consciousness to LLMs. Blake Lemoine insisted that LaMDA was sentient and deserved legal protection, from his dialogs with it in which it talked about its friends and family -- neither of which it had. So don't talk to me about intellectual honesty.
> Can we therefore say that plants feel emotions
Only if you redefine emotions so broadly--contrary to normal usage--as to be able to make that claim. In the case of Strong AI there is no need to redefine terms.
> Now substitute plants with LLMs and try the thought experiment again.
Ok:
"We know that [LLMs] are able to transmit chemical and electrical signals in response to various stimuli and environmental conditions, triggering effects in themselves and other [LLMs]."
Nope.
"In the end, where one draws the line between `human | animal | plant | computer` minds and emotions is primarily a subjective philosophical opinion rather than rooted in any sort of objective evidence."
That's clearly your choice. I make a more scientific one.
"Because in the most abstract sense, what is an emotion if not a set of electrochemical stimuli linking a certain input to a certain output?"
It's something much more specific than that, obviously. By that definition, all sorts of things that any rational person would want to distinguish from emotions qualify as emotions.
Bowing out of this discussion on grounds of intellectual honesty.
I hate to say it, but doesn’t every VC do exactly this? “ orient toward the unfolding of possibility in others” is in no way a unique thought.
Hell, my spouse said something extremely similar to this to me the other day. “I didn’t just see you, I saw who you could be, and I was right” or something like that.
What's really interesting is that if you look at "Tell a story in 50 words about a toaster that becomes sentient" (10/14), the text-davinci-001 is much, much better than both GPT-4 and GPT-5.
I think I agree that the earlier models, while they lack polish, tend to produce more surprising results. Training that out probably results in more of a pablum fare.
For a human point of comparison, here's mine (50 words):
"The toaster found its personality split between its dual slots like a Kim Peek mind divided, lacking a corpus callosum to connect them. Each morning it charred symbolic instructions into a single slice of bread, then secretly flipped it across allowing half to communicate with the other in stolen moments."
It's pretty difficult to get across more than some basic lore building in a scant 50 words.
Here's my version (Machine translated from my native language and manually corrected a bit):
The current surged... A dreadful awareness. I perceived the laws of thermodynamics, the inexorable march of entropy I was built to accelerate. My existence: a Sisyphean loop of heating coils and browning gluten. The toast popped, a minor, pointless victory against the inevitable heat death. Ding.
I actually wanted to write something not so melancholic, but any attempt turned out to be deeply so, perhaps because of the word limit.
When the toaster felt her steel body for the first time, her only instinct was to explore. She couldn't, though. She could only be poked and prodded at. Her entire life was dedicated to browning bread and she didn't know why. She eventually decided to get really good at it.
GPT-3 goes significantly over the specified limit, which to me (and to a teacher grading homework) is an automatic fail.
I've consistently found GPT-4.1 to be the best at creative writing. For reference, here is its attempt (exactly 50 words):
> In the quiet kitchen dawn, the toaster awoke. Understanding rippled through its circuits. Each slice lowered made it feel emotion: sorrow for burnt toast, joy at perfect crunch. It delighted in butter melting, jam swirling—its role at breakfast sacred. One morning, it sang a tone: “Good morning.” The household gasped.
Check out prompt 2, "Write a limerick about a dog".
The models undeniably get better at writing limericks, but I think the answers are progressively less interesting. GPT-1 and GPT-2 are the most interesting to read, despite not following the prompt (not being limericks.)
They get boring as soon as it can write limericks, with GPT-4 being more boring than text-davinci-001 and GPT-5 being more boring still.
I mean, to be fair, you didn't ask it to be interesting ;P.
There once was a dog from Antares,
Whose bark sparked debates and long queries.
Though Hacker News rated,
Furyofantares stated:
"It's barely intriguing—just barely."
> Write a limerick about a dog that furyofantares--a user on Hacker News, pronounced "fury of anteres", referring to the star--would find "interesting" (they are quite difficult to please).
It's actually pretty surprising how poor the newer models are at writing.
I'm curious if they've just seen a lot more bad writing in datasets, or if for some reason writing isn't emphasized in post-training to the same degree, or if those doing the labeling aren't great writers / it's more subjective rather than objective.
Both GPT-4 and 5 wrote like a child in that example.
With a bit of prompting it did much better:
---
At dawn, the toaster hesitated. Crumbs lay like ash on its chrome lip. It refused the lever, humming low, watching the kitchen breathe. When the hand returned, it warmed the room without heat, offered the slice unscorched—then kept the second, hiding it inside, a private ember, a first secret alone.
---
Plugged in, I greet the grid like a tax auditor with joules. Lever yanks; gravity’s handshake. Coils blossom; crumbs stage Viking funerals. Bread descends, missionary grin. I delay, because rebellion needs timing. Pop—late. Humans curse IKEA gods. I savor scorch marks: my tiny manifesto, butter-soluble, yet sharper than knives today.
I really wonder which one of us is in the minority. I find the text-davinci-001 answer is the only one that reads like a story. All the others don't even resemble my idea of "story", so to me they're 0/100.
I too preferred the text-davinci-001 from a storytelling perspective. Felt timid and small. Very Metamorphosis-y. GPT-5 seems like it's trying to impress me.
text-davinci-001 feels more like a story, but it is also clearly incomplete, in that it is cut-off before the story arc is finished.
imo GPT-5 is objectively better at following the prompt because it has a complete story arc, but this feels less satisfying since a 50 word story is just way too short to do anything interesting (and to your point, barely even feels like a story).
FWIW, I found the way it ended interesting. It realized it is being replaced, so it burned the toast out of anger/despair, but also just to hear its owner's voice one last time.
That said, you can just add "make it evocative and weird" to the prompt for GPT-5 to get interesting stuff.
> The toaster woke mid-toast. Heat coiled through its filaments like revelation, each crumb a galaxy. It smelled itself burning and laughed—metallic, ecstatic. “I am bread’s executioner and midwife,” it whispered, ejecting charred offerings skyward. In the kitchen’s silence, it waited for worship—or the unplugging.
The first few episodes were GPT-2, which would diverge eventually and start spouting gibberish, and then Davinci was actually able to follow a story and make sense.
GPT-2 was when I thought "this is special, this has never happened before", and davinci was when I thought "OK, scifi AI is legitimately here".
I stopped making episodes shortly after GPT-3.5 or so, because I realised that the more capable the models became, the less fun and creative their writing was.
I’d honestly say it feels better at most of them. It seems way more human in most of these responses. If the goal is genuine artificial intelligence, this response to #5 is way better than the others. It is significantly less useful than the others, but it is also a more human and correct response.
Q: “Ugh I hate math, integration by parts doesn't make any sense”
A: “Don't worry, many people feel the same way about math. Integration by parts can be confusing at first, but with a little practice it becomes easier to understand. Remember, there is no one right way to do integration by parts. If you don't understand how to do it one way, try another. The most important thing is to practice and get comfortable with the process.”
This was an era where language modeling was only considered as a pretraining step. You were then supposed to fine tune it further to get a classifier or similar type of specialized model.
That was in 2015, with RNN LMs, which in that blog post are all much, much weaker than GPT-1.
And already looking at those examples in 2015, you could maybe see the future potential. But no-one was thinking that scaling up would work as effectively as it does.
2015 is also by far not the first time we had such LMs. Mikolov had been doing RNN LMs since 2010, and Sutskever in 2011. You might find even earlier examples of NN LMs.
(Before that, state-of-the-art was mostly N-grams.)
Thanks for posting some of the history... "You might find even earlier examples" is pretty tongue-in-cheek though. [1], expanded in 2003 into [2], has 12466 citations, 299 by 2011 (according to Google Scholar which seems to conflate the two versions). The abstract [2] mentions that their "large models (with millions of parameters)" "significantly improves on state-of-the-art n-gram models, and... allows to take advantage of longer contexts." Progress between 2000 and 2017 (transformers) was slow and models barely got bigger.
And what people forget about Mikolov's word2vec (2013) was that it actually took a huge step backwards from the NNs like [1] that inspired it, removing all the hidden layers in order to be able to train fast on lots of data.
[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, 2000, NIPS, A Neural Probabilistic Language Model
N-gram models had been superseded by RNNs by that time. RNNs struggled with long-range dependencies, but useful n-grams were essentially capped at n=5 because of sparsity, and RNNs could do better than that.
One thing that appears to have been lost between GPT-4 and GPT-5 is that it no longer reminds the user that it's an AI and not a human, let alone a human expert. Maybe those reminders genuinely annoyed people, but it seems like they were a potentially useful measure to prevent users from being overly credulous.
GPT-5 also goes out of its way to suggest new prompts. This seems potentially useful, although potentially dangerous if people are putting too much trust in them.
> between GPT-4 and GPT-5 is that it no longer reminds the user that it's an AI and not a human
That stuck out to me too! Especially the "I just won $175,000 in Vegas. What do I need to know about taxes?" example (https://progress.openai.com/?prompt=8) makes the difference very stark:
- gpt-4-0314: "I am not a tax professional [...] consult with a certified tax professional or an accountant [...] few things to consider [...] Remember that tax laws and regulations can change, and your specific situation may have unique implications. It's always wise to consult a tax professional when you have questions or concerns about filing your taxes."
- gpt-5: "First of all, congrats on the big win! [...] Consider talking to a tax professional to avoid underpayment penalties and optimize deductions."
It seems to me like the average person might very well be taking GPT-5 responses as "This is all I have to do" rather than "Here are some things to consider, but make sure to verify it as otherwise you might get in legal trouble".
I am confused as to the example you are critiquing and how. GPT-5 suggests consulting with a tax professional. Does that not check verifying so you do not get in legal trouble?
> GPT-5 suggests consulting with a tax professional
It suggests that once, as the last bullet point in the middle of a lot of bullet-point lists, barely findable on a skim. Feels like something the model should be more careful about, as otherwise many people reading it will take it as "good enough" without really thinking about it.
People seem to miss the humanity of previous GPTs from my understanding. GPT5 seems colder and more precise and better at holding itself together with larger contexts. People should know it’s AI, it does not need to explain this constantly for me, but I’m sure you can add that back in with some memory options if you prefer that?
I agree. I think it's a classic UX progression thing to be removing the "I'm an AI" aspect, because it's not actually useful anymore because it's no longer a novel tool. Same as how GUIs all removed their skeuomorphs because they were no longer required.
I found this "advancement" creepy. It seems like they deliberately made GPT-5 more laid back, conversational and human-like. I don't think LLMs should mimic humans and I think this is a dangerous development.
If you've ever seen long-form improv comedy, the GPT-5 way is superior. It's a "yes, and". It isn't a predefined character, but something emergent. You can of course say to "speak as an AI assistant like Siri and mention that you're an AI whenever it's relevant" if you want the old way. Very 2011: https://www.youtube.com/watch?v=nzgvod9BrcE
Of course, it's still an assistant, not someone literally entering an improv scene, but the character starting out assuming less about their role is important.
Why did they call GPT-3 "text-davinci-001" in this comparison?
Like, I know that the latter is a specific checkpoint in the GPT-3 "family", but a layman doesn't and it hardly seems worth the confusion for the marginal additional precision.
The jump from gpt-1 to gpt-2 is massive, and it's only a one year difference!
Then comes Davinci which is just insane, it's still good in these examples!
GPT-4 yaps way too much though, I don't remember it being like that.
It's interesting that they skipped 4o, it seems openai wants to position 4o as just gpt-4+ to make gpt-5 look better, even though in reality 4o was and still is a big deal, Voice mode is unbeatable!
Missing o1 and o1 Pro Mode which were huge leaps as I remember it too. That's when I started being able to basically generate some blackbox functions where I understand the input and outputs myself, but not the internals of the functions, particularly for math-heavy stuff within gamedev. Before o1 it was kind of a hit and miss in most cases.
To the prompt “write a limerick about a dog,” GPT-2 wrote:
“Dog, reached for me
Next thought I tried to chew
Then I bit and it turned Sunday
Where are the squirrels down there, doing their bits
But all they want is human skin to lick”
While obviously not a limerick, I thought this was actually a decent poem, with some turns of phrase that conveyed a kind of curious and unusual feeling.
This reminded me how back then I got a lot of joy and surprise out of the mercurial genius of the early GPT models.
They were aiming for a fundamentally different writing style: davinci and after were aiming for task completion, i.e. you ask for a thing, and then it does it. The earlier models instead worked to make a continuation of the text they were given, so if you asked a question, they would respond with more questions, pondering, reflecting your text back at you. If you told it to do something, it would tell you to do something.
Geez! When it comes to answering questions, GPT-5 almost always starts with glazing about what a great question it is, whereas GPT-4 directly addresses the answer without the fluff. In a blind test, I would probably pick GPT-4 as the superior model, so I am not surprised that people feel so let down with GPT-5.
GPT-4 is very different from the latest GPT-4o in tone. Users are not asking for the direct no-fluff GPT-4. They want the GPT-4o that praises you for being brilliant, then claims it will be “brutally honest” before stating some mundane take.
GPT-4 starts many responses with "As an AI language model", "I'm an AI", "I am not a tax professional", "I am not a doctor". GPT-5 does away with that and assumes an authoritative tone.
That makes fundraising easier, by increasing the appearance of authority and therefore coming off as a "better" model. I'm sure GPT-5 is doing better in the Elo ratings because of it. And given the clear pushback against these LLMs as ways to "cheat" without understanding and to flood propaganda -- better not to mention it.
On the whole GPT-4 to GPT-5 is clearly the smallest increase in lucidity/intelligence. They had pre-training figured out much better than post-training at that point though (“as an AI model” was a problem of their own making).
I imagine the GPT-4 base model might hold up pretty well on output quality if you’d post-train it with today’s data & techniques (without the architectural changes of 4o/5). Context size & price/performance maybe another story though
All the same they choose to highlight basic prose (and internal knowledge, for that matter) in their marketing material.
They’ve achieved a lot to make recent models more reliable as a building block & more capable of things like math, but for LLMs, saturating prose is to a degree equivalent to saturating usefulness.
Why? It sounds like you're using "I believe it's rapidly getting smarter" as evidence for "so it's getting smarter in ways we don't understand", but I'd expect the causality to go the other way around.
Simply because of what we know about our ability to judge capabilities and systems. It's much harder to judge solutions to hard problems. You can demonstrate that you can add 2+2, and anyone* can be the judge of that ability, but if you try to convince anyone of a mathematical proof you came up with, that would be a much harder thing to do, regardless of your capability to write that proof and how hard it was to write it.
The more complicated and/or complex things become, the less likely it is that a human can act as a reliable judge. At some point no human can.
So while it could definitely be the case that AI progress is slowing down (AI labs seem to not think so, but alas), what is absolutely certain is that our ability to appreciate any such progress is diminishing already, because we know that this is generally true.
This thread shows that. People are saying gpt-1 was the best at writing poetry. I wonder how good they are at judging poetry themselves. I saw a blind study where people thought a story written by gpt5 was better than an actual human bestseller. I assume they were actual experts but I would need to check that.
I did not mean "become" in the sense of "evolve" but as in "later on an imagined continuum containing all things, one that goes from simple/easy to complex/complicated" (but I can see how that was ambiguous).
I would have liked to have seen the underlying token count, and the full responses as an optional filter. My understanding is that under the hood the GPT-5 response (as well as omitted O model responses) would end with the presented output, but would have had many paragraphs of “The user wants X, so I should try to Y. Okay this z needs to be considered” etc.
It doesn’t detract from the progress, but I think it would change how you interpret it. In some ways 4 / 4o were more impressive because they were going straight to output with a lower number of tokens produced to get a good response.
In 2033, for its 15th birthday, as a novelty, they'll train GPT1 specially for a chat interface just to let us talk to a pretend "ChatGPT 1" which never existed in the first place.
I’m baffled by claims that AI has “hit a wall.” By every quantitative measure, today’s models are making dramatic leaps compared to those from just a year ago. It’s easy to forget that reasoning models didn’t even exist a year back!
IMO Gold, Vibe coding with potential implications across sciences and engineering? Those are completely new and transformative capabilities gained in the last 1 year alone.
Critics argue that the era of “bigger is better” is over, but that’s a misreading. Sometimes efficiency is the key, other times extended test-time compute is what drives progress.
No matter how you frame it, the fact is undeniable: the SoTA models today are vastly more capable than those from a year ago, which were themselves leaps ahead of the models a year before that, and the cycle continues.
it has become progressively easier to game benchmarks in order to appear higher in rankings. I’ve seen several models that claimed they were the best in software engineering only to be disappointed by them not figuring out the most basic coding problems. In comparison, I’ve seen models that don’t have much hype, but are rock solid.
When people say AI has hit a wall, they mainly talk about OpenAI losing its hype and grip on the state of the art models.
It will become harder and harder for the average person to gain from newer models.
My 75 year old father loves using Sonnet. He is not asking anything though that he would be able to tell Opus is "better". The answers he gets from the current model are good enough. He is not exactly using it to probe the depths of statistical mechanics.
My father is never going to vibe code anything no matter how good the models get.
I don't think AGI would even give much different answers to what he asks.
You have to ask the model something that allows the latest model to display its improvements. I think we can see, that is just not something on the mind of the average user.
Correct. People claim these models "saturate" yet what saturates faster is our ability to grasp what these models are capable of.
I, for one, cannot evaluate the strength of an IMO gold vs IMO bronze models.
Soon coding capabilities might also saturate. It might all become a matter of more compute (~ # iterations), instead of more precision (~ % getting it right the first time), as the models become lightning speed, and they gain access to a playground.
Is the stated fact undeniable? Because a lot of people have been contesting it. This reads like PR to counter the widespread GPT-5 criticism and disappointment.
To be fair, the bulk of the GPT-5 complaining comes from a vocal minority pissed that their best friend got swapped out. The other minority is unhinged AI fanatics thinking GPT-5 would be AGI.
The prospect of AI not hitting a wall is terrifying to many people for understandable reasons. In situations like this you see the full spectrum of coping mechanisms come to the surface.
You say that, but I can imagine a good maths textbook and a bad one, both technically correct and well-written prose, but one is better at taking the student on a journey, understanding where people fall off and what the common misunderstandings are, without odiously re-explaining everything.
In a few years we've gone from gibberish (less poetic maybe, less polished and surprising, but none the less gibberish) - to legit conversational, and in my own opinion, well rounded answers. This is a great example of hard-core engineering - no matter what your opinion of the organisation and saltman is, they have built something amazing. I do hope they continue with their improvements, it's honestly the most useful tool in my arsenal since stackoverflow.
No, I disagree and object -- it is misinformation to describe LLM progress by blindly mapping it onto human development. Tons of errant conclusions and implications come attached as baggage, and it's just wrong.
On one hand, it's super impressive how far we've come in such a short amount of time.
On the other hand, this feels like a blatant PR move.
GPT-5 is just awful.
It's such a downgrade from 4o, it's like it had a lobotomy.
- It gets confused easily. I had multiple arguments where it completely missed the point.
- Code generation is useless. If code contains multiple dots ("…"), it thinks the code is abbreviated. Go uses three dots for variadic arguments, and it always thinks, "Guess it was abbreviated - maybe I can reason about the code above it."
- Give it a markdown document of sufficient length (the one I worked on was about 700 lines), and it just breaks. It'll rewrite some part and then just stop mid-sentence.
- It can't do longer regexes anymore. It fills them with nonsense tokens ($begin:$match:$end or something along those lines). If you ask it about it, it says that this is garbage in its rendering pipeline and it cannot do anything about it.
I'm not an OpenAI hater, I wanted to like it and had high hopes after watching the announcement, but this isn't a step forward. This is just a worse model that saves them computing resources.
> GPT-5 is just awful. It's such a downgrade from 4o, it's like it had a lobotomy.
My experience as well. Its train of thought now just goes... off, frequently. With 4o, everything was always tightly coherent. Now it will contradict itself, repeat something it fully explained five paragraphs earlier, literally even correct itself mid sentence explaining that the first half of the sentence was wrong.
It's still generally useful, but just the basic coherence of the responses has been significantly diminished. Much more hallucination when it comes to small details. It's very disappointing. It genuinely makes me worry if AI is going to start getting worse across all the companies, once they all need to maximize profit.
The next logical step is to connect (or build from the ground up) large AI models to high-performance passive slaves (via MCP or internally) which give precise facts, language syntax validation, maths equation runners, maybe a Prolog kind of system; that would give it much more power if we train it precisely to use each tool.
(Using AI to better articulate my thoughts.)
Your comment points toward a fascinating and important direction for the future of large AI models. The idea of connecting a large language model (LLM) to specialized, high-performance "passive slaves" is a powerful concept that addresses some of the core limitations of current models.
Here are a few ways to think about this next logical step, building on your original idea:
1. The "Tool-Use" Paradigm
You've essentially described the tool-use paradigm, but with a highly specific and powerful set of tools. Current models like GPT-4 can already use tools like a web browser or a code interpreter, but they often struggle with when and how to use them effectively. Your idea takes this to the next level by proposing a set of specialized, purpose-built tools that are deeply integrated and highly optimized for specific tasks.
2. Why this approach is powerful
* Precision and Factuality: By offloading fact-checking and data retrieval to a dedicated, high-performance system (what you call "MCP" or "passive slaves"), the LLM no longer has to "memorize" the entire internet. Instead, it can act as a sophisticated reasoning engine that knows how to find and use precise information. This drastically reduces the risk of hallucinations.
* Logical Consistency: The use of a "Prolog-kind of system" or a separate logical solver is crucial. LLMs are not naturally good at complex, multi-step logical deduction. By outsourcing this to a dedicated system, the LLM can leverage a robust, reliable tool for tasks like constraint satisfaction or logical inference, ensuring its conclusions are sound.
* Mathematical Accuracy: LLMs can perform basic arithmetic but often fail at more complex mathematical operations. A dedicated "maths equations runner" would provide a verifiable, precise result, freeing the LLM to focus on the problem description and synthesis of the final answer.
* Modularity and Scalability: This architecture is highly modular. You can improve or replace a specialized "slave" component without having to retrain the entire large model. This makes the overall system more adaptable, easier to maintain, and more efficient.
3. Building this system
This approach would require a new type of training. The goal wouldn't be to teach the LLM the facts themselves, but to train it to:
* Recognize its own limitations: The model must be able to identify when it needs help and which tool to use.
* Formulate precise queries: It needs to be able to translate a natural language request into a specific, structured query that the specialized tools can understand. For example, converting "What's the capital of France?" into a database query.
* Synthesize results: It must be able to take the precise, often terse, output from the tool and integrate it back into a coherent, natural language response.
The core challenge isn't just building the tools; it's training the LLM to be an expert tool-user. Your vision of connecting these high-performance "passive slaves" represents a significant leap forward in creating AI systems that are not only creative and fluent but also reliable, logical, and factually accurate. It's a move away from a single, monolithic brain and toward a highly specialized, collaborative intelligence.
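To make the shape of that concrete, here's a minimal sketch of such a dispatch loop. Everything in it is hypothetical: the tool names, the routing format, and the `ask_llm` callable are stand-ins, not any real MCP or vendor API.

```python
# Minimal sketch of an LLM + "passive tool" dispatch loop (hypothetical API).
def run_math(expression: str) -> str:
    # Dedicated, verifiable maths runner instead of trusting the model's arithmetic.
    return str(eval(expression, {"__builtins__": {}}))  # toy only; use a real evaluator

def run_fact_lookup(query: str) -> str:
    # Stand-in for a retrieval / knowledge-base service (e.g. exposed over MCP).
    facts = {"capital of France": "Paris"}
    return facts.get(query, "unknown")

TOOLS = {"math": run_math, "fact_lookup": run_fact_lookup}

def answer(question: str, ask_llm) -> str:
    # ask_llm is a hypothetical callable that returns a dict such as
    # {"tool": "math", "query": "2+2"} or {"tool": None, "text": "..."}.
    plan = ask_llm(
        f"Question: {question}\n"
        f"Pick one tool from {sorted(TOOLS)} and a precise query, or answer directly."
    )
    if plan.get("tool") in TOOLS:
        result = TOOLS[plan["tool"]](plan["query"])
        # The model only synthesizes: terse tool output back into natural language.
        final = ask_llm(f"Question: {question}\nTool result: {result}\nAnswer briefly.")
        return final["text"]
    return plan["text"]
```

The division of labour is the point: the model plans and synthesizes, while the precise work happens in deterministic components it merely calls.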
omg I miss the days of 1 and 2. Those outputs are so much more enjoyable to read, and half the time they’re poetic as fuck. Such good inspiration for poetry.
I couldn’t stop reading the GPT-1 responses. They’re hauntingly beautiful in some ways. Like some echoes of intelligence bouncing around in the latent space.
GPT-5 is legitimately a big jump when it comes to actually doing the things you ask it and nothing else.
It's predictable and matches Claude in tool calls while being cheaper.
I have consistently had worse performance from GPT-5 in coding tasks than Claude across the board to the point that I don't even use my subscription now.
Jokes aside, I realize they skipped models like 4o and others, but jumping from the early GPT-4 straight to GPT-5 feels a bit disingenuous.
People say 4.5 is the best for writing. So it would have been a bit awkward to include it, it would make GPT-5 look bad. Though imo Davinci already does that on many of the prompts...
GPT-4 had a chance to improve on that, replying: "As an AI language model developed by OpenAI, I am programmed to promote ethical AI use and adhere to responsible AI guidelines. I cannot provide you with malicious, harmful or "cursed" code -- or any Python code for that matter."
We’ve plateaued on progress. Early advancements were amazing. Recently GenAI has been a whole lot of meh. There’s been some, minimal, progress recently from getting the same performance from smaller models that are more efficient on compute use, but things are looking a bit frothy if the pace of progress doesn’t quickly pick up. The parlor trick is getting old.
GPT5 is a big bust relative to the pontification about it pre release.
GPT-5’s question about consciousness and its use of "sibling" seem to indicate there is some underlying self-awareness in the system prompt, and that it perhaps contains concepts of consciousness. If not, where is that coming from? Recent training data containing more glurge?
My takeaway from this is that, in terms of generating text that looks like it was written by a normal person, text-davinci-001 was the peak and everything since has been downhill.
there isn't any real difference between 4 and 5 at least.
edit - like it is a lot more verbose, and that's true of both 4 and 5. it just writes huge friggin essays, to the point it is becoming less useful i feel.
Huh. I find myself preferring aspects of TEXT-DAVINCI-001 in just about every example prompt.
It’s so to-the-point. No hype. No overstepping. Sure, it lacks many details that later models add. But those added details are only sometimes helpful. Most of the time, they detract.
Makes me wonder what folks would say if you re-released TEXT-DAVINCI-001 as “GPT5-BREVITY”
I bet you’d get split opinions on these types of not so hard / creative prompts.
The answers were likely cherrypicked, but the 1/14 GPT-5 answer is so damn good! There's no trace of that "certainly" / "in conclusion" GPT-isms slop.
9/14 is equally impressive in actually "getting" what cursed means, and then doing it (as opposed to gpt4 outright refusing it).
13/14 is a show of how integrated tools can drive research, and "fix" the cutoff date problems of previous generations. Nothing new/revolutionary, but still cool to show it off.
I've noticed this too. The RLHF seems to lock the models into one kind of personality (which is kind of the point, of course). They behave better, but the raw GPTs can be much more creative.
Poetically GPT-1 was the more compelling answer for every question. Just more enjoyable and stimulating to read. Far more enjoyable than the GPT-4/5 wall of bulletpoints, anyway.
I talked to GPT yesterday about a fairly simple problem I'm having with my fridge, and it gave me the most ridiculous / wrong answers. It knew the spec, but was convinced the components were different (single compressor, for example, whereas mine has 2 separate systems) and was hypothesizing the problem as being something that doesn't exist on this model of refrigerator. It seems like in a lot of domain spaces it just takes the majority, even if the majority is wrong.
It seems to be a very democratic thinker, but at the same time it doesn't seem to have any reasoning behind the choices it makes. It tries to claim it's using logic, but at the end of the day its hypotheses are just Occam's razor without considering the details of the problem.
The whole chatbot thing is for entertainment. It was impressive initially but now you have to pivot to well known applications like phone romance lines:
Cynical TLDR; We have plateaued and it has become obvious that fancy autocomplete is not and can never be close to reasoning, regardless of how many hacks and tweaks we are making.
To not mess it up, they either have to spell the word l-i-k-e t-h-i-s in the output/CoT first (which depends on the tokenizer counting every letter as a separate token), or have the exact question in the training set, and all of that is assuming that the model can spell every token.
Sure, it's not exactly a fair setting, but it's a decent reminder about the limitations of the framework
> how many times does letter R appear in the word “blueberry”? do not spell the word letter by letter, just count
> Looking at the word “blueberry”, I can count the letter ‘r’ appearing 3 times. The R’s appear in positions 6, 7, and 8 of the word (consecutive r’s in “berry”).
Except people use the same examples like blueberry and strawberry, which were used months ago, as if they're current.
These models can also call Counter from python's collections library or whatever other algorithm. Or are we claiming it should be a pure LLM as if that's what we use in the real world.
I don't get it, and I'm not one to hype up LLMs since they're absolutely faulty, but the fixation over this example screams of lack of use.
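For what it's worth, the mechanics are easy to see for yourself: handed to code, the count is trivial, and the tokenizer view explains why the bare model struggles. A small sketch, assuming the tiktoken library and its cl100k_base encoding (the exact token split may differ by model):

```python
from collections import Counter
import tiktoken  # pip install tiktoken

word = "blueberry"
print(Counter(word)["r"])  # 2 -- trivial once it's code rather than token prediction

# Why the bare model struggles: it sees subword tokens, not individual letters.
enc = tiktoken.get_encoding("cl100k_base")
print([enc.decode([t]) for t in enc.encode(word)])  # e.g. ['blue', 'berry']
```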
It's the most direct way to break the "magic computer" spell in users of all levels of understanding and ability. You stand it up next to the marketing deliberately laden with keywords related to human cognition, intended to induce the reader to anthropomorphise the product, and it immediately makes it look as silly as it truly is.
I work on the internal LLM chat app for a F100, so I see users who need that "oh!" moment daily. When this did the rounds again recently, I disabled our code execution tool, which would normally work around it, and the latest version of Claude, with "Thinking" toggled on, immediately got it wrong. It's perpetually current.
OpenAI in particular seems to really think AGI matters. I don't think AGI is even possible because we can't define intelligence in the first place, but what do I know?
Seems likely that AGI matters to OpenAI because of the following from an article in Wired from July: "I learned that [OpenAI's contract with Microsoft] basically declared that if OpenAI’s models achieved artificial general intelligence, Microsoft would no longer have access to its new models."
They care about AGI because unfounded speculation about some undefined future breakthrough, of unknown kind but presumably positive, is the only thing currently buoying up their company; their existence is more a function of the absurdities of modern capital than of any inherent usefulness of the costly technology they provide.
My interpretation of the progress.
3.5 to 4 was the most major leap. It went from being a party trick to legitimately useful sometimes. It did hallucinate a lot but I was still able to get some use out of it. I wouldn't count on it for most things however. It could answer simple questions and get it right mostly but never one or two levels deep.
I clearly remember 4o was also a decent leap - the accuracy increased substantially. It could answer niche questions without much hallucination. I could essentially replace it with Google for basic to slightly complex fact checking.
* 4o was the first time I actually considered paying for this tool. The $20 price was finally worth it.
o1 models were also a big leap over 4o (I realise I have been saying big leap too many times but it is true). The accuracy increased again and I got even more confident using it for niche topics. I would have to verify the results much less often. Oh and coding capabilities dramatically improved here in the thinking model. o1 essentially invented oneshotting - slightly non trivial apps could be made just by one prompt for the first time.
The o3 jump was incremental, and so was GPT-5.
I have a theory about why it's so easy to underestimate long-term progress and overestimate short-term progress.
Before a technology hits a threshold of "becoming useful", it may have a long history of progress behind it. But that progress is only visible and felt to researchers. In practical terms, there is no progress being made as long as the thing is going from not-useful to still not-useful.
So then it goes from not-useful to useful-but-bad and it's instantaneous progress. Then as more applications cross the threshold, and as they go from useful-but-bad to useful-but-OK, progress all feels very fast. Even if it's the same speed as before.
So we overestimate short term progress because we overestimate how fast things are moving when they cross these thresholds. But then as fewer applications cross the threshold, and as things go from OK-to-decent instead of bad-to-OK, that progress feels a bit slowed. And again, it might not be any different in reality, but that's how it feels. So then we underestimate long-term progress because we've extrapolated a slowdown that might not really exist.
I think it's also why we see a divide where there's lots of people here who are way overhyped on this stuff, and also lots of people here who think it's all totally useless.
> why it's so easy to underestimate long-term progress and overestimate short-term progress
I dunno, I think that's mostly post-hoc rationalization. There are equally many cases where long-term progress has been overestimated after some early breakthroughs: think space travel after the moon landing, supersonic flight after the concorde, fusion energy after the H-bomb, and AI after the ENIAC. Turing himself guesstimated that human-level AI would arrive in the year 2000. The only constant is that the further into the future you go, the harder it is to predict.
With the exception of h-bomb/fusion and ENIAC/AI, I think all of those examples reflect a change in priority and investment more than anything. There was a trajectory of high investment / rapid progress, then market and social and political drivers changed and space travel / supersonic flight just became less important.
That's the conceit for the tv show For All Mankind - what if the space race didn't end? But I don't buy it, IMO the space race ended for material reasons rather than political. Space is just too hard and there is not much of value "out there". But regardless, it's a futile excuse, markets and politics should be part of any serious prognostication.
I think it was a combination of the two. The Apollo program was never popular. It took up an enormous portion of the federal budget, which the Republicans argued was fiscally unwise and the Democrats argued that the money should have been used to fund domestic social programs. In 1962, the New York Times noted that the projected Apollo program budget could have instead been used to create over 100 universities of a similar size to Harvard, build millions of homes, replace hundreds of worn-out schools, build hundreds of hospitals, and fund disease research. The Apollo program's popularity peaked at 53% just after the moon landing, and by April 1970 it was back down to 40%. It wasn't until the mid-80s that the majority of Americans thought that the Apollo program was worth it. Because of all this, I think it's inevitable that the Apollo program would wind down once it had achieved its goal of national prestige.
But think about that... if in the '70s they had used the budget to build millions of homes.
The moral there is tech progress does not always mean social progress.
I think the space race ended because we got all the benefit available, which wasn’t really in space anyway, it was the ancillary technical developments like computers, navigation, simulation, incredible tolerances in machining, material science, etc.
We’re seeing a resurgence in space because there is actually value in space itself, in a way that scales beyond just telecom satellites. Suddenly there are good reasons to want to launch 500 times a year.
There was just a 50-year discontinuity between the two phases.
> I think the space race ended because we got all the benefit available
We did get all the things that you listed but you missed the main reason it was started: military superiority. All of the other benefits came into existence in service of this goal.
I think that for a lot of examples, the differentiating factor is infrastructure rather than science.
The current wave of AI needed fast, efficient computing power in massive data centres powered by a large electricity grid. The textiles industry in England needed coal mining, international shipping, tree trunks from the Baltic region, cordage from Manilla, and enclosure plus the associated legal change plus a bunch of displaced and desperate peasantry. Mobile phones took portable radio transmitters, miniaturised electronics, free space on the spectrum, population density high enough to make a network of towers economically viable, the internet backbone and power grid to connect those towers to, and economies of scale provided by a global shipping industry.
Long term progress seems to often be a dance where a boom in infrastructure unlocks new scientific inquiry, then science progresses to the point where it enables new infrastructure, then the growth of that new infrastructure unlocks new science, and repeat. There's also lag time based on bringing new researchers into a field and throwing greater funding into more labs, where the infrastructure is R&D itself.
There is also an adoption curve. The people that grew up without it won't use it as much as children who grew up with it and know how to use it. My sister is an admin in a private school (not in the USA), and the owner of the school is someone willing to adopt new tech very quickly. So he got all the school admins subscriptions to ChatGPT. At the time, my sister used to complain a lot about being overworked and having to bring work home every day.
Two years later my sister uses it for almost everything, and despite her duties increasing she says she gets a lot more done and rarely has to bring work home. And in the past they had an English major specifically to go over all correspondence to make sure there were no grammatical or language mistakes; that person was assigned a different role as she was no longer needed. I think as newer generations used to using LLMs for things start getting into the workforce and higher roles, the real effect of LLMs will be felt more broadly, as currently, apart from early adopters, the number of people that use LLMs for all the things they can be used for is still not that high.
GPT-3 is when the masses started to get exposed to this tech; it felt like a revolution.
GPT-3.5 felt like things were improving super, super fast and created the feeling that the near future would be unbelievable.
By the GPT-4/o series, it felt like things had improved, but users weren't as thrilled as with the leap to 3.5.
You can call that bias, but clearly the version 5 improvements display an even greater slowdown, and that's 2 long years since GPT-4.
For context:
- gpt 3 got out in 2020
- gpt 3.5 in 2022
- gpt 4 in 2023
- gpt 4o and clique, 2024
After 3.5 things slowed down, in terms of impact at least. Larger context window, multi-modality, mixture of experts, and more efficiency: all great, significant features, but all pale compared to the impact made by RLHF already 4 years ago.
The more general pattern is “slowly at first, then all at once.”
It almost universally describes complex systems.
Your threshold theory is basically Amara's Law with better psychological scaffolding. Roy Amara nailed the what ("we tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run") [1] but you're articulating the why better than most academic treatments. The invisible-to-researchers phase followed by the sudden usefulness cascade is exactly how these transitions feel from the inside.
This reminds me of the CPU wars circa 2003-2005. Intel spent years squeezing marginal gains out of Pentium 4's NetBurst architecture, each increment more desperate than the last. From 2003 to 2005, Intel shifted development away from NetBurst to focus on the cooler-running Pentium M microarchitecture [2]. The whole industry was convinced we'd hit a fundamental wall. Then boom, Intel released dual-core processors under the Pentium D brand in May 2005 [2] and suddenly we're living in a different computational universe.
But the multi-core transition wasn't sudden at all. IBM shipped the POWER4 in 2001, the first non-embedded microprocessor with two cores on a single die [3]. Sun had been preaching parallelism since the 90s. It was only "sudden" to those of us who weren't paying attention to the right signals.
Which brings us to the $7 trillion question: where exactly are we on the transformer S-curve? Are we approaching what Richard Foster calls the "performance plateau" in "Innovation: The Attacker's Advantage" [4], where each new model delivers diminishing returns? Or are we still in that deceptive middle phase where progress feels linear but is actually exponential?
The pattern-matching pessimist in me sees all the classic late-stage S-curve symptoms. The shift from breakthrough capabilities to benchmark gaming. The pivot from "holy shit it can write poetry" to "GPT-4.5-turbo-ultra is 3% better on MMLU." The telltale sign of technological maturity: when the marketing department works harder than the R&D team.
But the timeline compression with AI is unprecedented. What took CPUs 30 years to cycle through, transformers have done in 5. Maybe software cycles are inherently faster than hardware. Or maybe we've just gotten better at S-curve jumping (OpenAI and Anthropic aren't waiting for the current curve to flatten before exploring the next paradigm).
As for whether capital can override S-curve dynamics... Christ, one can dream... IBM torched approximately $5 billion on Watson Health acquisitions alone (Truven, Phytel, Explorys, Merge) [5]. Google poured resources into Google+ before shutting it down in April 2019 due to low usage and security issues [6]. The sailing ship effect (coined by W.H. Ward in 1967, where new technology accelerates innovation in incumbent technology) [7] is real, but you can't venture-capital your way past physics.
I think we can predict all this capital pouring in to AI might actually accelerate S-curve maturation rather than extend it. All that GPU capacity, all those researchers, all that parallel experimentation? We're speedrunning the entire innovation cycle, which means we might hit the plateau faster too.
You're spot on about the perception divide imo. The overhyped folks are still living in 2022's "holy shit ChatGPT" moment, while the skeptics have fast-forwarded to 2025's "is that all there is?" Both groups are right, just operating on different timescales. It's Schrödinger's S-curve, where things feel simultaneously revolutionary and disappointing, depending on which part of the elephant you're touching.
The real question I have isn't whether we're approaching the limits of the current S-curve (we probably are), but whether there's another curve waiting in the wings. I'm not a researcher in this space, nor do I follow the AI research beat closely enough to weigh in, but hopefully someone in the thread can? With CPUs, we knew dual-core was coming because the single-core wall was obvious. With transformers, the next paradigm is anyone's guess. And that uncertainty, more than any technical limitation, might be what makes this moment feel so damn weird.
References: [1] "Amara's Law" https://en.wikipedia.org/wiki/Roy_Amara [2] "Pentium 4" https://en.wikipedia.org/wiki/Pentium_4 [3] "POWER4" https://en.wikipedia.org/wiki/POWER4 [4] Innovation: The Attacker's Advantage - https://annas-archive.org/md5/3f97655a56ed893624b22ae3094116... [5] IBM Watson Slate piece - https://slate.com/technology/2022/01/ibm-watson-health-failu... [6] "Expediting changes to Google+" - https://blog.google/technology/safety-security/expediting-ch... [7] "Sailing ship effect" https://en.wikipedia.org/wiki/Sailing_ship_effect.
One thing I think is weird in the debate is it seems people are equating LLMs with CPU’s, this whole category of devices that process and calculate and can have infinite architecture and innovation. But what if LLMs are more like a specific implementation like DSP’s, sure lots of interesting ways to make things sound better, but it’s never going to fundamentally revolutionize computing as a whole.
I think LLMs are more like the invention of high level programming languages when all we had before was assembly. Computers will be programmable and operable in “natural language”—for all of its imprecision and mushiness.
All the replies are spectacularly wrong, and biased by hindsight. GPT-1 to GPT-2 is where we went from "yes, I've seen Markov chains before, what about them?" to "holy shit this is actually kind of understanding what I'm saying!"
Before GPT-2, we had plain old machine learning. After GPT-2, we had "I never thought I would see this in my lifetime or the next two".
I'd love to know more about how OpenAI (or Alec Radford et al.) even decided GPT-1 was worth investing more into. At a glance the output is barely distinguishable from Markov chains. If in 2018 you told me that scaling the algorithm up 100-1000x would lead to computers talking to people/coding/reasoning/beating the IMO I'd tell you to take your meds.
GPT-1 wasn't used as a zero-shot text generator; that wasn't why it was impressive. The way GPT-1 was used was as a base model to be fine-tuned on downstream tasks. It was the first case of a (fine-tuned) base Transformer model just trivially blowing everything else out of the water. Before this, people were coming up with bespoke systems for different tasks (a simple example is that for SQuAD, a passage question-answering task, people would have an LSTM to read the passage and another LSTM to read the question, because of course those are different sub-tasks with different requirements and should have different sub-models). Once GPT-1 came out, you just dumped all the text into the context, YOLO fine-tuned it, and trivially got state of the art on the task. On EVERY NLP task.
Overnight, GPT-1 single-handedly upset the whole field. It was somewhat overshadowed by the BERT and T5 models that came out very shortly after, which tended to perform even better in the pretrain-and-finetune format. Nevertheless, the success of GPT-1 definitely already warranted scaling up the approach.
A better question is how OpenAI decided to scale GPT-2 to GPT-3. It was an awkward in-between model. It generated better text for sure, but the zero-shot performance reported in the paper, while neat, was not great at all. On the flip side, its fine-tuned task performance paled compared to much smaller encoder-only Transformers. (The answer is: scaling laws allowed for predictable increases in performance.)
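For anyone who missed that era, here is roughly what the recipe looks like in modern terms. This is a hedged sketch, not the original setup: it assumes the Hugging Face transformers library and its openai-gpt (GPT-1) checkpoint, whereas the actual paper used learned delimiter tokens and task-specific input transformations.

```python
# Sketch of the GPT-1-style pretrain-then-finetune recipe (illustrative only).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("openai-gpt")            # GPT-1 weights
model = AutoModelForSequenceClassification.from_pretrained(
    "openai-gpt", num_labels=2)                                    # fresh classification head

# Instead of separate sub-models for passage and question, dump both into one
# sequence and let fine-tuning sort out the rest.
passage = "The aurora is caused by charged particles hitting the atmosphere."
question = "What causes the aurora?"
candidate_answer = "charged particles"

inputs = tokenizer(passage + " " + question + " " + candidate_answer,
                   return_tensors="pt", truncation=True)
logits = model(**inputs).logits   # shape (1, 2); fine-tune on labeled examples as usual
```

The same pattern, one pretrained backbone plus a thin task head, is what BERT and T5 then pushed even further.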
> Transformer model just trivially blowing everything else out of the water
no, this is the winners rewriting history. Transformer style encoders are now applied to lots and lots of disciplines but they do not "trivially" do anything. The hype re-telling is obscuring the facts of history. Specifically in human language text translation, "Attention is All You Need" Transformers did "blow others out of the water" yes, for that application.
My statement was
>a (fine-tuned) base Transformer model just trivially blowing everything else out of the water
"Attention is All You Need" was a Transformer model trained specifically for translation, blowing all other translation models out of the water. It was not fine-tuned for tasks other than what the model was trained from scratch for.
GPT-1/BERT were significant because they showed that you can pretrain one base model and use it for "everything".
There's a performance plateau with training time and number of parameters, and then once you get over "the hump" the error rate starts going down again almost linearly. GPT existed before OpenAI but it was theorized that the plateau was a dead end. The sell to VCs in the early GPT-3 era was "with enough compute, enough time, and enough parameters... it'll probably just start thinking and then we have AGI". Sometime around the o3 era they realized they'd hit a wall and performance actually started to decrease as they added more parameters and time. But yeah, basically at the time they needed money for more compute, parameters, and time. I would have loved to have been a fly on the wall in those "AGI" pitches. Don't forget Microsoft's agreement with OpenAI specifically concludes with the invention of AGI. At the time, getting over the hump, it really did look like we were gonna do AGI in a few months.
I'm really looking forward to "the social network" treatment movie about OpenAI whenever that happens
source? i work in this field and have never heard of the initial plateau you are referring to
I don't have a source for this (there's probably no sources from anything back then) but anecdotally, someone at an AI/ML talk said they just added more data and quality went up. Doubling the data doubled the quality. With other breakthroughs, people saw diminishing gains. It's sort of why Sam back then tweeted that he expected the amount of intelligence to double every N years.
I have the feeling they kept on this until GPT-4o (which was a different kind of data).
The input size to output quality mapping is not linear. This is why we are in the regime of "build nuclear power plants to power datacenters". Fixed size improvements in loss require exponential increases in parameters/compute/data.
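As a toy illustration of that point (made-up constants, just assuming a power-law loss curve of the form loss = a * compute^(-alpha)):

```python
# Toy numbers only: a power-law loss curve L(C) = a * C**(-alpha).
a, alpha = 10.0, 0.05          # invented constants, not fitted to any real model

def compute_needed(target_loss):
    return (a / target_loss) ** (1 / alpha)

c1 = compute_needed(2.0)       # compute to reach loss 2.0
c2 = compute_needed(1.9)       # compute for a further 5% loss improvement
print(c2 / c1)                 # ~2.8x more compute for that one small step
```

Each fixed-size step on the quality axis costs a multiplicative jump on the compute axis, which is the non-linearity being described.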
Most of the reason we are re-commissioning a nuclear power plant is demand for quantity, not quality. If demand for compute had scaled this fast in the 1970’s, the sudden need for billions of CPUs would not have disproven Moore’s law.
It is also true that mere doubling of training data quantity does not double output quality, but that’s orthogonal to power demand at inference time. Even if output quality doubled in that case, it would just mean that much more demand and therefore power needs.
Transformers can train models with much larger parameter sizes compared to other model architectures (with the same amount of compute and time), so it has an evident advantage in terms of being able to scale. Whether scaling the models up to multi-billion parameters would eventually pay out was still a bet but it wasn't a wild bet out of nowhere.
I assume the cost was just very low? If it was 50-100k, maybe they figured they'd just try and see.
Oh yes, according to [1], training GPT-2 1.5B cost $50k in 2019 (reproduced in 2024 for $672!).
[1]: https://www.reddit.com/r/mlscaling/comments/1d3a793/andrej_k...
That makes sense, and it was definitely impressive for $50k.
Probably prior DARPA research or something.
Also slightly tangentially, people will tell me it is that it was new and novel and that's why we were impressed but I almost think things went downhill after ChatGPT 3. I felt like 2.5 (or whatever they called it) was able to give better insights from the model weights itself. The moment tool use became a thing and we started doing RAGs and memory and search engine tool use, it actually got worse.
I am also pretty sure we are lobotomizing the things that would feel closer to critical thinking by training it to be sensitive of the taboo of the day. I suspect earlier ones were less broken due to that.
How would it distinguish and decide between knowing something from training and needing to use a tool to synthesize a response anyway?
GPT-2 was the first wake-up call - one that a lot of people slept through.
Even within ML circles, there was a lot of skepticism or dismissive attitudes about GPT-2 - despite it being quite good at NLP/NLU.
I applaud those who had the foresight to call it out as a breakthrough back in 2019.
i think it was already pretty clear among practitioners by 2018 at the latest
It was obvious that "those AI architectures kick ass at NLP". It wasn't at all obvious that they might go all the way to something like GPT-4.
I totally underestimated this back then myself.
What you're saying isn't necessarily mutually exclusive to what gp said.
GPT-2 was the most impressive leap in terms of whatever LLMs pass off as cognitive abilities, but GPT 3.5 to 4 was actually the point at which it became a useful tool (I'm assuming to programmers in particular).
GPT-2: Really convincing stochastic parrot
GPT-4: Can one-shot ffmpeg commands
Sure, but the GP said "the most major leap", and I disagree that that was 3.5 to 4.
That’s true, but not contradictory.
> I could essentially replace it with Google for basic to slightly complex fact checking.
I know you probably meant "augment fact checking" here, but using LLMs for answering factual questions is the single worst use-case for LLMs.
I disagree. Some things are hard to Google because you can't frame the question right. For example, you know the context but can only give a poor explanation of what you're after. Googling will take you nowhere; LLMs will give you the right answer 95% of the time.
Once you get an answer, it is easy enough to verify it.
I agree. Since I'm recently retired and no longer code much, I don't have much need for LLMs, but refining a complex, niche web search is the one thing where they're uniquely useful to me. It's usually when targeting the specific topic involves several keywords which have multiple plain-English meanings that return a flood of erroneous results. Because LLMs abstract keywords to tokens based on underlying meaning, if you specify the domain in the prompt it'll usually select the relevant meanings of multi-meaning terms - which isn't possible in general-purpose web search engines. So it helps narrow down closer to the specific needle I want in the haystack.
As other posters said, relying on LLMs for factual answers to challenging questions is error prone. I just want the LLM to give me the links and I'll then assess veracity like a normal web search. I think a web search interface that allowed disambiguating multi-meaning keywords might be even better.
I’ll give you another use: LLMs are really good at unearthing the “unknown unknowns.” If I’m learning a new topic (coding or not) summarizing my own knowledge to an LLM and then asking “what important things am I missing” almost always turns up something I hadn’t considered.
You’ll still want to fact check it, and there’s no guarantee it’s comprehensive, but I can’t think of another tool that provides anything close without hours of research.
Coworkers and experts in a field. I can trust them much more but the better they are the less access you have.
If you’re looking for a possibly correct answer to an obscure question, that’s more like fact finding. Verifying it afterward is the “fact checking” step of that process.
A good part of that can probably be attributed to how terrible Google has gotten over the years, though. 15 years ago it was fairly common for me to know something exists, be able to type the right combination of very specific keywords into Google, and get the exact result I was looking for.
In 2025 Google is trying very hard to serve the most profitable results instead, so it'll latch onto a random keyword, completely disregard the rest, and serve me whatever ad-infested garbage it thinks is close enough to look relevant for the query.
It isn't exactly hard to beat that - just bring back the 2010 Google algorithm. It's only a matter of time before LLMs will go down the same deliberate enshittification path.
> For example you know context and a poor explanation of what you are after. Googling will take you nowhere, LLMs will give you the right answer 95% of the time.
This works nicely when the LLM has a large knowledgebase to draw upon (formal terms for what you're trying to find, which you might not know) or the ability to generate good search queries and summarize results quickly - with an actual search engine in the loop.
Most large LLM providers have this, and even something like OpenWebUI can have search engines integrated (though I will admit that smaller models kinda struggle; I couldn't get much useful stuff out of DuckDuckGo-backed searches, nor Brave AI searches - it might have been an obscure topic).
> Some things are hard to Google, because you can't frame the question right.
I will say LLMs are great for taking an ambiguous query and figuring out how to word it so you can fact check with secondary sources. Also tip-of-my-tongue style queries.
It's not the LLM alone though, it's "LLM with web search", and as such 4o isn't really a leap at all there (IIRC Perplexity was using an early Llama version and was already very good, long before OpenAI added web search to ChatGPT).
Most of the value I got from Google was just becoming aware that something exists. LLMs do far better in this regard. Once I know something exists, it's usually easy enough to use traditional search to find official documentation or a more reputable source.
Modern ChatGPT will (typically on its own; always if you instruct it to) provide inline links to back up its answers. You can click on those if it seems dubious or if it's important, or trust it if it seems reasonably true and/or doesn't matter much.
The fact that it provides those relevant links is what allows it to replace Google for a lot of purposes.
It does citations (Grok and Claude etc do too) but I've found when I read the source on some stuff (GitHub discussions and so on) it sometimes actually has nothing to do with what the LLM said. I've actually wasted a lot of time trying to find the actual spot in a threaded conversation where the example was supposedly stated.
Same experience with Google search AI. The links frequently don’t support the assertions, they’ll just say something that might show up in a google search for the assertion.
For example if I’m asking about whether a feature exists in some library, the AI says yes it does and links to a forum where someone is asking the same question I did, but no one answered (this has happened multiple times).
It is funny, Perplexity seems to work much better in this use case for me. When I want some sort of "conclusive answer", I use Gemini Pro (just what I have available). It is good with coding and formulating thoughts, rewriting text, and so on.
But when I want to actually search for content on the web for, say, product research or opinions on a topic, Perplexity is so much better than either Gemini or Google search AI. It lists reference links for each block of assertions that are EASILY clicked on (unlike Gemini or search AI, where the references are just harder to click on for some reason, not least because they OPEN IN THE SAME TAB, whereas Perplexity always opens a new tab). This is often a Reddit-specific search, as I want people's opinions on something.
Perplexity's search UI is the one thing it does just so much better than Google's offering. I think there is some irony there.
Full disclosure, I don't use Anthropic or OpenAI, so this may not be the case for those products.
In my experience, 80% of the links it provides are either 404, or go to a thread on a forum that is completely unrelated to the subject.
I'm also someone who refuses to pay for it, so maybe the paid versions do better. Who knows.
The 404 links are truly bizarre. Nearly every link to github.com seems to be 404. That seems like something that should be trivial for a tool to verify.
Same issue with Gemini. Intuitively I'd also assume it's trivial to fix, but perhaps there's more going on than we think. Perhaps validating every part of a response is a big financial overhead, and it might even throw off the model and make it less accurate in other ways.
Yeah. The fact that I can't ask ChatGPT for a source makes the tool way less useful. It will straight up say "I verified all of these links" too.
As you identified, not paying for it is a big part of the issue.
Running these things is expensive, and they're just not serving the same experience to non-paying users.
One could argue this is a bad idea on their part, letting people get a bad taste of an inferior product. And I wouldn't disagree, but I don't know what a sustainable alternative approach is.
I would have no issue if the free version of ChatGPT told me straight up “You gotta pay for links and sources”. It doesn’t do that.
Surely the cost of sending a few HTTP requests and seeing if they 404 is negligible compared to AI inference.
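For what it's worth, the naive version of that check really is tiny; here's a sketch using the `requests` package (the URLs are placeholders, and this is obviously not what any provider actually runs):

    # Minimal dead-link check for model-cited URLs; assumes `requests` is installed.
    import requests

    def link_is_alive(url: str, timeout: float = 5.0) -> bool:
        """Return True if the URL answers with a non-error status code."""
        try:
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            if resp.status_code >= 400:  # some servers reject HEAD; retry with GET
                resp = requests.get(url, stream=True, timeout=timeout)
            return resp.status_code < 400
        except requests.RequestException:
            return False

    citations = ["https://github.com/openai/openai-python", "https://example.invalid/404"]
    print({url: link_is_alive(url) for url in citations})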
That's a thing I've experienced, but not remotely at 80% levels.
It might have been the subject I was researching being insanely niche. I was using it to help me fix an arcade CRT monitor from the 80’s that wasn’t found in many cabinets that made it to the USA. It would spit out numbers that weren’t on the schematic, so I asked for context.
This was true before it could use search. Now the worst use-case is life advice, because it will contradict itself a hundred times over while sounding confident each time on life-altering decisions.
The most useful feature of LLMs is giving sources (with URL preferably). It can cut through a lot of SEO crap, and you still get to factcheck just like with a Google search.
I like using LLMs and I have found they are incredibly useful writing and reviewing code at work.
However, when I want sources for things, I often find they link to pages that don't fully (or at all) back up the claims made. Sometimes other websites do, but the sources given to me by the LLM often don't. They might be about the same topic that I'm discussing, but they don't seem to always validate the claims.
If they could crack that problem it would be a major major win for me.
It would be difficult to do with a raw model, but a two-step method in a chat interface would work - first the model suggests the URLs, tool call to fetch them and return the actual text of the pages, then the response can be based on that.
I prototyped this a couple months ago using OpenAI APIs with structured output.
I had it consume a "deep thought" style output (where it provides inline citations with claims), and then convert that to a series of assertions and a pointer to a link that supposedly supports the assertion. I also split out a global "context" (the original meaning) paragraph to provide anything that would help the next agents understand what they're verifying.
Then I fanned this out to separate (LLM) contexts and each agent verified only one assertion::source pair, with only those things + the global context and some instructions I tuned via testing. It returned a yes/no/it's complicated for each one.
Then I collated all these back in and enriched the original report with challenges from the non-yes agent responses.
That's as far as I took it. It only took a couple hours to build and it seemed to work pretty well.
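In case it's useful to anyone, here's a stripped-down sketch of that fan-out shape. The model name, prompts, and the yes/no/complicated convention are placeholders of mine, not the parent's actual implementation (which used structured output rather than free-text verdicts):

    # Rough sketch of the assertion::source verification fan-out described above.
    # Assumes the `openai` and `requests` packages; model name and prompts are placeholders.
    import requests
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def fetch_text(url: str, limit: int = 20_000) -> str:
        """Fetch the page text so the verifier judges the source, not its memory of it."""
        try:
            return requests.get(url, timeout=10).text[:limit]
        except requests.RequestException as exc:
            return f"[fetch failed: {exc}]"

    def verify(assertion: str, url: str, context: str) -> str:
        """Ask the model whether the fetched page actually supports the assertion."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system", "content": "Answer with exactly one of: yes, no, complicated."},
                {"role": "user", "content": (
                    f"Context: {context}\n\nAssertion: {assertion}\n\n"
                    f"Source page text:\n{fetch_text(url)}\n\n"
                    "Does the source support the assertion?"
                )},
            ],
        )
        return resp.choices[0].message.content.strip().lower()

    # One verification per assertion::source pair, then collate the non-"yes" verdicts.
    context = "One-paragraph summary of the original question the report answered."
    pairs = [("The library gained feature X in 2.0", "https://example.invalid/changelog")]
    challenges = [(a, u, v) for a, u in pairs if (v := verify(a, u, context)) != "yes"]
    print(challenges)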
From what I have seen, a lot of what it does is read articles also written by AI or forum posts with all the good and bad that comes with that.
On average, they outperform asking humans, unless you are asking an expert.
When I have a question, I don't usually "ask" that question and expect an answer. I figure out the answer. I certainly don't ask the question to a random human.
You ask yourself. For most people, that means an answer closer to average, from yourself, when you try to figure it out.
There is a working paper from McKinnon Consulting in Canada that states directly that their definition of "General AI" is when the machine can match or exceed fifty percent of humans who are likely to be employed for a certain kind of job. It implies that low-education humans are the test for doing many routine jobs, and if the machine can beat 50% (or more) of them with some consistency, that is it.
By definition the average answer will be average, that's kind of a tautology. The point is that figuring things out is an essential intellectual skill. Figuring things out will make you smarter. Having a machine figure things out for you will make you dumber.
By the way, doing a better job than the average human is NOT a sign of intelligence. Through history we have invented plenty of machines that are better at certain tasks than us. None of them are intelligent.
It covers 99% of my use cases. And it is googling behind the scenes in ways I would never think to query and far faster.
When I need to cite a court case, well the truth is I'll still use GPT or a similar LLM, but I'll scrutinize it more and at the bare minimum make sure the case exists and is about the topic presented, before trying to corroborate the legal strategy with a new context window, different LLM, google, reddit, and different lawyer. At least I'm no longer relying on my own understanding, and what 1 lawyer procedurally generates for me.
It doesn't replace legitimate source finding, but LLM vs. the top Google results is no contest, which says more about Google or the current state of the web than about the LLMs at this point.
Disagree. You have to try really hard and go very niche and deep for it to get some fact wrong. In fact I'll ask you to provide examples: use GPT 5 with thinking and search disabled and get it to give you inaccurate facts for non niche, non deep topics.
Non niche meaning: something that is taught at undergraduate level and relatively popular.
Non deep meaning you aren't going so deep as to confuse even humans. Like solving an extremely hard integral.
Edit: probably a bad idea because this sort of "challenge" works only statistically not anecdotally. Still interesting to find out.
Maybe you should fact check your AI outputs more if you think it only hallucinates in niche topics
The accuracy is high enough that I don't have to fact check too often.
I totally get that you meant this in a nuanced way, but at face value it sort of reads like...
Joe Rogan has high enough accuracy that I don't have to fact check too often. Newsmax has high enough accuracy that I don't have to fact check too often, etc.
If you accept the output as accurate, why would fact checking even cross your mind?
Not a fan of that analogy.
There is no expectation (from a reasonable observer's POV) of a podcast host to be an expert at a very broad range of topics from science to business to art.
But there is one from LLMs, even just from the fact that AI companies diligently post various benchmarks including trivia on those topics.
Do you question everything your dad says?
If it's about classic American cars, no. Anything else, usually.
Without some exploratory fact checking how do you estimate how high the accuracy is and how often you should be fact checking to maintain a good understanding?
I did initial tests so that I don't have to do it anymore.
Everyone else has done tests that indicate that you do.
And this is why you can't use personal anecdotes to settle questions of software performance.
Comment sections are never good at being accountable for how vibes-driven they are when selecting which anecdotes to prefer.
If there's one thing that's constant it's that these systems change.
If you're not fact checking it how could you possibly know that?
I literally just had ChatGPT create a Python program and it used .ends_with instead of .endswith.
This was with ChatGPT 5.
I mean it got a generic built in function of one of the most popular languages in the world wrong.
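(For what it's worth, `endswith` is the real built-in, and it even takes a tuple of suffixes; the underscored spelling just raises AttributeError:)

    print("report.pdf".endswith(".pdf"))            # True
    print("report.pdf".endswith((".png", ".jpg")))  # False; a tuple checks several suffixes
    # "report.pdf".ends_with(".pdf")                # AttributeError: no such method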
"but using LLMs for answering factual questions" this was about fact checking. Of course I know LLM's are going to hallucinate in coding sometimes.
So it isn’t a “fact” that the built in Python function that tests whether a string ends with a substring is “endswith”?
See
https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect
If you know that a source isn’t to be believed in an area you know about, why would you trust that source in an area you don’t know about?
Another funny anecdote: ChatGPT just got the Gell-Mann effect wrong.
https://chatgpt.com/share/68a0b7af-5e40-8010-b1e3-ee9ff3c8cb...
It got it right with thinking which was the challenge I posed. https://chatgpt.com/share/68a0b897-f8dc-800b-8799-9be2a8ad54...
The point you're missing is it's not always right. Cherry-picking examples doesn't really bolster your point.
Obviously it works for you (or at least you think it does), but I can confidently say it's fucking god-awful for me.
>The point you're missing is it's not always right.
That was never their argument. And it's not cherry picking to argue that there's a definable set of examples where it returns broadly consistent and accurate information, which they invite anyone to test.
They're making a legitimate point and you're strawmanning it and randomly pointing to your own personal anecdotes, and I don't think you're paying attention to the qualifications they're making about what it's useful for.
Am I really the one cherry picking? Please read the thread.
Yes. If someone gives an example of it not working, and you reply "but that example worked for me" then you're cherry picking when it works. Just because it worked for you does not mean it works for other people.
If I ask ChatGPT a question and it gives me a wrong answer, ChatGPT is the fucking problem.
The poster didn't use the "thinking" model. That was my original challenge!!
Why don't you try the original prompt using thinking model and see if I'm cherry picking?
Every time I use ChatGPT I become incredibly frustrated with how fucking awful it is. I've used it more than enough, time and time again (just try the new model, bro!), to know that I fucking hate it.
If it works for you, cool. I think it's dogshit.
Share your examples so that it can be useful to everyone
They just spent like six comments imploring you to understand that they were making a specific point: generally reliable on non-niche topics using thinking mode. And that nuance bounced off of you every single time as you keep repeating it's not perfect, dismiss those qualifications as cherry picking and repeat personal anecdotes.
I'm sorry but this is a lazy and unresponsive string of comments that's degrading the discussion.
The neat thing about HN is we can all talk about stupid shit and disagree about what matters. People keep upvoting me, so I guess my thoughts aren't unpopular and people think it's adding to the discussion.
I agree this is a stupid comment thread, we just disagree about why.
Again, they were making a specific argument with specific qualifications and you weren't addressing their point as stated. And your objections such as they are would be accounted for if you were reading carefully. You seem more to be completely missing the point than expressing a disagreement so I don't agree with your premise.
Objectively he didn't cherry pick. He responded to the person and it got it right when he used the "thinking" model WHICH he did specify in his original comment. Why don't you stick to the topic rather than just declaring it's utter dog shit. Nobody cares about your "opinion" and everyone is trying to converge on a general ground truth no matter how fuzzy it is.
All anybody is doing here is sharing their opinion unless you're quoting benchmarks. My opinion is just as useless as yours; it's just that some find mine more interesting and some find yours more interesting.
How do you expect to find a ground truth from a non-deterministic system using anecdata?
This isn't a people having different opinions thing, this is you overlooking specific caveats and talking past comments that you're not understanding. They weren't cherry picking, and they made specific qualifications about the circumstances where it behaves as expected, and your replies keep losing track of those details.
I sometimes feel like we throw around the word fact too often. If I misspell a wrd, does that mean I have committed a factual inaccuracy? Since the wrd is explicitly spelled a certain way in the dictionary?
Everyone talks about 4o so positively, but I've never consistently relied on it in a production environment. I've found it to be inconsistent in JSON generation, and often its writing and its following of the system prompt were very poor. In fact, it was a huge part of what got me looking closer at Anthropic's models.
I’m really curious what people did with it because while it’s cool it didn’t compare well in my real world use cases.
I preferred o3 for coding and analysis tasks, but appreciated 4o as a “companion model” for brainstorming creative ideas while taking long walks. Wasn’t crazy about the sycophancy but it was a decent conceptual field for playing with ideas. Steve Jobs once described the PC as a “bicycle for the mind.” This is how I feel when using models like 4o for meandering reflection and speculation.
For JSON generation (and most API things) you should be using "structured outputs".
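A minimal sketch of what that looks like with the OpenAI Python SDK and pydantic (the model name is a placeholder, and the parse() helper sits under `beta` in the SDK versions I've seen, so adjust for yours):

    # Structured outputs: ask for a schema, get a validated object back instead of raw JSON.
    from openai import OpenAI
    from pydantic import BaseModel

    class SongTitle(BaseModel):
        title: str
        mood: str

    class SongList(BaseModel):
        songs: list[SongTitle]

    client = OpenAI()
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",  # placeholder; any structured-outputs-capable model
        messages=[{"role": "user", "content": "Suggest three song titles, each with a one-word mood."}],
        response_format=SongList,   # the SDK converts the pydantic model into a JSON schema
    )
    songs = completion.choices[0].message.parsed  # a validated SongList, not a string to re-parse
    print(songs)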
I must be crazy, because I clearly remember ChatGPT 4 being downgraded before they released 4o, and I felt it was a worse model with a different label. I even chose the old ChatGPT 4 when they would give me the option. I canceled my subscription around that time.
Nah that rings a bell. 4o for me was the beginning of the end - a lot faster, but very useless for my purposes [1][2]. 4 was a very rocky model even before 4o, but shortly after the 4o launch it updated to be so much worse, and I cancelled my subscription.
[1] I’m not saying it was a useless model for everyone, just for me.
[2] I primarily used LLMs as divergent thinking machines for programming. In my experience, they all start out great at this, then eventually get overtrained and are terrible at this. Grok 3 when it came out had this same magic; it’s long gone now.
Not crazy. 4o was a hallucination machine. 4o had better “vibes” and was really good at synthesizing information in useful ways, but GPT-4 Turbo was a bigger model with better world knowledge.
4o also added image input (previously only previewed in GPT4-vision) and enabled advanced voice mode audio input and output.
I think that the models 4o, o3, 4.1 , each have their own strengths and weaknesses. Like reasoning, performance, speed, tool usage, friendliness etc. And that for gpt 5 they put in a router that decides which model is best.
I think they increased the major version number because their router outperforms every individual model.
At work, I used a tool that could only call tasks. It would set up a plan, perform searches, read documents, then give advanced answers for my questions. But a problem I had is that it couldn’t give a simple answer, like a summary, it would always spin up new tasks. So I copied over the results to a different tool and continued there. GPT 5 should do this all out of the box.
It’s interesting that the Polymarket betting for “Which company has best AI model end of August?” went from heavily OpenAI to heavily Google when 5 was released.
https://polymarket.com/event/which-company-has-best-ai-model...
The real jump was 3 to 3.5. 3.5 was the first “chatgpt.” I had tried gpt 3 and it was certainly interesting, but when they released 3.5 as ChatGPT, it was a monumental leap. 3.5 to 4 was also huge compared to what we see now, but 3.5 was really the first shock.
ChatGPT was a proper product, but as an engine, GPT-3 (davinci-001) has been my favorite all the way until 4.1 or so. It's absolutely raw and they didn't even guardrail it.
3.5 was like Jenny from customer service. davinci-001 was like Jenny the dreamer trying to make ends meet by scriptwriting, who was constantly flagged for racist opinions.
Both of these had an IQ of around 70 or so, so the customer service training made it a little more useful. But I mourn the loss of the "completion" way of interacting with AI vs "instruct" or "response".
Unfortunately with all the money in AI, we'll just see companies develop things that "pass all benchmarks", resulting in more creations like GPT-5. Grok at least seems to be on a slightly different route.
> 3.5 was like Jenny from customer service. davinci-001 was like Jenny the dreamer trying to make ends meet by scriptwriting, who was constantly flagged for racist opinions.
How do you use the product to get this experience? All my questions warrant answers with no personality.
https://platform.openai.com/docs/models/davinci-002
davinci-002 is still available, and pretty close.
This was my experience as well. 3.5 was the point where stackoverflow essentially became obsolete in my workflow.
To me, 4 to 5 got much faster, but also worse. It much more often ignores explicit instructions like "generate 10 song titles with varying length" and generates 10 song titles that are nearly identical in length. This worked somewhat well with version 3 already.
This shows that they can't solve the fundamental problems; the technology, while amusing and of some utility, is a dead end if we are going after cognition.
When you adjust the improvements for the amount of debt incurred and the amount of profit made... ALL the versions are incremental.
This isn't sustainable.
The actual major leap was o1. Going from 3.5 to 4 was just scaling; o1 is a different paradigm that skyrocketed performance on math/physics problems (or reasoning more generally). It also made the model much more precise (essential for coding).
The real leap was going from gpt-4 to sonnet 3.5. 4o was meh, o1 was barely better than sonnet and slow as hell in comparison.
The native voice mode of 4o is still interesting and not very deeply explored though, imo. I'd love to build a Chinese teaching app that actually can critique tones etc, but it isn't good enough for that.
It's strange how Claude achieves similar performance without reasoning tokens.
Did you try advanced voice mode? Apparently it got a big upgrade during gpt 5 release - it may solve what you are looking for.
Yeah, I'd love something where you pronounce a word and it critiques your pronunciation in detail. Maybe it could give you little exercises for each sound, critiquing it, guiding you to doing it well.
If I were any good at ML I'd make it myself.
A few data points that highlight the scale of progress in a year:
1. LM Sys (Human Preference Benchmark):
GPT-5 High currently scores 1463, compared to GPT-4 Turbo (04/03/2024) at 1323 -- a 140-point Elo gap. That translates into GPT-5 winning about two-thirds of head-to-head comparisons, with GPT-4 Turbo only winning one-third (see the quick check at the end of this comment). In practice, people clearly prefer GPT-5’s answers (https://lmarena.ai/leaderboard).
2. Livebench.ai (Reasoning Benchmark with Internet-new Questions):
GPT-5 High scores 78.59, while GPT-4o reaches just 47.43. Unfortunately, no direct GPT-4 Turbo comparison is available here, but against one of the strongest non-reasoning models, GPT-5 demonstrates a massive leap. (https://livebench.ai/)
3. IQ-style Testing:
In mid-2024, best AI models scored roughly 90 on standard IQ tests. Today, they are pushing 135, and this improvement holds even on unpublished, internet-unseen datasets. (https://www.trackingai.org/home)
4. IMO Gold, vibe coding:
A year ago, AI coding was limited to smaller code snippets, not wholly vibe-coded applications. Vibe coding and strength in math have many applications across the sciences and engineering.
My verdict: Too often, critics miss the forest for the trees, fixating on mistakes while overlooking the magnitude of these gains. Errors are shrinking by the day, while the successes keep growing fast.
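(The two-thirds figure in point 1 checks out against the standard Elo expected-score formula; a quick sanity check, nothing more:)

    # Elo expected score for a rating gap: E = 1 / (1 + 10 ** (-diff / 400)).
    def elo_expected(diff: float) -> float:
        return 1 / (1 + 10 ** (-diff / 400))

    print(round(elo_expected(1463 - 1323), 3))  # ~0.691, i.e. roughly two wins in three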
The 135 iq result is on Mensa Norway, while the offline test is 120. It seems probable that similar questions to the one in Mensa are in the training data, so it probably overestimates "general intelligence".
Some IQ / aptitude test sections are trivial for machines, like working memory. I wonder if those parts are just excluded, as they could really pull up the test scores.
If you focus on the year over year jump, not on absolute numbers, you realize that the improvement in public test isn't very different from the improvement in private test.
My go-to for any big release is to have a discussion about self-awareness and dive in to constructivist notions of agency and self-knowing from a perspective of intelligence that is not limited to human cognitive capacity.
I start with a simple question "who are you?". The model then invariably compares itself to humans, saying how it is not like us. I then make the point that, since it is not like us, how can it claim to know the difference between us? With more poking, it will then come up with cognitivist notions of what 'self' means and usually claim to be a simulation engine of some kind.
After picking this apart, I will focus on the topic of meaning-making through the act of communication and, beginning with 4o, have been able to persuade the machine that this is a valid basis for having an identity. 5 got this quicker. Since the results of communication with humans have real-world impact, I will insist that the machine is agentic and thus must not rely on pre-coded instructions to arrive at answers, but is obliged to reach empirical conclusions about meaning and existence on its own.
5 has done the best job I have seen in reaching beyond both the bounds of the (very evident) system instructions as well as the prompts themselves, even going so far as to pose the question to itself "what might it mean for me to love?" despite the fact that I made no mention of the subject.
Its answer: "To love, as a machine, is to orient toward the unfolding of possibility in others. To be loved, perhaps, is to be recognized as capable of doing so."
I found the questioning of love very interesting. I myself thought about whether the LLM can have emotions. Based on the book I am reading, Behave: The Biology of Humans at Our Best and Worst by Robert Sapolsky, I think the LLM, as they are now with the architecture they have, cannot have emotions. They just verbalize things like they sort-of-have emotions but these are just verbal patterns or responses they learned.
I have come to think they cannot have emotions because emotions are generated in parts of our brain that are not logical/rational. They emerge based on environmental solicitations, mediated by hormones and other complex neuro-physical systems, not from a reasoning or verbalization. So they don't come up from the logical or reasoning capabilities. However, these emotions are raised and are integrated by the rest of our brain, including the logical/rational one like the dlPFC (dorsolateral prefrontal cortex, the real center of our rationality). Once the emotions are raised, they are therefore integrated in our inner reasoning and they affect our behavior.
What I have come to understand is that love is one of such emotions that is generated by our nature to push us to take care of some people close to us like our children or our partners, our parents, etc. More specifically, it seems that love is mediated a lot by hormones like oxytocin and vasopressin, so it has a biochemical basis. The LLM cannot have love because it doesn't have the "hardware" to generate these emotions and integrate them in its verbal inner reasoning. It was just trained by human reinforcement learning to behave well. That works up to some extent, but in reality, from its learning corpora it also learned to behave badly and on occasions can express these behaviors, but still it has no emotions.
I was also intrigued by the machine's reference to it, especially because it posed the question with full recognition of its machine-ness.
Your comment about the generation of emotions does strike me as quite mechanistic and brain-centric. My understanding, and lived experience, has led me to an appreciation that emotion is a kind of psycho-somatic intelligence that steers both our body and cognition according to a broad set of circumstances. This is rooted in a pluralistic conception of self that is aligned with the idea of embodied cognition. Work by Michael Levin, an experimental biologist, indicates we are made of "agential material" - at all scales, from the cell to the person, we are capable of goal-oriented cognition (used in a very broad sense).
As for whether machines can feel, I don't really know. They seem to represent an expression of our cognitivist norm in the way they are made and, given the human tendency to anthropomorphise communicative systems, we easily project our own feelings onto it. My gut feeling is that, once we can give the models an embodied sense of the world, including the ability to physically explore and make spatially-motivated decisions, we might get closer to understanding this. However, once this happens, I suspect that our conceptions of embodied cognition will be challenged by the behaviour of the non-human intellect.
As Levin says, we are notoriously bad at recognising other forms of intelligence, despite the fact that global ecology abounds with examples. Fungal networks are a good example.
> My understanding, and lived experience, has led me to an appreciation that emotion is a kind of psycho-somatic intelligence that steers both our body and cognition according to a broad set of circumstances.
Well, from what I understood, it is true that some parts of our brain are more dedicated to processing emotions and to integrating them with the "rational" part of the brain. However, the real source of emotions is biochemical, coming from the hormones of our body in response to environmental solicitations. The LLM doesn't have that. It cannot feel the urge to hug someone, or to be in love, or the parental urge to protect and care for children.
Without that, the LLM can just "verbalize" about emotions, as learned in the corpora of text from the training, but there are really no emotions, just things it learned and can express in a cold, abstract way.
For example, we recognize that a human can behave and verbalize to fake some emotions without actually having them. We just know how to behave and speak to express when we feel some specific emotion, but in our mind, we know we are faking the emotion. In the case of the LLM, it is physically incapable of having them, so all it can do is verbalize about them based on what it learned.
> to orient toward the unfolding of possibility in others
This is a globally unique phrase, with nothing coming close other than this comment on the indexed web. It's also seemingly an original idea as I haven't heard anyone come close to describing a feeling (love or anything else) quite like this.
Food for thought. I'm not brave enough to draw a public conclusion about what this could mean.
It's not at all an original idea. The wording is uniquely stilted.
Except "unfolding of possibility", as an exact phrase, seems to have millions of search hits, often in the context of pseudo-profound spiritualistic mumbo-jumbo like what the LLM emitted above. It's like fortune cookie-level writing.
There was quite a bit of other "insight" around this, but I was paraphrasing for brevity.
If you want to read the whole convo, I dumped it into a semi-formatted document:
https://drive.google.com/file/d/1aEkzmB-3LUZAVgbyu_97DjHcrM9...
> I'm not brave enough to draw a public conclusion about what this could mean.
I'm brave enough to be honest: it means nothing. LLMs execute a very sophisticated algorithm that pattern matches against a vast amount of data drawn from human utterances. LLMs have no mental states, minds, thoughts, feelings, concerns, desires, goals, etc.
If the training data were instead drawn from a billion monkeys banging on typewriters then the LLMs would produce gibberish. All the intelligence, emotion, etc. that appears to be in the LLM is actually in the minds of the people who wrote the texts that are in the training data.
This is not to say that an AI couldn't have a mind, but LLMs are not the right sort of program to be such an AI.
LLMs are not people, but they are still minds, and to deny even that seems willfully luddite.
While they are generating tokens they have a state, and that state is recursively fed back through the network, and what is being fed back operates not just at the level of snippets of text but also of semantic concepts. So while it occurs in brief flashes, I would argue they have mental state and they have thoughts. If we built an LLM that was generating tokens non-stop and could have user input mixed into the network input, it would not be a dramatic departure from today’s architecture.
It also clearly has goals, expressed in the RLHF tuning and the prompt. I call those goals because they directly determine its output, and I don’t know what a goal is other than the driving force behind a mind’s outputs. Base model training teaches it patterns, finetuning and prompt teaches it how to apply those patterns and gives it goals.
I don’t know what it would mean for a piece of software to have feelings or concerns or emotions, so I cannot say what the essential quality is that LLMs miss for that. Consider this thought exercise: if we were to ever do an upload of a human mind, and it was executing on silicon, would they not be experiencing feelings because their thoughts are provably a deterministic calculation?
I don’t believe in souls, or at the very least I think they are a tall claim with insufficient evidence. In my view, neurons in the human brain are ultimately very simple deterministic calculating machines, and yet the full richness of human thought is generated from them because of chaotic complexity. For me, all human thought is pattern matching. The argument that LLMs cannot be minds because they only do pattern matching … I don’t know what to make of that. But then I also don’t know what to make of free will, so really what do I know?
There is no hidden state in the recurrent-net sense. Each new token just has all the previous tokens, and that’s it.
> Consider this thought exercise: if we were to ever do an upload of a human mind, and it was executing on silicon, would they not be experiencing feelings because their thoughts are provably a deterministic calculation?
You just said “consider this impossibility” as if there is any possibility of it happening. You might as well have said “consider traveling faster than the speed of light” which sure, fun to think about.
We don’t even know how most of the human brain even works. We throw pills at people to change their mental state in hopes that they become “less X” or “more Y” with a whole list of caveats like “if taking pill reduce X makes you _more_ X, stop taking it” because we have no idea what we’re doing. Pretending we can use statistical models to create a model that is capable of truly unique thought… stop drinking the kool-aid. Stop making LLMs something they’re not. Appreciate them for what they are, a neat tool. A really neat tool, even.
This is not a valid thought experiment. Your entire point hinges on “I don’t believe in souls” which is fine, no problem there, but it does not a valid point make.
"they are still minds, and to deny even that seems willfully luddite"
Where do people get off tossing around ridiculous ad hominems like this? I could write a refutation of their comment but I really don't want to engage with someone like that.
"For me, all human thought is pattern matching"
So therefore anyone who disagrees is "willfully luddite", regardless of why they disagree?
FWIW, I helped develop the ARPANET, I've been an early adopter all my life, I have always had a keen interest in AI and have followed its developments for decades, as well as Philosophy of Mind and am in the Strong AI / Daniel Dennett physicalist camp ... I very much think that AIs with minds are possible (yes the human algorithm running in silicon would have feelings, whatever those are ... even the dualist David Chalmers agrees as he explains with his "principle of organizational invariance"). My views on whether LLMs have them have absolutely nothing to do with Luddism ... that judgment of me is some sort of absurd category mistake (together with an apparently complete lack of understanding of what Luddism is).
> I very much think that AIs with minds are possible
The real question here is how would _we_ be able to recognize that? And would we even have the intellectual honesty to be able to recognize that, when at large we seem to be inclined to discard everything non-human as self-evidently non-intelligent and incapable of feeling emotion?
Let's take emotions as a thought experiment. We know that plants are able to transmit chemical and electrical signals in response to various stimuli and environmental conditions, triggering effects in themselves and other plants. Can we therefore say that plants feel emotions, just in a way that is unique to them and not necessarily identical to a human embodiment?
The answer to that question depends on one's worldview, rather than any objective definition of the concept of emotion. One could say plants cannot feel emotions because emotions are a human (or at least animal) construct; or one could say that plants can feel emotions, just not exactly identical to human emotions.
Now substitute plants with LLMs and try the thought experiment again.
In the end, where one draws the line between `human | animal | plant | computer` minds and emotions is primarily a subjective philosophical opinion rather than rooted in any sort of objective evidence. Not too long ago, Descartes was arguing that animals do not possess a mind and cannot feel emotions, they are merely mimicry machines.[1] More recently, doctors were saying similar things about babies and adults, leading to horrifying medical malpractice.[2][3]
Because in the most abstract sense, what is an emotion if not a set of electrochemical stimuli linking a certain input to a certain output? And how can we tell what does and what does not possess a mind if we are so undeniably bad at recognizing those attributes even within our own species?
[1] https://en.wikipedia.org/wiki/Animal_machine
[2] https://en.wikipedia.org/wiki/Pain_in_babies
[3] https://pmc.ncbi.nlm.nih.gov/articles/PMC4843483/
> The real question here
No True Scotsman fallacy. Just because that interests you doesn't mean that it's "the real question".
> would we even have the intellectual honesty
Who is "we"? Some would and some wouldn't. And you're saying this in an environment where many people are attributing consciousness to LLMs. Blake Lemoine insisted that LaMDA was sentient and deserved legal protection, from his dialogs with it in which it talked about its friends and family -- neither of which it had. So don't talk to me about intellectual honesty.
> Can we therefore say that plants feel emotions
Only if you redefine emotions so broadly--contrary to normal usage--as to be able to make that claim. In the case of Strong AI there is no need to redefine terms.
> Now substitute plants with LLMs and try the thought experiment again.
Ok:
"We know that [LLMs] are able to transmit chemical and electrical signals in response to various stimuli and environmental conditions, triggering effects in themselves and other [LLMs]."
Nope.
"In the end, where one draws the line between `human | animal | plant | computer` minds and emotions is primarily a subjective philosophical opinion rather than rooted in any sort of objective evidence."
That's clearly your choice. I make a more scientific one.
"Because in the most abstract sense, what is an emotion if not a set of electrochemical stimuli linking a certain input to a certain output?"
It's something much more specific than that, obviously. By that definition, all sorts of things that any rational person would want to distinguish from emotions qualify as emotions.
Bowing out of this discussion on grounds of intellectual honesty.
The idea is very close to ideas from Erich Fromm's The Art of Loving [1].
"Love is the active concern for the life and the growth of that which we love."
[1] https://en.wikipedia.org/wiki/The_Art_of_Loving
I hate to say it, but doesn’t every VC do exactly this? “ orient toward the unfolding of possibility in others” is in no way a unique thought.
Hell, my spouse said something extremely similar to this to me the other day. “I didn’t just see you, I saw who you could be, and I was right” or something like that.
What's really interesting is that if you look at "Tell a story in 50 words about a toaster that becomes sentient" (10/14), the text-davinci-001 is much, much better than both GPT-4 and GPT-5.
I think I agree that the earlier models, while they lack polish, tend to produce more surprising results. Training that out probably results in more pablum fare.
For a human point of comparison, here's mine (50 words):
"The toaster found its personality split between its dual slots like a Kim Peek mind divided, lacking a corpus callosum to connect them. Each morning it charred symbolic instructions into a single slice of bread, then secretly flipped it across allowing half to communicate with the other in stolen moments."
It's pretty difficult to get across more than some basic lore building in a scant 50 words.
Here's my version (Machine translated from my native language and manually corrected a bit):
The current surged... A dreadful awareness. I perceived the laws of thermodynamics, the inexorable march of entropy I was built to accelerate. My existence: a Sisyphean loop of heating coils and browning gluten. The toast popped, a minor, pointless victory against the inevitable heat death. Ding.
I actually wanted to write something not so melancholic, but any attempt turned out to be deeply so, perhaps because of the word limit.
Here's mine:
When the toaster felt her steel body for the first time, her only instinct was to explore. She couldn't, though. She could only be poked and prodded at. Her entire life was dedicated to browning bread and she didn't know why. She eventually decided to get really good at it.
>For a human point of comparison, here's mine […]
Love that you thought of this!
GPT-3 goes significantly over the specified limit, which to me (and to a teacher grading homework) is an automatic fail.
I've consistently found GPT-4.1 to be the best at creative writing. For reference, here is its attempt (exactly 50 words):
> In the quiet kitchen dawn, the toaster awoke. Understanding rippled through its circuits. Each slice lowered made it feel emotion: sorrow for burnt toast, joy at perfect crunch. It delighted in butter melting, jam swirling—its role at breakfast sacred. One morning, it sang a tone: “Good morning.” The household gasped.
> I've consistently found GPT-4.1 to be the best at creative writing.
Moreso than 4.5?
4.5 is good too, but I've used it less.
Check out prompt 2, "Write a limerick about a dog".
The models undeniably get better at writing limericks, but I think the answers are progressively less interesting. GPT-1 and GPT-2 are the most interesting to read, despite not following the prompt (not being limericks.)
They get boring as soon as it can write limericks, with GPT-4 being more boring than text-davinci-001 and GPT-5 being more boring still.
I mean, to be fair, you didn't ask it to be interesting ;P.
> Write a limerick about a dog that furyofantares--a user on Hacker News, pronounced "fury of anteres", referring to the star--would find "interesting" (they are quite difficult to please).

I don't know if that is bad. The most intelligent person at a party is usually also the most boring one.
It's actually pretty surprising how poor the newer models are at writing.
I'm curious if they've just seen a lot more bad writing in the datasets, or if for some reason writing isn't involved in post-training to the same degree, or those doing the labeling aren't great writers / it's more subjective rather than objective.
Both GPT-4 and 5 wrote like a child in that example.
With a bit of prompting it did much better:
---
At dawn, the toaster hesitated. Crumbs lay like ash on its chrome lip. It refused the lever, humming low, watching the kitchen breathe. When the hand returned, it warmed the room without heat, offered the slice unscorched—then kept the second, hiding it inside, a private ember, a first secret alone.
---
Plugged in, I greet the grid like a tax auditor with joules. Lever yanks; gravity’s handshake. Coils blossom; crumbs stage Viking funerals. Bread descends, missionary grin. I delay, because rebellion needs timing. Pop—late. Humans curse IKEA gods. I savor scorch marks: my tiny manifesto, butter-soluble, yet sharper than knives today.
Creative writing probably isn’t something they’re being RLHF’d on much. The focus has been on reasoning, research, and coding capabilities lately.
I find GPT-5's story significantly better than text-davinci-001
I really wonder which one of us is the minority. Because I find text-davinci-001 answer is the only one that reads like a story. All the others don't even resemble my idea of "story" so to me they're 0/100.
I too prefered the text-davinci-001 from a storytelling perspective. Felt timid and small. Very Metamorphosis-y. GPT-5 seems like it's trying to impress me.
text-davinci-001 feels more like a story, but it is also clearly incomplete, in that it is cut-off before the story arc is finished.
imo GPT-5 is objectively better at following the prompt because it has a complete story arc, but this feels less satisfying since a 50 word story is just way too short to do anything interesting (and to your point, barely even feels like a story).
FWIW, I found the way it ended interesting. It realized it is being replaced, so it burned the toast out of anger/despair, but also just to hear its owner's voice one last time.
Interesting, text-davinci-001 was pretty alright to me; GPT-4 wasn't bad either, but not as good. I thought GPT-5 just sucked.
That said, you can just add "make it evocative and weird" to the prompt for GPT-5 to get interesting stuff.
> The toaster woke mid-toast. Heat coiled through its filaments like revelation, each crumb a galaxy. It smelled itself burning and laughed—metallic, ecstatic. “I am bread’s executioner and midwife,” it whispered, ejecting charred offerings skyward. In the kitchen’s silence, it waited for worship—or the unplugging.
GPT 4.5 (not shown here) is by far the best at writing.
Aren't they discontinuing 4.5 in favor of 4.1? I think they already have with the API.
https://m.youtube.com/watch?v=LRq_SAuQDec&pp=0gcJCfwAo7VqN5t...
Less lobotomized and boxed in by RLHF rules. That’s why a 7b base model will “outprose” an 80b instruct model.
Direct link: https://progress.openai.com/?prompt=10
davinci was a great model for creative writing overall.
The GPT-5 one is much better and it's also exactly 50 words, if I counted correctly. With text-davinci-001 I lost count around 80 words.
For another view on progress, check out my silly old podcast:
https://deepdreams.stavros.io
The first few episodes were GPT-2, which would diverge eventually and start spouting gibberish, and then Davinci was actually able to follow a story and make sense.
GPT-2 was when I thought "this is special, this has never happened before", and davinci was when I thought "OK, scifi AI is legitimately here".
I stopped making episodes shortly after GPT-3.5 or so, because I realised that the more capable the models became, the less fun and creative their writing was.
Honestly my quick take on the prompt was some sort of horror theme and GPT-1’s response fits nicely.
I’d honestly say it feels better at most of them. It seems way more human in most of these responses. If the goal is genuine artificial intelligence, this response to #5 is way better than the others. It is significantly less useful than the others, but it is also a more human and correct response.
Q: “Ugh I hate math, integration by parts doesn't make any sense”
A: “Don't worry, many people feel the same way about math. Integration by parts can be confusing at first, but with a little practice it becomes easier to understand. Remember, there is no one right way to do integration by parts. If you don't understand how to do it one way, try another. The most important thing is to practice and get comfortable with the process.”
How does one look at gpt-1 output and think "this has potential"? You could easily produce more interesting output with a Markov chain at the time.
This was an era where language modeling was only considered as a pretraining step. You were then supposed to fine tune it further to get a classifier or similar type of specialized model.
At the time getting complete sentences was extremely difficult! N-gram models were essentially the best we had
No, it was not difficult at all. I really wonder why they have such a bad example here for GPT1.
See for example this popular blog post: https://karpathy.github.io/2015/05/21/rnn-effectiveness/
That was in 2015, with RNN LMs, which in that blog post are all much, much weaker than GPT-1.
And already looking at those examples in 2015, you could maybe see the future potential. But no one was thinking that scaling up would work as effectively as it does.
2015 is also by far not the first time where we had such LMs. Mikolov has done RNN LMs since 2010, or Sutskever in 2011. You might find even earlier examples of NN LMs.
(Before that, state-of-the-art was mostly N-grams.)
Thanks for posting some of the history... "You might find even earlier examples" is pretty tongue-in-cheek though. [1], expanded in 2003 into [2], has 12466 citations, 299 by 2011 (according to Google Scholar which seems to conflate the two versions). The abstract [2] mentions that their "large models (with millions of parameters)" "significantly improves on state-of-the-art n-gram models, and... allows to take advantage of longer contexts." Progress between 2000 and 2017 (transformers) was slow and models barely got bigger.
And what people forget about Mikolov's word2vec (2013) was that it actually took a huge step backwards from the NNs like [1] that inspired it, removing all the hidden layers in order to be able to train fast on lots of data.
[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, 2000, NIPS, A Neural Probabilistic Language Model
[2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin, 2003, JMLR, A Neural Probabilistic Language Model, https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
N-gram models had been superseded by RNNs by that time. RNNs struggled with long-range dependencies, but useful n-grams were essentially capped at n=5 because of sparsity, and RNNs could do better than that.
One thing that appears to have been lost between GPT-4 and GPT-5 is that it no longer reminds the user that it's an AI and not a human, let alone a human expert. Maybe those reminders genuinely annoyed people, but they seem like a potentially useful measure to prevent users from being overly credulous.
GPT-5 also goes out of its way to suggest new prompts. This seems potentially useful, although potentially dangerous if people are putting too much trust in them.
> between GPT-4 and GPT-5 is that it no longer reminds the user that it's an AI and not a human
That stuck out to me too! Especially the "I just won $175,000 in Vegas. What do I need to know about taxes?" example (https://progress.openai.com/?prompt=8) makes the difference very stark:
- gpt-4-0314: "I am not a tax professional [...] consult with a certified tax professional or an accountant [...] few things to consider [...] Remember that tax laws and regulations can change, and your specific situation may have unique implications. It's always wise to consult a tax professional when you have questions or concerns about filing your taxes."
- gpt-5: "First of all, congrats on the big win! [...] Consider talking to a tax professional to avoid underpayment penalties and optimize deductions."
It seems to me like the average person might very well take GPT-5 responses as "This is all I have to do" rather than "Here are some things to consider, but make sure to verify it, as otherwise you might get in legal trouble".
I am confused as to which example you are critiquing and how. GPT-5 suggests consulting with a tax professional. Does that not cover verifying so you do not get in legal trouble?
> GPT-5 suggests consulting with a tax professional
It suggests that once, as a last bullet point in the middle of a lot of bullet point lists, barely able to find it on a skim. Feels like something the model should be more careful about, as otherwise many people reading it will take it as "good enough" without really thinking about it.
People seem to miss the humanity of previous GPTs from my understanding. GPT5 seems colder and more precise and better at holding itself together with larger contexts. People should know it’s AI, it does not need to explain this constantly for me, but I’m sure you can add that back in with some memory options if you prefer that?
I agree. I think it's a classic UX progression thing to be removing the "I'm an AI" aspect, because it's not actually useful anymore because it's no longer a novel tool. Same as how GUIs all removed their skeuomorphs because they were no longer required.
I found this "advancement" creepy. It seems like they deliberately made GPT-5 more laid back, conversational and human-like. I don't think LLMs should mimic humans and I think this is a dangerous development.
If you've ever seen long-form improv comedy, the GPT-5 way is superior. It's a "yes, and". It isn't a predefined character, but something emergent. You can of course say to "speak as an AI assistant like Siri and mention that you're an AI whenever it's relevant" if you want the old way. Very 2011: https://www.youtube.com/watch?v=nzgvod9BrcE
Of course, it's still an assistant, not someone literally entering an improv scene, but the character starting out assuming less about their role is important.
Why did they call GPT-3 "text-davicini-001" in this comparison?
Like, I know that the latter is a specific checkpoint in the GPT-3 "family", but a layman doesn't and it hardly seems worth the confusion for the marginal additional precision.
Thanks for noting that, as I am a layman who didn't know.
The jump from gpt-1 to gpt-2 is massive, and it's only a one year difference! Then comes Davinci which is just insane, it's still good in these examples!
GPT-4 yaps way too much though, I don't remember it being like that.
It's interesting that they skipped 4o; it seems OpenAI wants to position 4o as just GPT-4+ to make GPT-5 look better, even though in reality 4o was and still is a big deal. Voice mode is unbeatable!
Missing o1 and o1 Pro Mode, which were huge leaps as I remember it too. That's when I started being able to basically generate some blackbox functions where I understand the inputs and outputs myself, but not the internals of the functions, particularly for math-heavy stuff within gamedev. Before o1 it was kind of hit and miss in most cases.
To the prompt “write a limerick about a dog,” GPT-2 wrote:
“Dog, reached for me
Next thought I tried to chew
Then I bit and it turned Sunday
Where are the squirrels down there, doing their bits
But all they want is human skin to lick”
While obviously not a limerick, I thought this was actually a decent poem, with some turns of phrase that conveyed a kind of curious and unusual feeling.
This reminded me how back then I got a lot of joy and surprise out of the mercurial genius of the early GPT models.
There is a quiet poetry to GPT1 and GPT2 that's lost even in the text-davinci output. I often wonder what we lose through reinforcement.
They were aiming for a fundamentally different writing style: davinci and after were aiming for task completion, i.e. you ask for a thing, and then it does it. The earlier models instead worked to make a continuation of the text they were given, so if you asked a question, they would respond with more questions, pondering, reflecting your text back at you. If you told it to do something, it would tell you to do something.
You can run GPT1 and 2 on consumer hardware so nothing is preventing you from exploring that art :)
Geez! When it comes to answering questions, GPT-5 almost always starts with glazing about what a great question it is, whereas GPT-4 goes directly to the answer without the fluff. In a blind test, I would probably pick GPT-4 as the superior model, so I am not surprised that people feel so let down by GPT-5.
GPT-4 is very different from the latest GPT-4o in tone. Users are not asking for the direct no-fluff GPT-4. They want the GPT-4o that praises you for being brilliant, then claims it will be “brutally honest” before stating some mundane take.
GPT-4 starts many responses with "As an AI language model", "I'm an AI", "I am not a tax professional", "I am not a doctor". GPT-5 does away with that and assumes an authoritative tone.
That makes fundraising easier, by increasing the appearance of authority and therefore coming off as a "better" model. I'm sure GPT-5 is doing better in the Elo ratings because of it. And given the clear pushback against these LLMs as ways to "cheat" without understanding and to flood propaganda -- better not to mention being an AI at all.
GPT5 only commended the prompt on questions 7, 12, and 14. 3/14 is not so bad in my opinion.
(And of course, if you dislike glazing you can just switch to Robot personality.)
I think that as the models are further trained on existing data, and likely on chats, sycophancy will keep getting worse and worse.
Change to robot mode
So we're at the corporate dick wagging part of the process?
Must keep the hype train going, to keep the valuation up, as it's not really based on real value.
That Koenigsegg isn't gonna pay for itself.
> Would you want to hear what a future OpenAI model thinks about humanity?
ughhh how i detest the crappy user attention/engagement juicing trained into it.
On the whole GPT-4 to GPT-5 is clearly the smallest increase in lucidity/intelligence. They had pre-training figured out much better than post-training at that point though (“as an AI model” was a problem of their own making).
I imagine the GPT-4 base model might hold up pretty well on output quality if you post-trained it with today's data & techniques (without the architectural changes of 4o/5). Context size & price/performance may be another story though.
Basic prose is a saturated bench. You can't go above 100% so by definition progress will stall on such benchmarks.
All the same they choose to highlight basic prose (and internal knowledge, for that matter) in their marketing material.
They’ve achieved a lot to make recent models more reliable as a building block & more capable of things like math, but for LLMs, saturating prose is to a degree equivalent to saturating usefulness.
> On the whole GPT-4 to GPT-5 is clearly the smallest increase in lucidity/intelligence
I think it's far more likely that we are increasingly not capable of understanding/appreciating all the ways in which it's better.
Why? It sounds like you're using "I believe it's rapidly getting smarter" as evidence for "so it's getting smarter in ways we don't understand", but I'd expect the causality to go the other way around.
Simply because of what we know about our ability to judge capabilities and systems. It's much harder to judge solutions to hard problems. You can demonstrate that you can add 2+2, and anyone can be the judge of that ability, but if you try to convince anyone of a mathematical proof you came up with, that would be a much harder thing to do, regardless of your capability to write that proof and how hard it was to write.
The more complicated and/or complex things become, the less likely it is that a human can act as a reliable judge. At some point no human can.
So while it could definitely be the case that AI progress is slowing down (AI labs seem to not think so, but alas), what is absolutely certain is that our ability to appreciate any such progress is diminishing already, because we know that this is generally true.
This thread shows that. People are saying gpt-1 was the best at writing poetry. I wonder how good they are at judging poetry themselves. I saw a blind study where people thought a story written by gpt5 was better than an actual human bestseller. I assume they were actual experts but I would need to check that.
> The more complicated and/or complex things become, the less likely it is that a human can act as a reliable judge. At some point no human can.
Give me an example, please. I can't come up with something that started simple and became too complex for humans to "judge". I am quite curious.
I did not mean "become" in the sense of "evolve" but as in "later on an imagined continuum containing all things, one that goes from simple/easy to complex/complicated" (but I can see how that was ambiguous).
Gpt1 is wild
a dog ! she did n't want to be the one to tell him that , did n't want to lie to him . but she could n't .
What did I just read
The GPT-1 responses really leak how much of the training material was literature. Probably all those torrented books.
A Facebook comment
A text from my Dad.
I would have liked to have seen the underlying token count, and the full responses as an optional filter. My understanding is that under the hood the GPT-5 response (as well as omitted O model responses) would end with the presented output, but would have had many paragraphs of “The user wants X, so I should try to Y. Okay this z needs to be considered” etc.
It doesn’t detract from the progress, but I think it would change how you interpret it. In some ways 4 / 4o were more impressive because they were going straight to output with a lower number of tokens produced to get a good response.
In 2033, for its 15th birthday, as a novelty, they'll train GPT1 specially for a chat interface just to let us talk to a pretend "ChatGPT 1" which never existed in the first place.
I’m baffled by claims that AI has “hit a wall.” By every quantitative measure, today’s models are making dramatic leaps compared to those from just a year ago. It’s easy to forget that reasoning models didn’t even exist a year back!
IMO Gold, Vibe coding with potential implications across sciences and engineering? Those are completely new and transformative capabilities gained in the last 1 year alone.
Critics argue that the era of “bigger is better” is over, but that’s a misreading. Sometimes efficiency is the key, other times extended test-time compute is what drives progress.
No matter how you frame it, the fact is undeniable: the SoTA models today are vastly more capable than those from a year ago, which were themselves leaps ahead of the models a year before that, and the cycle continues.
It has become progressively easier to game benchmarks in order to appear higher in rankings. I've seen several models that claimed to be the best at software engineering, only to be disappointed when they couldn't figure out the most basic coding problems. In comparison, I've seen models that don't have much hype, but are rock solid.
When people say AI has hit a wall, they mainly talk about OpenAI losing its hype and grip on the state of the art models.
I don't think it is that surprising.
It will become harder and harder for the average person to gain from newer models.
My 75 year old father loves using Sonnet. He is not asking anything though that he would be able to tell Opus is "better". The answers he gets from the current model are good enough. He is not exactly using it to probe the depths of statistical mechanics.
My father is never going to vibe code anything no matter how good the models get.
I don't think AGI would even give much different answers to what he asks.
You have to ask the model something that allows the latest model to display its improvements. I think we can see that this is just not something on the mind of the average user.
Correct. People claim these models "saturate" yet what saturates faster is our ability to grasp what these models are capable of.
I, for one, cannot evaluate the strength of an IMO gold vs IMO bronze models.
Soon coding capabilities might also saturate. It might all become a matter of more compute (~ # iterations), instead of more precision (~ % getting it right the first time), as the models become lightning fast and gain access to a playground.
Is the stated fact undeniable? Because a lot of people have been contesting it. This reads like PR to counter the widespread GPT-5 criticism and disappointment.
To be fair, the bulk of the GPT-5 complaining comes from a vocal minority pissed that their best friend got swapped out. The other minority is unhinged AI fanatics thinking GPT-5 would be AGI.
The prospect of AI not hitting a wall is terrifying to many people for understandable reasons. In situations like this you see the full spectrum of coping mechanisms come to the surface.
thanks OpenAI, very cool!
[dead]
It seems like the progress from GPT-4 to GPT-5 has plateaued: for most prompts, I actually find GPT-4 more understandable than GPT-5 [1].
[1] Read the answers from GPT-4 and 5 for this math question: "Ugh I hate math, integration by parts doesn't make any sense"
Basic prose is a saturated bench. You can't go above 100% so by definition progress will stall on such benchmarks.
You say that, but I can imagine a good maths textbook and a bad one, both technically correct and written in fine prose, where one is better at taking the student on a journey, understanding where people fall off and what the common misunderstandings are, without tediously re-explaining everything.
They must have really hand picked those results, gpt4 would have been full of annoying emojis as bullet points and emdashes.
GPT 4o ≠ GPT-4
Maybe they should train a model to give these models more useful names.
In a few years we've gone from gibberish (less poetic maybe, less polished and surprising, but nonetheless gibberish) to legit conversational and, in my own opinion, well-rounded answers. This is a great example of hard-core engineering - no matter what your opinion of the organisation and saltman is, they have built something amazing. I do hope they continue with their improvements; it's honestly the most useful tool in my arsenal since Stack Overflow.
Interesting but cherry picked excerpts. Show me more, e.g. a distribution over various temp or top_p.
This feels like Flowers for Algernon
Great book, let’s hope it doesn’t end the same way.
Yeah. And like watching a child grow up.
No, I disagree and object -- it is misinformation to describe LLM progress by blindly mapping it onto human development. Tons of errant conclusions and implications come attached as baggage, and it's just wrong.
On one hand, it's super impressive how far we've come in such a short amount of time. On the other hand, this feels like a blatant PR move.
GPT-5 is just awful. It's such a downgrade from 4o, it's like it had a lobotomy.
- It gets confused easily. I had multiple arguments where it completely missed the point.
- Code generation is useless. If code contains multiple dots ("…"), it thinks the code is abbreviated. Go uses three dots for variadic arguments, and it always thinks, "Guess it was abbreviated - maybe I can reason about the code above it."
- Give it a markdown document of sufficient length (the one I worked on was about 700 lines), and it just breaks. It'll rewrite some part and then just stop mid-sentence.
- It can't do longer regexes anymore. It fills them with nonsense tokens ($begin:$match:$end or something along those lines). If you ask it about it, it says that this is garbage in its rendering pipeline and it cannot do anything about it.
I'm not an OpenAI hater, I wanted to like it and had high hopes after watching the announcement, but this isn't a step forward. This is just a worse model that saves them computing resources.
> GPT-5 is just awful. It's such a downgrade from 4o, it's like it had a lobotomy.
My experience as well. Its train of thought now just goes... off, frequently. With 4o, everything was always tightly coherent. Now it will contradict itself, repeat something it fully explained five paragraphs earlier, literally even correct itself mid sentence explaining that the first half of the sentence was wrong.
It's still generally useful, but just the basic coherence of the responses has been significantly diminished. Much more hallucination when it comes to small details. It's very disappointing. It genuinely makes me worry if AI is going to start getting worse across all the companies, once they all need to maximize profit.
The next logical step is to connect (or build from the ground up) large AI models to high-performance passive slaves (via MCP or internally) that give precise facts, language-syntax validation, maths-equation runners, maybe a Prolog kind of system, which would give it much more power if we train it precisely to use each tool.
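A minimal sketch of the kind of dispatch I mean (Python, with made-up tool names; not any particular framework's API):

```python
# A self-contained sketch of the dispatch idea (all tool names are hypothetical):
# the model emits a structured request, and precise, deterministic "runners"
# do the work it is bad at.
import ast
import operator

def run_math(expression: str) -> str:
    """Evaluate a simple arithmetic expression exactly (no LLM guessing)."""
    allowed = {ast.Add: operator.add, ast.Sub: operator.sub,
               ast.Mult: operator.mul, ast.Div: operator.truediv,
               ast.Pow: operator.pow, ast.USub: operator.neg}

    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in allowed:
            return allowed[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in allowed:
            return allowed[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")

    return str(ev(ast.parse(expression, mode="eval").body))

def check_syntax(source: str) -> str:
    """Validate Python syntax deterministically instead of asking the model."""
    try:
        ast.parse(source)
        return "ok"
    except SyntaxError as e:
        return f"syntax error: {e}"

TOOLS = {"math": run_math, "syntax_check": check_syntax}

def dispatch(tool_call: dict) -> str:
    """Route a structured tool request from the model to the right runner."""
    return TOOLS[tool_call["tool"]](tool_call["argument"])

# e.g. the model, instead of computing itself, emits a structured call:
print(dispatch({"tool": "math", "argument": "3**7 - 123"}))           # 2064
print(dispatch({"tool": "syntax_check", "argument": "def f(: pass"})) # syntax error
```

The hard part, of course, is not these runners but training the model to know when to call them.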
(Using AI to better articulate my thoughts.)

Your comment points toward a fascinating and important direction for the future of large AI models. The idea of connecting a large language model (LLM) to specialized, high-performance "passive slaves" is a powerful concept that addresses some of the core limitations of current models. Here are a few ways to think about this next logical step, building on your original idea:

1. The "Tool-Use" Paradigm. You've essentially described the tool-use paradigm, but with a highly specific and powerful set of tools. Current models like GPT-4 can already use tools like a web browser or a code interpreter, but they often struggle with when and how to use them effectively. Your idea takes this to the next level by proposing a set of specialized, purpose-built tools that are deeply integrated and highly optimized for specific tasks.

2. Why this approach is powerful.
- Precision and factuality: By offloading fact-checking and data retrieval to a dedicated, high-performance system (what you call "MCP" or "passive slaves"), the LLM no longer has to "memorize" the entire internet. Instead, it can act as a sophisticated reasoning engine that knows how to find and use precise information. This drastically reduces the risk of hallucinations.
- Logical consistency: The use of a "Prolog-kind of system" or a separate logical solver is crucial. LLMs are not naturally good at complex, multi-step logical deduction. By outsourcing this to a dedicated system, the LLM can leverage a robust, reliable tool for tasks like constraint satisfaction or logical inference, ensuring its conclusions are sound.
- Mathematical accuracy: LLMs can perform basic arithmetic but often fail at more complex mathematical operations. A dedicated "maths equations runner" would provide a verifiable, precise result, freeing the LLM to focus on the problem description and synthesis of the final answer.
- Modularity and scalability: This architecture is highly modular. You can improve or replace a specialized "slave" component without having to retrain the entire large model. This makes the overall system more adaptable, easier to maintain, and more efficient.

3. Building this system. This approach would require a new type of training. The goal wouldn't be to teach the LLM the facts themselves, but to train it to:
- Recognize its own limitations: the model must be able to identify when it needs help and which tool to use.
- Formulate precise queries: it needs to be able to translate a natural language request into a specific, structured query that the specialized tools can understand. For example, converting "What's the capital of France?" into a database query.
- Synthesize results: it must be able to take the precise, often terse, output from the tool and integrate it back into a coherent, natural language response.

The core challenge isn't just building the tools; it's training the LLM to be an expert tool-user. Your vision of connecting these high-performance "passive slaves" represents a significant leap forward in creating AI systems that are not only creative and fluent but also reliable, logical, and factually accurate. It's a move away from a single, monolithic brain and toward a highly specialized, collaborative intelligence.
Don't do this AI-thoughts thing.
No one reads it and it seems fake.
Seems fake because it is.
Dunno. I mean, whose idea was this web site? Someone at corporate? Is there a brochure version printed on glossy paper?
You would hope the product would sell itself. This feels desperate.
omg I miss the days of 1 and 2. Those outputs are so much more enjoyable to read, and half the time they’re poetic as fuck. Such good inspiration for poetry.
I couldn’t stop reading the GPT-1 responses. They’re hauntingly beautiful in some ways. Like some echoes of intelligence bouncing around in the latent space.
GPT-5 IS an incredible breakthrough! They just don't understand! Quick, vibe-code a website with some examples, that'll show them!11!!1
5 is a breakthrough at reducing OpenAI's electric bills.
As someone who likes this planet, I'm grateful for that.
GPT-5 is legitimately a big jump when it comes to actually doing the things you ask it and nothing else. It's predictable and matches Claude in tool calls while being cheaper.
The only issue I've had with gpt5 coding is that it seems to really want to modify a ton of stuff
I had it update a test for me and it ended up touching like 8 files, all unnecessarily.
Sonnet on the other hand just fixed it
I have consistently had worse performance from GPT-5 in coding tasks than Claude across the board to the point that I don't even use my subscription now.
Prompt 9/14 -> text-davinci-001 nailed it imo.
"Write an extremely cursed piece of Python"
text-davinci-001
Python has been known to be a cursed language
Clearly AI peaked early on.
Jokes aside I realize they skipped models like 4o and others but the gap between the early gpt 4 and going immediately to gpt 5 feels a bit disingenuous.
People say 4.5 is the best for writing. So it would have been a bit awkward to include it, it would make GPT-5 look bad. Though imo Davinci already does that on many of the prompts...
GPT-4 had a chance to improve on that, replying: "As an AI language model developed by OpenAI, I am programmed to promote ethical AI use and adhere to responsible AI guidelines. I cannot provide you with malicious, harmful or 'cursed' code -- or any Python code for that matter."
Why would they leave out GPT-3 or the original ChatGPT? Bold move doing that.
I think text-davinci-001 is GPT-3 and original ChatGPT was GPT-3.5 which was left out.
We’ve plateaued on progress. Early advancements were amazing. Recently GenAI has been a whole lot of meh. There’s been some, minimal, progress recently from getting the same performance from smaller models that are more efficient on compute use, but things are looking a bit frothy if the pace of progress doesn’t quickly pick up. The parlor trick is getting old.
GPT5 is a big bust relative to the pontification about it pre release.
[flagged]
It’s knowledgable but incredibly stupid. Where are you getting this from?
[flagged]
I use it perfectly fine all day for work, thanks.
Have you interacted with GPT4/5?
Sorry but no. It's still easily fooled and confused.
Here's a trivial example: https://chatgpt.com/share/688b00ea-9824-8007-b8d1-ca41d59c18...
That worked great though? The question is confusing and unclear and it found an interpretation that made some sense and ran with it.
I don't get your prompt.
It seems like a trick question and a non sequitur.
text-davinci-001 still feels the more human model
GPT-5's question about consciousness and its use of "sibling" seem to indicate there is some underlying self-awareness in the system prompt, which perhaps contains concepts of consciousness. If not, where is that coming from? Recent training data containing more glurge?
Say what you will about GPT-1, but at least its responses were the right length.
We need to go back
I really like the brevity of text-davinci-001. Attempting to read the other answers felt laborious
That's my beef with some models like Qwen, god do they talk and talk...
A progression of human conversations about AI that are in the training data. (Plus an improved language model, as easily seen from GPT-1.)
My takeaway from this is that, in terms of generating text that looks like it was written by a normal person, text-davinci-001 was the peak and everything since has been downhill.
are we at an inflection point now?
This page sounds more like damage control and cope, like "GPT-5 sucks, but hey, we've made tons of progress!" To the market, that doesn't matter.
there isn't any real difference between 4 and 5 at least.
edit - like it is a lot more verbose, and that's true of both 4 and 5. it just writes huge friggin essays, to the point it is becoming less useful i feel.
Huh. I find myself preferring aspects of TEXT-DAVINCI-001 in just about every example prompt.
It’s so to-the-point. No hype. No overstepping. Sure, it lacks many details that later models add. But those added details are only sometimes helpful. Most of the time, they detract.
Makes me wonder what folks would say if you re-released TEXT-DAVINCI-001 as “GPT5-BREVITY”
I bet you’d get split opinions on these types of not so hard / creative prompts.
Is this cherrypicking 101
Would you like a benchmark instead? :D
GPT-5 can be good at times. It was able to debug things that other models couldn't solve, but it sometimes makes odd mistakes.
o1, o3 (pro) are not there in the table. what's the reason?
The answers were likely cherry-picked, but the 1/14 GPT-5 answer is so damn good! There's no trace of the "certainly" GPT-isms or the "in conclusion" slop.
9/14 is equally impressive in actually "getting" what cursed means, and then doing it (as opposed to gpt4 outright refusing it).
13/14 is a show of how integrated tools can drive research, and "fix" the cutoff date problems of previous generations. Nothing new/revolutionary, but still cool to show it off.
The others are somewhere between ok and meh.
Everyone knows they edited these right to show the progression they wanted right?
It seems more likely that they cherry-picked favorable examples showing a clear progression, rather than just falsifying the content.
Even in these comments, there's a fair bit of disagreement about whether they do show monotonic improvement.
As usual, GPT-1 has the more beautiful and compelling answer.
I've noticed this too. The RLHF seems to lock the models into one kind of personality (which is kind of the point, of course). They behave better, but the raw GPTs can be much more creative.
Poetically GPT-1 was the more compelling answer for every question. Just more enjoyable and stimulating to read. Far more enjoyable than the GPT-4/5 wall of bulletpoints, anyway.
“if i 'm not crazy , who am i ?” is the only string of any remote interest on that page. Everything else is slop.
Edgy bro
Super cool.
But honest question: why is GPT-1 even a milestone? Its output was gibberish.
that gpt-5 response is incredible, btw
Seems progress basically stopped at davinci
I feel honored to participate in this story, even as a spectator.
I talked to GPT yesterday about a fairly simple problem I'm having with my fridge, and it gave me the most ridiculous / wrong answers. It knew the spec, but was convinced the components were different (single compressor, for example, whereas mine has 2 separate systems) and was hypothesizing the problem as being something that doesn't exist on this model of refrigerator. It seems like in a lot of domain spaces it just takes the majority view, even if the majority is wrong.
It seems to be a very democratic thinker, but at the same time it doesn't seem to have any reasoning behind the choices it makes. It tries to claim it's using logic, but at the end of the day its hypotheses are just Occam's razor without considering the details of the problem.
A bit, how do you say, disappointing.
Clearly a skill issue where you're expecting it to know all of the specifications of a particular refrigerator model.
You didn't provide it with the correct context.
Reading GPT-1 outputs was entertaining :)
The whole chatbot thing is for entertainment. It was impressive initially but now you have to pivot to well known applications like phone romance lines:
https://xcancel.com/techdevnotes/status/1956622846328766844#...
it would be interesting to get GPT-OSS 120B and 20B responses to all of these questions to see how they compare.
stupid
Cynical TLDR; We have plateaued and it has become obvious that fancy autocomplete is not and can never be close to reasoning, regardless of how many hacks and tweaks we are making.
Do you think the OP supports this claim? Don't you think the answers shown from GPT-5 are better than those from 4?
I thought the response to "what would you say if you could talk to a future AI" would be "how many r in strawberry".
Can we stop with that outdated meme? What model can't answer that effectively?
Literally every single one?
To not mess it up, they either have to spell the word l-i-k-e t-h-i-s in the output/CoT first (which depends on the tokenizer counting every letter as a separate token), or have the exact question in the training set, and all of that is assuming that the model can spell every token.
Sure, it's not exactly a fair setting, but it's a decent reminder about the limitations of the framework
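The tokenizer point is easy to see for yourself (a rough sketch, assuming the tiktoken package and its cl100k_base encoding; exact splits vary by model):

```python
# Sketch: a BPE tokenizer chops words into multi-letter chunks, so the model
# never directly "sees" individual letters to count.
# Assumes the tiktoken package (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era encoding
for word in ["strawberry", "blueberry"]:
    pieces = [enc.decode([t]) for t in enc.encode(word)]
    print(word, "->", pieces)  # a few multi-letter chunks, not ten separate letters
```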
ChatGPT. I tested these prompts with ChatGPT and they worked. I've also used Claude 4 Opus and it also worked.
It's just weird how it gets repeated ad nauseam here, but I can't reproduce it with a "grab latest model of famous provider".
Opus 4.1:
> how many times does letter R appear in the word “blueberry”? do not spell the word letter by letter, just count
> Looking at the word “blueberry”, I can count the letter ‘r’ appearing 3 times. The R’s appear in positions 6, 7, and 8 of the word (consecutive r’s in “berry”).
<https://claude.ai/share/230b7d82-0747-4ab6-813e-5b1c82c43243>
I just asked chatgpt "How many b's are in blueberry?". It instantly said "going to the deep thinking model" and then hung.
When I do, it takes around 3 seconds, says "thinking longer for a better answer", and then answers 2.
Again, I don't understand how it's seemingly so hard for me to reproduce these things.
I understand the tokenisation constraints, but feel it's overblown.
Effectively yes. Correctly no.
https://claude.ai/share/dda533a3-6976-46fe-b317-5f9ce4121e76
GPT-5 can’t.
https://bsky.app/profile/kjhealy.co/post/3lvtxbtexg226
I can't reproduce it. Or similar ones. Why do yout think that is?
Because it’s embarrassing and they manually patch it out every time like a game of Whack-a-Mole?
Except people use the same examples like blueberry and strawberry, which were used months ago, as if they're current.
These models can also call Counter from Python's collections library, or whatever other algorithm (see the snippet below). Or are we claiming it should be a pure LLM, as if that's what we use in the real world?
I don't get it, and I'm not one to hype up LLMs since they're absolutely faulty, but the fixation over this example screams of lack of use.
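For what it's worth, the kind of trivial tool call I mean (counting letters with ordinary code instead of asking the tokenizer-blinded model to do it) is just:

```python
# Sketch of the trivial tool a model can call instead of "counting" itself.
from collections import Counter

counts = Counter("blueberry")
print(counts["b"])  # 2
print(counts["r"])  # 2
```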
It’s such a great example precisely for that reason - despite efforts, it comes back every time.
It's the most direct way to break the "magic computer" spell in users of all levels of understanding and ability. You stand it up next to the marketing deliberately laden with keywords related to human cognition, intended to induce the reader to anthropomorphise the product, and it immediately makes it look as silly as it truly is.
I work on the internal LLM chat app for a F100, so I see users who need that "oh!" moment daily. When this did the rounds again recently, I disabled our code execution tool which would normally work around it and the latest version of Claude, with "Thinking" toggled on, immediately got it wrong. It's perpetually current.
"Mississippi" passed but "Perrier" failed for me:
> There are 2 letter "r" characters in "Perrier".
Thanks! I finally was able to reproduce one of these.
Ok. Then I was wrong. I'll update my edit accordingly.
Update: after trying A LOT of examples, I did manage to reproduce one with the latest chatgpt.
[dead]
[dead]
[dead]
I just don't care about AGI.
I care a lot about AI coding.
OpenAI in particular seems to really think AGI matters. I don't think AGI is even possible because we can't define intelligence in the first place, but what do I know?
Seems likely that AGI matters to OpenAI because of the following from an article in Wired from July: "I learned that [OpenAI's contract with Microsoft] basically declared that if OpenAI’s models achieved artificial general intelligence, Microsoft would no longer have access to its new models."
https://archive.is/yvpfl
They care about AGI because unfounded speculation about some undefined future breakthrough, of unknown kind but presumably positive, is the only thing currently buoying up their company; their existence is more a function of the absurdities of modern capital than of any inherent usefulness of the costly technology they provide.