I'm confused by the language here; it seems "model" means different things.
To me a "model" is a static file containing numbers. In front of that file is an inference engine that receives input from a user, runs it through the "model" and outputs the result. That inference engine is a program (not a static file) that can be generic (can run any number of models of the same format, like llama.cpp) or specific/proprietary. This program usually offers an API. "Wrappers" talk to those APIs and therefore, don't do much (they're neither an inference engine, nor a model) -- their specialty is UI.
But in this post it seems the term "model" covers a kind of full package that goes from LLM to UI, including a specific, dedicated inference engine?
If so, the point of the article would be that, because inference is in the process of being commoditized, the industry is moving to vertical integration so as to protect itself and create unique value propositions.
So to clarify: the important product that people will ultimately want is the model. Obviously you need to design an infra/UI around it but that's not the core product.
The really important distinction is between workflows (what everyone uses in applied LLMs right now) and actual agents. LLM agents can make their own decisions, browse online, use tools, etc. without direct supervision, as they are directly trained for the task. They internalize all the features of LLM orchestration.
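To make the workflow/agent distinction concrete, here is a schematic contrast; both functions are illustrative sketches, and the `llm`, `llm_decide`, `search`, and `tools` callables are assumptions, not any particular framework's API:

```python
# A "workflow": the code decides the steps; the LLM only fills them in.
def workflow_summarize_ticket(llm, search, ticket):
    docs = search(ticket["error"])                              # predefined step 1
    draft = llm(f"Summarize:\n{ticket}\n\nContext:\n{docs}")    # predefined step 2
    return llm(f"Rewrite for the on-call channel:\n{draft}")    # predefined step 3

# An "agent": the model decides which step comes next; the code just executes it.
def agent_summarize_ticket(llm_decide, tools, ticket, max_steps=8):
    history = [f"Task: summarize and triage ticket {ticket['id']}"]
    for _ in range(max_steps):
        action = llm_decide("\n".join(history))  # returns e.g. {"tool": "search", "args": {...}}
        if action["tool"] == "finish":
            return action["args"]["answer"]
        history.append(str(tools[action["tool"]](**action["args"])))
    return "step limit reached"
```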
I find the distinction you draw between weights and a program interesting - partially the idea that one is a “static file” and the other isn’t.
What makes a file non-static (dynamic?) other than +x?
Both are instructions about how to perform a computation. Both require other software/hardware/microcode to run. In general, the stack is tall!
Even so, I do agree that “a bunch of matrices” feels different to “a bunch of instructions” - although arguably the former may be closer in architecture to the greatest computing machine we know (the brain) than the latter.
Arguably the distinction between a .gguf file and a .gguf file with a llama.cpp runner slapped in front of it is negligible. But it does raise an interesting point the article glosses over:
There is a lot happening between a model file sitting on a disk and serving it in an API with attached playground, billing, abuse handling, etc, handling the load of thousands or millions of users calling these incredibly demanding programs. A lot of clever software, good hardware, even down to acquiring buildings and dealing with the order backlog for backup diesel generators.
Improvements in that layer were a large part of what allowed OpenAI to go from the relative obscurity of GPT-3.5 to generating massive hype with a ChatGPT anyone could try at a whim. As a more recent example, x.ai seems to be struggling with that layer a lot right now. Grok 3 is pretty good, but has almost daily partial outages. The 1M context model is promised but never rolls out; instead, on some days the served context size is even less than the usual 64k. And they haven't even started making it available on the API.
All of this will be easy when we reach the point where everyone can run powerful LLMs on their own device, but for now just having a 400B parameter model sitting on your hard drive doesn't get your business very far
Yeah, "static" may not be the correct term, and sure, everything is a file. Yet +x makes a big difference. You can't chmod a list of weights and have it "do" anything.
I wouldn't say it is correct. A model is not just a static file containing numbers. Those weights (numbers) you are talking about are absolutely meaningless without the architecture of the model.
The model is the inference engine, a model which can't do inference isn't a model.
> This was the whole message behind the release of GPT-4.5: capacities are growing linearly while compute costs are on a geometric curve
I don't see where that is coming from. Capacities aren't really measurable in that way. Computers either can do something like PhD-level mathematics research more or less under their own power or they cannot, with a small period of ambiguity as subhuman becomes superhuman. This process seems to me to have been mostly binary, with relatively clear tipping points that separate models that can't do something from models that can. That isn't easily mapped back to any sort of growth curve.
Regardless, we're in the stage of the boom where people are patenting clicking a button to purchase goods and services thinking that might be a tricky idea. It isn't clear yet what parts of the product are easy and standard and what parts are difficult and differentiators. People who talk in vague terms will turn out to be correct and specific predictions will be right or wrong at random. It is hard to stress how young all these practical models are. Stable diffusion was released in 2022, and ChatGPT is younger than that - almost yesterday years old; this stuff is early days magic.
I like this as a narrative framework a lot more than the AGI arc. This seems a lot more realistic and less dramatic, more human-focused.
I like the idea of a model being able to create and maintain a full codebase representing the app layer for model-based tools but in practical terms at work and on personal projects I still just don't see it. To get a model to write even a small-scale frontend only app I still have to make functions so atomic and test them to the point where it feels close to the time it would take to write the app manually. And if I ask a model to write larger functions or don't test them / edit them through 3-5 rounds of re-prompting I just end up with code debt that makes the project unrealistic to continue building out beyond a pretty limited MVP stage without going back line by line and basically rewriting the whole thing.
Anyway I'm no power user, curious what other people's experience is. Maybe I'm just using the wrong models.
I think the extrapolation is that the "back and forth" process is what is being improved quickly now. The trade-off is that you don't get a function back; instead you define a project, and the AI, with a focused model, will understand and handle all the back and forth until the project works and meets the specifications reasonably well. So it really is looking like a vibe coding future.
I think where things get interesting is that obviously lots of businesses and products won't be built this way, but there will be a lot of reasons to shave off sections of a core business to be "vibe-able". So a new level of rapid MVP will be possible where you can spin up completely functional apps multiple times a day, maybe even dynamically generate them. Which leads to more modular app integrations as a default.
Models might be the product, but data is the material that the products are made out of.
I'm starting to think that if you can control your data, you'll have somewhat of an edge. Which I think could lead to people being more protective of their data. Guess we'll move more and more in the direction of premium paid data streams, while making scraping as hard as possible.
At least in the more niche fields, that work with data that isn't very commonplace and out there for everyone to download.
Kind of sucks for the open source crowds, non-profits, etc. that rely on such data streams.
This is a thoughtful article, but I very much disagree with the author's conclusion. (I'm biased though: I'm a co-creator of OpenHands, fka OpenDevin [1])
To be a bit hyperbolic, this is like saying all SaaS companies are just "compute wrappers", and are dead because AWS and GCP can see all their data and do all the same things.
I like to say LLMs are like engines, and we're tasked with building a car. So much goes into crafting a safe, comfortable, efficient end-user experience, and all that sits outside the core competence of companies that are great at training LLMs.
And there are 1000s of different personas, use cases, and workflows to optimize for. This is not a winner-take-all space.
Furthermore, the models themselves are commoditizing quickly. They can be easily swapped out for one another, so apps built on top of LLMs aren't ever beholden to a single model provider.
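A minimal sketch of what that swappability looks like in practice, assuming OpenAI-compatible endpoints (the base URLs, keys, and model names below are illustrative examples):

```python
from openai import OpenAI

def ask(base_url: str, api_key: str, model: str, prompt: str) -> str:
    # Same application code, different provider: only base_url and model change.
    client = OpenAI(base_url=base_url, api_key=api_key)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Hypothetical usage:
# ask("https://api.openai.com/v1", KEY_A, "gpt-4o", "Summarize this ticket ...")
# ask("https://openrouter.ai/api/v1", KEY_B, "anthropic/claude-3.7-sonnet", "Summarize this ticket ...")
```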
I'm super excited to have an ecosystem with thousands of LLM-powered apps. We're already starting to see it materialize, and I'm psyched to be part of it.
Seeing the LLM as an engine was a legitimate view until recently. But what we're starting to see with actual agentification is models taking the driver's seat, making the calls about search, tool use, APIs. Like DeepSearch, these models are likely to be gated, not even API-accessible. It will be even more striking once we move to industry-specific training — one of the best emerging examples is models for network engineering.
The key thing about my post, really: it's about the strategy model providers are going to apply in the next 1-2 years. Even the title comes from an OpenAI slide. Any wrappers will have to operate in this environment.
The only way they’d not be API accessible is surely if they contained some new and extremely difficult to replicate innovation that prevents important capabilities from being commoditised.
What reason or evidence do you see that that is (or will be) the case rather than those features simply representing a temporary lead for some models, which others will all catch up to soon enough?
Yeah, this reminds me of the breathless predictions (and despair, some corners) that flew around shortly after the initial ChatGPT launch. “Oh, they have a lead so vast, no one could ever catch up.” “[Insert X field] is dead.” Et cetera. I didn’t buy it then, and I’m not buying it now.
Of course OpenAI and Anthropic wish they could dominate the application layer. I predicted that two years ago: that model providers would see their technology commoditized, and would turn to using their customers’ data against them to lock them out with competing in-house products. But I don’t think they will succeed, for the reasons rbren mentioned previously. Good application development requires a lot of problem specific knowledge and work, and is not automatable.
On the point of RL — I predict this will generate some more steam to keep the investment/hype machine cranking a little longer. But the vast majority of tasks are not verifiable. The vast majority have soft success criteria or are mixed, and RL will not overcome the fundamental limitations of GenAI.
I have seen this analogy before (hence the question). Apologies if it's rude. By my understanding, while the tools are important, most of the apps hit escape velocity as the underlying models became good enough. You had Cursor doing decently well until Claude Sonnet 3.5 came along, and then it took off. As did Windsurf. Perplexity and Genspark became 10x more effective with o3-mini and DeepSeek R1. Plus the switching costs are so low that people switch to apps with the most advanced model (and the UI is very similar across apps). Do you think there is space for apps which can keep improving without improvements to underlying models?
yeah, but the engine/car analogy breaks down when it turns out all of the automotive engineering and customer driving data is fed back to the engine maker, so they can decide at any point to make your car, or your car but better
> To be a bit hyperbolic, this is like saying all SaaS companies are just "compute wrappers", and are dead because AWS and GCP can see all their data and do all the same things.
isn't "we don't train on your data" one of - if not the - the primary enterprise guarantee one pays for when rolling out LLMs for engineers? i don't see a cloud analogy for that
Not having APIs for models would suck terribly. It would kill tools like Aider or Cline, where I can switch models as I prefer, paying only for the tokens I have used. The only option would be to purchase an overpriced application from a model provider.
I hope the author is wrong and there will still be someone who wants to make money on "selling tokens" rather than end-to-end closed solutions. But indeed, the market would surely seek added value.
It won't matter if they do. You'll be able to use open source models and host them yourself or use the hoster who realizes there's money to be made there.
Humpty is broken just like when Napster happened and there's no putting him back together.
Exactly, as I've said upthread, DeepSeek + Together.ai, Mistral, Meta + Grok...there are too many opensource + infra teams for this "sealed model" strategy to ever work.
> This is also an uncomfortable direction. All investors have been betting on the application layer. In the next stage of AI evolution, the application layer is likely to be the first to be automated and disrupted.
I'm not convinced. We tend to think in terms of problem-products (solutions), for example editing an image => Photoshop, writing a document => Word. I doubt that we are going to move to an "any problem => model" world. That's what ChatGPT is experimenting with via the calendaring/notification features. It breaks the concept that one brand solves one problem. The App Store is a good example: there are millions of apps. I find it really hard to believe that the "apps" can get folded inside the "model" and that the model can be expected to "generate an app tailored" to that problem at that moment; many new problems will emerge.
> Generalist scaling is stalling. This was the whole message behind the release of GPT-4.5: capacities are growing linearly while compute costs are on a geometric curve. Even with all the efficiency gains in training and infrastructure of the past two years, OpenAI can't deploy this giant model with a remotely affordable pricing.
> Inference cost are in free fall. The recent optimizations from DeepSeek means that all the available GPUs could cover a demand of 10k tokens per day from a frontier model for… the entire earth population. There is nowhere this level of demand. The economics of selling tokens does not work anymore for model providers: they have to move higher up in the value chain.
Wouldn't the market find a balance then, where the marginal utility of additional computation is aligned with customer value? That fixed point could potentially be much higher than where we are now in terms of compute.
>Wouldn't the market find a balance then, where the marginal utility of additional computation is aligned with customer value? That fixed point could potentially be much higher than where we are now in terms of compute.
I think the author's point here is that the costs are going to continue to fall for inference at an astonishing rate. We're in a situation where the large frontier companies were all consolidated around "inference is computationally expensive", and then DeepSeek - the talented R&D arm of a hedge fund - was able to cut orders of magnitude out of that cost. To me, that hints that nobody was focusing on inference efficiency. It's unlikely that DeepSeek found 100% of the efficiency gains available, so we can expect the cost of inference to continue to be volatile for some time to come.
It's difficult for any market to find equilibrium when price points move around that much.
I don't think those statements are contradictory at all. Making the thing is getting more expensive, but using it is getting cheaper. Electric cars could be a good analogy here, compared to an ICE, the upfront cost is higher, but once you have it, it's cheaper to use.
That doesn’t make sense though if scaling is actually stalling. The reason so much compute goes into training now is scaling, which keeps base model lifetime short.
The comparison I would make is that it's like the transition from renting a server to using services on the cloud - you used to rent a server of a specific size and do whatever you wanted with it, but then the cloud came and undermined that "Swiss army knife" approach by having you rent specific services directly - database storage, processing, etc.
So AI will be more directed and opinionated, but also much easier to use for common tasks. And the "renting a server" option doesn't go away, just becomes less relevant for anyone in the middle of the bell curve.
I just saw a billboard sign for “deepseek on together.ai” on the 101 a couple days ago, and thought “that’s a very good idea”. As long as there are infra players like together.ai and truly open source crack research teams like DeepSeek and Meta (no one can monopolize research(ers)) I don’t think this “gated model” thesis holds. Which means that OpenAI and Anthropic are competing with every other token output wrapper, and somewhat poorly as they aren’t able to offer competing models where theirs may be lacking. Cursor defines this dilemma for the foundational providers.
Cursor uses both Anthropic and OpenAI and benefits from both. Would <FoundationalProvider> be able to build CursorKiller and be competitive with Cursor without <OtherFoundationalProvider>? Without a multiple of current model performance, which I don't think is a stretch to say is unlikely, I don't think they can.
I am not so sure this makes sense. Training a model to directly use certain tools (web search etc) makes the model very specialized and less flexible. As long as other solutions are more flexible and less costly, training a specialized model would not be worth it.
In simple terms, performing relatively simple RL on various tasks is what gives the models emergent properties, like DeepSeek managed with multi-step reasoning.
The reasoning models and DeepSearch models are essentially of the same class, but applied to different types of tasks.
The underlying assumption then is that these "specialized" models are the next step in the industry, as the general models will get outperformed (maybe).
Will specialized models also hit a usefulness wall like general models do? (I believe so)
And
Will the model’s blindspots hurt more than the value a model creates? (Much more fuzzy and important)
If so, then even many specialized models will be a commodity and the application on top will still be the product end users will care about.
If not, then we’ll finally see the return on all this AI spending. Tho I think first movers would be at a disadvantage since they need much higher ROI to overcome the insane cost spent on training general models.
> In short, what Claude aims to disrupt and replace the current workflows like this basic "agent" system from llama index: [Figure 1] With this: [Figure 2]
In the 2nd figure, I think we have a viable pattern if you consider "Human" to be part of "Environment". Hypothetically, if one of the available functions to the LLM is something like AskUserQuestion(), you can flip the conversation mode around and have the human serve as a helpful agent during the middle of their own request.
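A rough sketch of that flip; the tool name AskUserQuestion and the call format are illustrative, not any specific framework's API:

```python
# Expose a tool that lets the model pause mid-task and ask the human a question,
# effectively making the human one more part of the "environment".
def ask_user_question(question: str) -> str:
    # In a real product this would render a UI prompt; stdin keeps the sketch runnable.
    return input(f"[agent asks] {question}\n> ")

TOOLS = {"AskUserQuestion": ask_user_question}

def execute_tool_call(tool_call: dict) -> str:
    # tool_call is whatever the model emitted, e.g.
    # {"name": "AskUserQuestion", "arguments": {"question": "Which account do you mean?"}}
    return TOOLS[tool_call["name"]](**tool_call["arguments"])
```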
Interesting read. I am curious how you would analyze the current AI hype around MCP (Model Context Protocol) from your perspective. Does it fit into the future you see? It seems like it's going in the complete opposite direction from the future you paint, but perhaps it's just a stepping stone given the constraints of now.
> an agent has to perform the targeted tasks internally: they "dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks".
> What most agent startups are currently building is not agents, it's workflows, that is "systems where LLMs and tools are orchestrated through predefined code paths." Workflows may still bring some value
While this viewpoint will likely prove correct in the long run, we are pretty far away from that. Most value in an Enterprise context over the next 3-5 years will come from embedding AI into existing workflows using orchestration techniques, not from fully autonomous agents doing everything end to end "internally".
> Most value in an Enterprise context over the next 3-5 years will come from embedding AI into existing workflows using orchestration techniques...
But this is already happening and it gives no value whatsoever. Smacking AI onto existing workflows just creates bloat. Is anyone using Apple Intelligence, MS Copilot or some Gmail LLM addons?
Agents don't have to be fully autonomous, they just have to work well with humans.
> This is also an uncomfortable direction. All investors have been betting on the application layer. In the next stage of AI evolution, the application layer is likely to be the first to be automated and disrupted.
This is .. surprisingly good tech and strategy analysis for free content on the internet, thank you.
A couple of thoughts — as you note hard infra / training investment has slowed in the last two years. I don’t think this is surprising, although as you say, it may be a market failure. Instead, I’d say it’s circumstance + pattern recognition + SamA’s success.
We had the bulk of model training fundraising done in the last vestiges of ZIRP, at least from funds raised with ZIRP money, and it was clear from OpenAI’s trajectory and financing that it was going to be EXXXPPPENSIVE. There just aren’t that many companies that will slap down $11bn for training and data center buildout — this is out of the scale of Venture finance by any name or concept.
We then had two eras of strategy assessment: first — infrastructure plays can make monopolies. We got (in the US) two “new firm” trial investments here — OpenAI, and ex-OpenAI Anthropic. We also got at least Google working privately.
Then, we had “there is no moat” as an email come out, along with Stanford's Alpaca (I believe; built on top of Llama) and a surge in interest and knowledge that small datasets pulled out of GPT 3/3.5/(4?) could very efficiently train contender models and small models to start doing tasks.
So, we had a few lucky firms get in while the getting was good for finance, and then we had a spectacularly bad time for new entrants: super high interest rates (comparatively) -> smaller funds -> massive lead by a leader that also weirdly looked like it could be stolen for $5k in API calls -> pattern recognition that our infrastructure period is over for now until there’s some disruption -> no venture finance.
I think we could call out that it’s remarkable, interesting and foresighted that Zuck chose this moment to plow billions into building an open model, and it seems like that may pay off for Meta — it’s a sort of half step ahead of the next gen tech in training know how and iron and a fast follower to Anthropic and OpenAI.
I disagree with your analysis on inference, though. Stepping back a level from the trees of raw tokens available to the forest of “do I have enough inference on what I want inferred at a speed that I want right now?” The answer is absolutely not, by probably two orders of magnitude. With the current rise of using inference to improve training, we’re likely heading into a new era of thinking about how models work and improving them. The end-to-end agent approach you mention is a perfect example. These queries take a long time to generate, often in the ten-minute range, from OpenAI. When they’re under a second, Jevons paradox seems likely to make me want to issue like ten of them to compare / use as a “meta agent”. Combine that with the massive utility of expanded context and the very real scaling problems with expanding attention into the millions-of-tokens range, and we have a ways to go here.
> Generalist scaling is stalling. This was the whole message behind the release of GPT-4.5: capacities are growing linearly while compute costs are on a geometric curve. Even with all the efficiency gains in training and infrastructure of the past two years, OpenAI can't deploy this giant model with a remotely affordable pricing.
Hard disagree.
1. "Capacities are growing linearly while compute costs are on a geometric curve" is the very definition of scaling. GPT4.5 continuing this trend is the opposite of stalling: it's the proof that scaling continues to work
2. "OpenAI can't deploy this giant model with a remotely affordable pricing"
WTF? GPT-4.5 has the same price per token as GPT-4 at release. It seems high compared to other models, but is still dirt cheap compared to human labor. And this model's increased quality means it is the only viable option for some tasks. I have needed proofreading for my book: o1 or o3-mini were not up to the task, but GPT-4.5 really helps. GPT-4.5 is also a leap forward on agentic capabilities. So of course I'll pay for this; it saves me hours by enabling new use-cases
> It's coding model no longer just generating code but managing an entire code base by themselves.
What model is the author talking about? I would pay for that. Is Claude really THIS good, that it could manage the codebase of say, PostgreSQL?
Hi, author here.
An important background is the imminent rise of actual LLM agents I discuss in the next post: https://vintagedata.org/blog/posts/designing-llm-agents
So answering to a few comments:
* The shift is coming relatively soon thanks to the latest RL breakthroughs (I really encourage you to take a look at Will Brown's talk). Anthropic and OpenAI are close to nailing long multi-task sequences on specialized tasks.
* There are stronger incentives to specialize models and gate them. They are especially transformative on the industry side. Right now most of the actual "AI" market is still largely rule-based/ML. Generative AI was not robust enough, but now these systems can get disrupted — not to mention many verticals with a big focus on complex yet formal tasks. I know large network engineering companies are scaling up their own RL capacities right now.
* Open source AI is lagging so far due to the lack of data/frameworks for large-scale RL and task-related data. Though we might see a democratization of verifiers, it will take time.
Several people from big labs reached out since then and confirmed that, despite the obvious uncertainties, this is relatively on point.
Nice and provocative read! Is it fair to restate the argument as follows?
- New tech (e.g. RL, cheaper inference) is enabling agentic interactions that fulfill more of the application layer.
- Foundation model companies realize this and are adapting their business models by building complementary UX and withholding API access to integrated models.
- Application layer value props will be squeezed out, disappointing a big chunk of AI investors and complementary infrastructure providers.
If so, any thoughts on the following?
- If agentic performance is enabled by models specialized through RL (e.g. Deep Research's o3+browsing), why won't we get open versions of these models that application providers can use?
- Incumbent application providers can put up barriers to agentic access of the data they control. How does their data incumbency and vertical specialization weigh against the relative value of agents built by model providers?
Hi. Yes this is wholly correct.
On the second points:
* Well, I'm very much involved in making more open models: pretrained the first model on free and open data without copyright issues, released the first version of GRPO that can run on Google Colab (based on Will Brown's work; a minimal sketch of that kind of setup is below). Yet, even then I have to be realistic: open source RL has a data issue. We don't have the action sequence data nor the recipes (emulators) that could make it possible to replicate even on a very small scale what big labs are currently working on.
* Agreed on this and I'm seeing this dynamic already in a few areas. Now it's still going to be uphill as some of the data can be bought and advanced pipelines can shortcut some of the need for it, as models can be trained directly on simulated environments.
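For readers wondering what "GRPO on a Colab" involves in practice, here is a minimal sketch using the open-source trl library and a toy verifiable reward; the model, dataset, and reward below are placeholders, not the author's actual recipe:

```python
# Toy GRPO run: reward = 1.0 when the ground-truth answer appears in the completion.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

data = Dataset.from_dict({
    "prompt": ["What is 17 + 25? Answer with just the number.",
               "What is 9 * 8? Answer with just the number."],
    "answer": ["42", "72"],
})

def exact_match_reward(completions, answer, **kwargs):
    return [1.0 if truth in completion else 0.0
            for completion, truth in zip(completions, answer)]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",          # placeholder small model
    reward_funcs=exact_match_reward,
    args=GRPOConfig(output_dir="grpo-demo", num_generations=4),
    train_dataset=data,
)
trainer.train()
```

The hard part, per the comment above, is not this scaffolding but getting realistic action-sequence data and emulators to score against.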
Thanks for the reply - and for the open AI work!
> We don't have the action sequence data nor the recipes (emulators) that could make it possible to replicate even on a very small scale what big labs are currently working on.
Sounds like an interesting opportunity for application-layer incumbents that want to enable OSS model advancement...
answering the first question if i understand it correctly.
The missing piece is data, obviously. With search and code, it's easier to get the data, so you get such specialized products. What is likely to happen is:
1/ Many large companies work with some early design partners to develop solutions. They have the data + subject matter expertise, and the design partners bring in the skill. This way we see a new wave of RL agent startups grow. My guess is that this engagement would look different compared to a typical SaaS engagement. Some companies might do it in-house, some won't because maintaining such systems is a task.
2/ These companies open source part of their dataset, which can be consumed by OSS devs to create better agents. This is more common in tech, where a path to monopoly is to commoditize the immediately previous layer. Might play out elsewhere too, though I do not have a high degree of confidence here.
Why will application layer value props be squeezed out? And if so, where does value accrue going forward in an RL first world?
Is this the Will Brown talk you are referencing? https://www.youtube.com/watch?v=JIsgyk0Paic
Thanks for linking, yes that is the one he talks about on his blog also.
Hi, interesting article.
Since I am not in the AI industry, I think I do not understand few things:
- what is RL? Research Language?
- does it mean that in essence AI companies will switch to writing enterprise software using LLMs integrated with enterprise tools?
[EDIT] Seems like you can't even ask a question on HN because "how dare you not know something?" and you're gonna be downvoted.
Hi. So quickly:
* RL is Reinforcement Learning. Already used for a while as part of RLHF, but now we have started to find a very nice combo of reasoning+RL on verifiable tasks. Core idea is that models are not just good at predicting the next token but the next right answer.
* I think anything infra with already some ML bundled is especially up for grabs but this will have a more transformative impact than your usual SaaS. Network engineering is a good example: highly formalized but also highly complex. RL models could increasingly nail that.
Respectfully, when you’re responding to someone who doesn't know what RL is, and you say “it’s this—already used in [another even lesser known acronym that includes the original]…” it doesn’t really help the asker (if you know what RLHF is, then you already know what RL is). I’ll admit I knew what RL was already, but I don’t know what RLHF is and the comment just confuses me.
What is RLHF?
Am I the only one who uses a search engine while reading comment threads about industries/technologies I am not familiar with? This whole conversation is like two searches away from explaining everything (or a two minute conversation with an LLM I suppose)
Am I the only one who uses a search engine while reading comment threads about industries/technologies I am not familiar with?
No. And yet... it's considered a Good Practice to expand acronyms on first use, and generally do things to reduce the friction for your audience to understand what you're writing.
> and generally do things to reduce the friction for your audience to understand what you're writing
Sure, if you're writing a blogpost titled "Architecture for Chefs" then yes, write with that audience in mind.
But we're a mix-match of folks here, from all different walks of life. Requiring that everyone should expand all acronyms others possibly might not understand, would just be a waste of time.
If I see two cooks discussing knives with terms I don't understand, is it really their responsibility to make sure I understand it, although I'm just a passive observer and I possess the skill to look up things myself?
>But we're a mix-match of folks here, from all different walks of life. Requiring that everyone should expand all acronyms others possibly might not understand, would just be a waste of time.
Exactly!
Why would I waste 5 seconds of my own time, when I could waste 5 seconds of a dozen to hundreds of people's time?
My time is much better spent in meta-discussions, informing people that writing out a word one single time instead of typing up the acronym is too much.
That makes for poor communication by increasing the friction to read someone's thoughts.
As an author, you should care about reducing friction and decreasing the cost to the reader.
Some onus is on the reader to educate themselves, particular on Hacker News.
Yes, I searched RLHF and figured it out. But this was an especially “good” example of poor communication. I assume the author isn’t being deliberately obtuse and appreciates the feedback.
This sounds impossible but I would guess RLHF is actually a better known acronym than RL. It became fairly popularly known among tech folks with no AI experience when ChatGPT came out.
Thanks. And what about some more user-focused tasks? E.g., I have a small but fairly profitable company that writes specialized software for accountants. Usually it is pretty complex: tax law tends to change very often, there are myriads of rules, exemptions, etc. Could this be solved with ML? How long till we get there, if at all? How costly would this be? Disclaimer: I do not write such software. This is just an example.
> So what is happening right now is just a lot of denial. The honeymoon period between model providers and wrappers is over.
> In short the dilemma for most successful wrappers is simple: training or being trained on. What they are doing right now is both free market research for the big labs but, even, as all outputs is ultimately generated through model providers, free data design and generation.
This is a great observation on the current situation. Over the past few years, there's been a proliferation of AI wrappers in the SaaS space; however, because they use proprietary models, they become entirely dependent on the model providers to continue to offer their solution, there's little to no barrier to entry to create a competing product, and they're providing free training data to the model providers. Instead, as the article suggests, SaaS builders should look into open source models (from places like GitHub, HuggingFace, or paperswithcode.com) or consider researching and training their own custom models if they want to offer long-term services to their users.
The argument appears to be "last-mile" specialisation of AI models by massive-compute-companies will be entirely proprietary, and walled-off to prevent competitor extraction of data. And these scenario/domain/task specific models will be the product sold by these companies.
This is plausible insofar as one can find a reason to suppose compute costs for this specialisation will remain very high, and the hard work of producing relevant data will be done best by those same companies.
I think it's equally plausible that compute will come down enough, and innovations in "post-training re-training" will occur, that you'll be able to bring this in-house within the enterprise/org. I.e., that "ML/AI Engineer" teams will arise like SEng teams.
Or that there's a limit to statistical modelling over historical cases, which means specialisation is so exponentially demanding on historical case data production that it cannot practically occur in the places which would most benefit from it.
I think the latter is what will prevent the mega players in AI atm making "the model the product" -- at the level they can specialise (ie., given the amount of data needed), so can everyone else.
Perhaps these companies will transition into something SaaS-like, AI-Model-Specialisation-As-A-Service (ASS ASS) -- where they create bespoke models for orgs which can afford it.
> AI-Model-Specialisation-As-A-Service (ASS ASS) -- where they create bespoke models for orgs which can afford it.
I think you are on to something here - and this may very well be what these rumored $20k/mon specialized AI "agents" end up being. https://techcrunch.com/2025/03/05/openai-reportedly-plans-to...
We did some research in this regard: https://arxiv.org/abs/2409.17171
The idea is to create bespoke models for orgs at 90% lower compute (we cheat a little: we use an underlying open source model and freeze the existing knowledge). Currently building a specialized model + agent for bioresearch labs. We hope to bring down the costs in the long term so that these evolve into continuous learning systems that can be updated every day. The idea is exactly this: model customization + infra gives you advantages prompting + tooling cannot.
> Inference cost are in free fall. The recent optimizations from DeepSeek means that all the available GPUs could cover a demand of 10k tokens per day from a frontier model for… the entire earth population. There is nowhere this level of demand. The economics of selling tokens does not work anymore for model providers: they have to move higher up in the value chain.
I've been using Cline so I can understand the pricing of these models and it's insane how much goes into input context + output. My most recent query on openrouter.ai was 26,098 input tokens -> 147 output tokens. I'm easily burning multiple dollars an hour. Without a doubt there is still demand for cheaper inference.
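Back-of-the-envelope on that query, assuming roughly $3 per million input tokens and $15 per million output tokens (approximate Claude 3.7 Sonnet list pricing; adjust for whichever model you actually route to):

```python
input_tokens, output_tokens = 26_098, 147
cost = input_tokens * 3 / 1e6 + output_tokens * 15 / 1e6
print(f"~${cost:.3f} per request")   # ~$0.080
# A coding agent fires dozens of such calls an hour, so "multiple dollars
# an hour" is entirely consistent with these per-call prices.
```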
> I've read a lot of misunderstandings about DeepResearch, which isn't helped by the multiplication of open and closed clones. OpenAI has not built a wrapper on top of O3.
It also doesn't help that they let you select a model varying from 4o-mini to o1-pro for the Deep Research task. But this confirms my suspicion that model selection is irrelevant for the Deep Research tasks and answering follow-up questions.
> Weirdly enough, while Claude 3.7 works perfectly in Claude Code, Cursor struggles with it and I've already seen several high end users cancelling their subscriptions as a result.
It's because Claude Code burns through tokens like there's no tomorrow, meanwhile Cursor attempts to carefully manage token usage and limit what's in context to remain profitable. It's gotten so bad that for any moderately complex task I switch to o1-pro, or sonnet-3.7 in the Anthropic Console with the thinking tokens maxed out. They just released a "MAX" option but I can still tell it's nerfed because it thinks for a few seconds, whereas I can get up to 2 minutes of thinking via the Anthropic Console.
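For reference, a minimal sketch of requesting a large thinking budget directly against the Anthropic API; the model ID and budget values below are illustrative:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=20_000,                                       # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 16_000},   # illustrative budget
    messages=[{"role": "user", "content": "Refactor this module: ..."}],
)
print(response.content[-1].text)  # the final answer follows the thinking block
```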
It's abundantly clear that all these model providers are trying to pivot hard into productizing, which is ironic considering that the UX of all these model-as-a-product companies is so universally terrible. Deep Research is a major win, but OpenAI has plenty of fails: Plugins, Custom GPTs, Sora, Search (obsolete now?), Operator are maybe just okay for casual users - not at all a "product".
Anecdotally I noticed this in aider with 3.7 as well; the responses coming back from Claude 3.7 are wayyy more tokens than 3.5(+), and the model is a little less responsive to aider’s prompts. Upshot - it benchmarks better, but is frustrating to use and slower.
Using claude code, it’s clear that Anthropic knows how to get the best out of their model — and, the output spewing is hidden in the interface. I am now using both, depending on task.
When did people ever believe that model selection mattered when using Deep Research? The UI may be bad, but it was obvious from day one that it followed its own workflow.
Search within ChatGPT is far from obsolete. 4o + Search remains a significant advantage in both time and cost when handling real-time, single-step queries—e.g., What is the capital of Texas?
If you have not been reading every OpenAI blog post, you can't be blamed for thinking the model picker affects Deep Research, since the UI heavily implies that.
Hmmm i noticed it after two deep research tasks. No doubt bad UI but surprising folks here were confused for that long.
Single-step queries are far better handled by Kagi/Google search when you care about source quality, discovery, and good UX; for anything above that, it's worth letting Deep Research do its thing in the background. I would go so far as to say that using Search with 4o you risk getting worse results than just asking the LLM directly - or at least that's been my experience.
YMMV as always but I get the same exact answers between Kagi's Quick Answer and ChatGPT Search, this includes sources.
I don't disagree, but I think we need to be careful of our vocabulary around the word "model." People are starting to use it to refer to the whole "AI system", rather than the actual transformer model.
This article is talking about models that have been trained specifically for workflow orchestration and tool use. And that is an important development.
But the fundamental architectural pattern isn't different: You run the model in some kind of harness that recognizes tool use invocations, calls to the external tool/rag/codegen/whatever, then feeds the results back into the context window for additional processing.
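To make that pattern concrete, here is a minimal sketch of such a harness loop. Everything here is a made-up stand-in (call_model, the tool registry, the message format), not any provider's actual API; it just illustrates the loop: the model emits a tool invocation, the harness executes it, and the result is fed back into the context window.

```python
# Minimal sketch of a tool-use harness; names are illustrative only.
TOOLS = {
    "web_search": lambda query: f"<search results for: {query}>",
}

def call_model(messages):
    # Placeholder for the actual inference call (local weights or a hosted API).
    # Here it pretends to request a search once, then produces a final answer.
    if len(messages) == 1:
        return {"tool": "web_search", "args": {"query": messages[0]["content"]}}
    return {"answer": "final answer synthesized from the tool results"}

def run_agent(user_request, max_steps=5):
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "tool" in reply:
            # The harness, not the model, actually executes the tool...
            result = TOOLS[reply["tool"]](**reply["args"])
            # ...and feeds the result back into the context window.
            messages.append({"role": "tool", "content": result})
        else:
            return reply["answer"]
    return "step limit reached"

print(run_agent("What changed in the latest Claude release?"))
```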
Architecturally speaking, the harness is a separate thing from the language model. A model can be trained to use Anthropic's MCP, for example, but the capabilities of MCP are not "part" of the model.
A concrete example: A model can't read a webpage without a tool, just like a human can't read a webpage without a computer and web browser.
I just feel like it's important to make a logical distinction between a model and the agentic system using that model. Innovation in both areas is going to proceed along related but different paths.
While I appreciate the distinction you're pointing out, I disagree with your conclusion that the agentic system and its environment will remain separate going forward. There are strong incentives to merge the external environment more closely with the model's environment. I can imagine a future where GPUs have a direct network interface and an OS-like engine that lets them interoperate with the external environment more directly.
It seems like a natural line of progress as RL is becoming mainstream for language models; if you can build the verifier into the GPU itself, you can drastically speed up training runs and decrease inference costs.
> Inference cost are in free fall. The recent optimizations from DeepSeek means that all the available GPUs could cover a demand of 10k tokens per day from a frontier model for… the entire earth population. There is nowhere this level of demand. The economics of selling tokens does not work anymore for model providers: they have to move higher up in the value chain.
Even before DeepSeek, prices were declining by about 90% per year at constant performance. I think the economics should be viewed differently: treat it like any other industry on a learning curve, such as chips, batteries, or solar panels. The price in these industries keeps falling each year. The winners are the companies that can keep scaling up their production. Think TSMC, for example. Nobody can produce high-quality chips at a lower price than TSMC, due to economies of scale: one PhD at the company can spend four years optimizing a tiny part of the process, and it's worth it, because if that makes the process 0.001% cheaper to run, the PhD pays for itself at TSMC's scale.
So the economics of selling tokens does work. The question is who can keep scaling up long enough that the rest (have to) give up.
I'm confused by the language here; it seems "model" means different things.
To me a "model" is a static file containing numbers. In front of that file is an inference engine that receives input from a user, runs it through the "model" and outputs the result. That inference engine is a program (not a static file) that can be generic (can run any number of models of the same format, like llama.cpp) or specific/proprietary. This program usually offers an API. "Wrappers" talk to those APIs and therefore, don't do much (they're neither an inference engine, nor a model) -- their specialty is UI.
But in this post it seems the term "model" covers a kind of full package that goes from LLM to UI, including a specific, dedicated inference engine?
If so, the point of the article would be that, because inference is in the process of being commoditized, the industry is moving to vertical integration so as to protect itself and create unique value propositions.
Is this interpretation correct?
So to clarify: the important product that people will ultimately want is the model. Obviously you need to design an infra/UI around it but that's not the core product.
The really important distinction is between workflows (what everyone uses in applied LLM work right now) and actual agents. LLM agents can make their own decisions, browse online, use tools, etc. without direct supervision, because they are directly trained for the task. They internalize all the features of LLM orchestration.
The expression ultimately comes from a 2023 OpenAI slide https://pbs.twimg.com/media/Gly1v0zXIAAGJFz?format=jpg&name=... — so in a way it's a long-held vision in the big labs, just getting more acute now.
I find the distinction you draw between weights and a program interesting - particularly the idea that one is a "static file" and the other isn't.
What makes a file non-static (dynamic?) other than +x?
Both are instructions about how to perform a computation. Both require other software/hardware/microcode to run. In general, the stack is tall!
Even so, I do agree that “a bunch of matrices” feels different to “a bunch of instructions” - although arguably the former may be closer in architecture to the greatest computing machine we know (the brain) than the latter.
</armchair>
Arguably the distinction between a .gguf file and a .gguf file with a llama.cpp runner slapped in front of it is negligible. But it does raise an interesting point the article glosses over:
There is a lot happening between a model file sitting on a disk and serving it through an API with an attached playground, billing, abuse handling, etc., while handling the load of thousands or millions of users calling these incredibly demanding programs. A lot of clever software, good hardware, even down to acquiring buildings and dealing with the order backlog for backup diesel generators.
Improvements in that layer were a large part of what allowed OpenAI to go from the relative obscurity of GPT-3.5 to generating massive hype with a ChatGPT anyone could try on a whim. As a more recent example, x.ai seems to be struggling with that layer a lot right now. Grok 3 is pretty good, but has almost daily partial outages. The 1M context model is promised but never rolls out; instead, on some days the served context size is even less than the usual 64k. And they haven't even started making it available on the API.
All of this will be easy when we reach the point where everyone can run powerful LLMs on their own device, but for now just having a 400B parameter model sitting on your hard drive doesn't get your business very far.
Yeah, "static" may not be the correct term, and sure, everything is a file. Yet +x makes a big difference. You can't chmod a list of weights and have it "do" anything.
I wouldn't say it is correct. A model is not just a static file containing numbers. Those weights (numbers) you are talking about are absolutely meaningless without the architecture of the model.
The model is the inference engine, a model which can't do inference isn't a model.
> This was the whole message behind the release of GPT-4.5: capacities are growing linearly while compute costs are on a geometric curve
I don't see where that is coming from. Capacities aren't really measurable in that way. Computers either can do something like PhD-level mathematics research more or less under their own power or they cannot, with a small period of ambiguity as subhuman becomes superhuman. This process seems to me to have been mostly binary, with relatively clear tipping points that separate models that can't do something from models that can. That isn't easily mapped back to any sort of growth curve.
Regardless, we're in the stage of the boom where people are patenting clicking a button to purchase goods and services, thinking that might be a tricky idea. It isn't clear yet which parts of the product are easy and standard and which parts are difficult differentiators. People who talk in vague terms will turn out to be correct, and specific predictions will be right or wrong at random. It is hard to overstate how young all these practical models are. Stable Diffusion was released in 2022, and ChatGPT is younger than that - almost yesterday years old; this stuff is early-days magic.
Models could easily turn out to be a commodity.
I like this as a narrative framework a lot more than the AGI arc. This seems a lot more realistic and less dramatic, more human-focused.
I like the idea of a model being able to create and maintain a full codebase representing the app layer for model-based tools, but in practical terms, at work and on personal projects, I still just don't see it. To get a model to write even a small-scale frontend-only app, I still have to make functions so atomic, and test them so thoroughly, that it feels close to the time it would take to write the app manually. And if I ask a model to write larger functions, or don't test and edit them through 3-5 rounds of re-prompting, I just end up with code debt that makes the project unrealistic to continue building beyond a fairly limited MVP stage without going back line by line and basically rewriting the whole thing.
Anyway I'm no power user, curious what other people's experience is. Maybe I'm just using the wrong models.
I think the extrapolation is that the "back and forth" process is what is being improved quickly now. The trade-off is that you don't get a function back; you define a project, and the AI, with a focused model, will understand and handle all the back and forth until the project works and meets the specifications reasonably well. So it really is looking like a vibe coding future.
I think where things get interesting is that obviously lots of businesses and products won't be built this way, but there will be a lot of reasons to shave off sections of a core business to be "vibe-able". So a new level of rapid MVP will be possible where you can spin up completely functional apps multiple times a day, maybe even dynamically generate them. Which leads to more modular app integrations as a default.
Models might be the product, but data is the material that the products are made out of.
I'm starting to think that if you can control your data, you'll have somewhat of an edge. Which I think could lead to people being more protective of their data. Guess we'll move more and more in the direction of premium paid data streams, while making scraping as hard as possible.
At least in the more niche fields, that work with data that isn't very commonplace and out there for everyone to download.
Kind of sucks for the open source crowds, non-profits, etc. that rely on such data streams.
This is a thoughtful article, but I very much disagree with the author's conclusion. (I'm biased though: I'm a co-creator of OpenHands, fka OpenDevin [1])
To be a bit hyperbolic, this is like saying all SaaS companies are just "compute wrappers", and are dead because AWS and GCP can see all their data and do all the same things.
I like to say LLMs are like engines, and we're tasked with building a car. So much goes into crafting a safe, comfortable, efficient end-user experience, and all that sits outside the core competence of companies that are great at training LLMs.
And there are 1000s of different personas, use cases, and workflows to optimize for. This is not a winner-take-all space.
Furthermore, the models themselves are commoditizing quickly. They can be easily swapped out for one another, so apps built on top of LLMs aren't ever beholden to a single model provider.
I'm super excited to have an ecosystem with thousands of LLM-powered apps. We're already starting to see it materialize, and I'm psyched to be part of it.
[1] https://github.com/All-Hands-AI/OpenHands
Seeing the LLM as an engine was a legitimate view until recently. But what we're starting to see with actual agentification is models taking the driver's seat, making the calls about search, tool use, APIs. Like DeepSearch, these models are likely to be gated, not even API-accessible. It will be even more striking once we move to industry-specific training; one of the best emerging examples is models for network engineering.
The key thing about my post: it's about the strategy model providers are going to apply in the next 1-2 years. Even the title comes from an OpenAI slide. Any wrapper will have to operate in this environment.
The only way they’d not be API accessible is surely if they contained some new and extremely difficult to replicate innovation that prevents important capabilities from being commoditised.
What reason or evidence do you see that that is (or will be) the case rather than those features simply representing a temporary lead for some models, which others will all catch up to soon enough?
Yeah, this reminds me of the breathless predictions (and despair, some corners) that flew around shortly after the initial ChatGPT launch. “Oh, they have a lead so vast, no one could ever catch up.” “[Insert X field] is dead.” Et cetera. I didn’t buy it then, and I’m not buying it now.
Of course OpenAI and Anthropic wish they could dominate the application layer. I predicted that two years ago: that model providers would see their technology commoditized, and would turn to using their customers’ data against them to lock them out with competing in-house products. But I don’t think they will succeed, for the reasons rbren mentioned previously. Good application development requires a lot of problem specific knowledge and work, and is not automatable.
On the point of RL — I predict this will generate some more steam to keep the investment/hype machine cranking a little longer. But the vast majority of tasks are not verifiable. Most have soft or mixed success criteria, and RL will not overcome the fundamental limitations of GenAI.
I have seen this analogy before (hence the question). Apologies if it's rude. By my understanding, while the tools are important, most of the apps hit escape velocity as the underlying models became good enough. You had Cursor doing decently well until Claude Sonnet 3.5 came along, and then it took off. As did Windsurf. Perplexity and Genspark became 10x more effective with o3-mini and DeepSeek R1. Plus the switching costs are so low that people switch to the apps with the most advanced model (and the UI is very similar across apps). Do you think there is space for apps that can keep improving without improvements to the underlying models?
Yeah, but the engine/car analogy breaks down when it turns out all of the automotive engineering and customer driving data is fed back to the engine maker, so they can decide at any point to make your car, or your car but better.
> To be a bit hyperbolic, this is like saying all SaaS companies are just "compute wrappers", and are dead because AWS and GCP can see all their data and do all the same things.
isn't "we don't train on your data" one of - if not the - the primary enterprise guarantee one pays for when rolling out LLMs for engineers? i don't see a cloud analogy for that
Not having APIs for models would suck terribly. It would kill tools like Aider or Cline, where I can switch models as I prefer, paying only for the tokens I have used. The only option would be to purchase an overpriced application from a model provider.
I hope the author is wrong and that there will still be someone who wants to make money "selling tokens" rather than end-to-end closed solutions. But indeed, the market would surely seek added value.
It won't matter if they do. You'll be able to use open source models and host them yourself, or use a host who realizes there's money to be made there.
Humpty is broken just like when Napster happened and there's no putting him back together.
Exactly. As I've said upthread, DeepSeek + Together.ai, Mistral, Meta + Grok... there are too many open source + infra teams for this "sealed model" strategy to ever work.
> This is also an uncomfortable direction. All investors have been betting on the application layer. In the next stage of AI evolution, the application layer is likely to be the first to be automated and disrupted.
I'm not convinced. We tend to think in terms of problem-product (solution) pairs, for example editing an image => Photoshop, writing a document => Word. I doubt that we are going to move to "any problem => the model". That's what ChatGPT is experimenting with in calendaring/notifications. It breaks the concept that one brand solves one problem. The App Store is a good example: there are millions of apps. I find it really hard to believe that the "apps" can get inside the "model" and that the model will "generate an app tailored" to that problem at that moment; many new problems will emerge.
These two trends seem somewhat contradictory:
> Generalist scaling is stalling. This was the whole message behind the release of GPT-4.5: capacities are growing linearly while compute costs are on a geometric curve. Even with all the efficiency gains in training and infrastructure of the past two years, OpenAI can't deploy this giant model with a remotely affordable pricing.
> Inference cost are in free fall. The recent optimizations from DeepSeek means that all the available GPUs could cover a demand of 10k tokens per day from a frontier model for… the entire earth population. There is nowhere this level of demand. The economics of selling tokens does not work anymore for model providers: they have to move higher up in the value chain.
Wouldn’t the market find a balance then, where the marginal utility of additional computation is aligned with customer value? That fixed point could potentially be much higher than where we are now in terms of compute.
>Wouldn’t the market find a balance then, where the marginal utility of additional computation is aligned with customer value? That fixed point could potentially be much higher than where we are now in terms of compute.
I think the author's point here is that the costs are going to continue to fall for inference at an astonishing rate. We're in a situation where the large frontier companies were all consolidated around "inference is computationally expensive", and then DeepSeek - the talented R&D arm of a hedge fund - was able to cut orders of magnitude out of that cost. To me, that hints that nobody was focusing on inference efficiency. It's unlikely that DeepSeek found 100% of the efficiency gains available, so we can expect the cost of inference to continue to be volatile for some time to come.
It's difficult for any market to find equilibrium when price points move around that much.
I don't think those statements are contradictory at all. Making the thing is getting more expensive, but using it is getting cheaper. Electric cars could be a good analogy here: compared to an ICE, the upfront cost is higher, but once you have it, it's cheaper to use.
That doesn’t make sense though if scaling is actually stalling. The reason so much compute goes into training now is scaling, which keeps base model lifetime short.
The comparison I would make is that it's like the transition from renting a server to using services in the cloud - you used to rent a server of a specific size and do whatever you wanted with it, but then the cloud came and undermined that "Swiss army knife" approach by having you rent specific services directly - database storage, processing, etc.
So AI will be more directed and opinionated, but also much easier to use for common tasks. And the "renting a server" option doesn't go away, just becomes less relevant for anyone in the middle of the bell curve.
I just saw a billboard sign for “deepseek on together.ai” on the 101 a couple days ago, and thought “that’s a very good idea”. As long as there are infra players like together.ai and truly open source crack research teams like DeepSeek and Meta (no one can monopolize research(ers)) I don’t think this “gated model” thesis holds. Which means that OpenAI and Anthropic are competing with every other token output wrapper, and somewhat poorly as they aren’t able to offer competing models where theirs may be lacking. Cursor defines this dilemma for the foundational providers.
I think I understand this comment but can you expand on how cursor defines this dilemma?
Cursor uses both Anthropic and OpenAI and benefits from both. Would <FoundationalProvider> be able to build CursorKiller and be competitive with Cursor without <OtherFoundationalProvider>? Without a multiple of current model performance, which I don't think it's a stretch to say is unlikely, I don't think they can.
I agree - and also the fact that Cursor even exists while Microsoft had already built Copilot.
I am not so sure this makes sense. Training a model to directly use certain tools (web search etc) makes the model very specialized and less flexible. As long as other solutions are more flexible and less costly, training a specialized model would not be worth it.
This is an incredibly valuable analysis.
In simple terms, performing relatively simple RL on various tasks is what gives the models emergent properties, like DeepSeek managed with multi-step reasoning.
The reasoning models and DeepSearch models are essentially the same class of model, just applied to different types of tasks.
The underlying assumption, then, is that these "specialized" models are the next step for the industry, as the general models get outperformed (maybe).
The real questions wrt the title are:
Will specialized models also hit a usefulness wall like general models do? (I believe so)
And
Will the model’s blindspots hurt more than the value a model creates? (Much more fuzzy and important)
If so, then even many specialized models will be a commodity and the application on top will still be the product end users will care about.
If not, then we’ll finally see the return on all this AI spending. Though I think first movers would be at a disadvantage, since they need a much higher ROI to overcome the insane cost already spent on training general models.
> In short, what Claude aims to disrupt and replace the current workflows like this basic "agent" system from llama index: [Figure 1] With this: [Figure 2]
In the 2nd figure, I think we have a viable pattern if you consider "Human" to be part of "Environment". Hypothetically, if one of the available functions to the LLM is something like AskUserQuestion(), you can flip the conversation mode around and have the human serve as a helpful agent during the middle of their own request.
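A toy sketch of that idea (every name here is invented, AskUserQuestion included; it just shows the human exposed as one more tool the model can call):

```python
# The human becomes just another tool the agent can invoke mid-task.
def ask_user_question(question: str) -> str:
    # Route the model's question back to the person who made the request.
    return input(f"[agent asks] {question}\n> ")

TOOLS = {
    "AskUserQuestion": ask_user_question,
    "ReadFile": lambda path: open(path).read(),
}

def dispatch(tool_name: str, **kwargs) -> str:
    # Called by the harness whenever the model emits a tool invocation, e.g.
    # {"tool": "AskUserQuestion", "args": {"question": "Staging or prod?"}}
    return TOOLS[tool_name](**kwargs)
```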
Interesting read. I am curious how you would analyze the current AI hype around MCP (Model Context Protocol) from your perspective. Does it fit into the future you see? It seems like it's going in the complete opposite direction from the future you paint, but perhaps it's just a stepping stone given current constraints.
> an agent has to perform the targeted tasks internally: they "dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks".
> What most agent startups are currently building is not agents, it's workflows, that is "systems where LLMs and tools are orchestrated through predefined code paths." Workflows may still bring some value
While this viewpoint will likely prove correct in the long run, we are pretty far away from that. Most value in an Enterprise context over the next 3-5 years will come from embedding AI into existing workflows using orchestration techniques, not from fully autonomous agents doing everything end to end "internally".
> Most value in an Enterprise context over the next 3-5 years will come from embedding AI into existing workflows using orchestration techniques...
But this is already happening and it delivers no value whatsoever. Smacking AI onto existing workflows just creates bloat. Is anyone actually using Apple Intelligence, MS Copilot, or some Gmail LLM add-ons?
Agents don't have to be fully autonomous, they just have to work well with humans.
> This is also an uncomfortable direction. All investors have been betting on the application layer. In the next stage of AI evolution, the application layer is likely to be the first to be automated and disrupted.
Highly agree with the sentiments expressed in this post, I wrote about something similar in my blog post on "Artificial General Software": https://www.markfayngersh.com/posts/artificial-general-softw...
This is... surprisingly good tech and strategy analysis for free content on the internet. Thank you.
A couple of thoughts — as you note hard infra / training investment has slowed in the last two years. I don’t think this is surprising, although as you say, it may be a market failure. Instead, I’d say it’s circumstance + pattern recognition + SamA’s success.
We had the bulk of model training fundraising done in the last vestiges of ZIRP, at least from funds raised with ZIRP money, and it was clear from OpenAI’s trajectory and financing that it was going to be EXXXPPPENSIVE. There just aren’t that many companies that will slap down $11bn for training and data center buildout — this is out of the scale of Venture finance by any name or concept.
We then had two eras of strategy assessment: first — infrastructure plays can make monopolies. We got (in the US) two “new firm” trial investments here — OpenAI, and ex-OpenAI Anthropic. We also got at least Google working privately.
Then, we had “there is no moat” come out as an email, along with Stanford’s (I believe Alpaca, which was built on the first LLaMA) and a surge in interest and knowledge that small datasets pulled out of GPT 3/3.5/(4?) could very efficiently train contender models and small models to start doing tasks.
So, we had a few lucky firms get in while the getting was good for finance, and then we had a spectacularly bad time for new entrants: super high interest rates (comparatively) -> smaller funds -> massive lead by a leader that also weirdly looked like it could be stolen for $5k in API calls -> pattern recognition that our infrastructure period is over for now until there’s some disruption -> no venture finance.
I think we could call out that it’s remarkable, interesting and foresighted that Zuck chose this moment to plow billions into building an open model, and it seems like that may pay off for Meta — it’s a sort of half-step ahead of the next-gen tech in training know-how and iron, and a fast follower to Anthropic and OpenAI.
I disagree with your analysis on inference, though. Stepping back a level from the trees of raw tokens available to the forest of “do I have enough inference on what I want inferred, at the speed I want, right now?”, the answer is absolutely not, by probably two orders of magnitude. With the current rise of using inference to improve training, we’re likely heading into a new era of thinking about how models work and improving them. The end-to-end agent approach you mention is a perfect example. These queries take a long time to generate, often in the ten-minute range, from OpenAI. When they’re under a second, Jevons paradox seems likely to make me want to issue like ten of them to compare / use as a “meta agent”. Combined with the massive utility of expanded context and the very real scaling problems with expanding attention into the millions-of-tokens range, we have a ways to go here.
Thanks again, appreciated the analysis!
> Generalist scaling is stalling. This was the whole message behind the release of GPT-4.5: capacities are growing linearly while compute costs are on a geometric curve. Even with all the efficiency gains in training and infrastructure of the past two years, OpenAI can't deploy this giant model with a remotely affordable pricing.
Hard disagree. 1. "Capacities are growing linearly while compute costs are on a geometric curve" is the very definition of scaling. GPT-4.5 continuing this trend is the opposite of stalling: it's proof that scaling continues to work. 2. "OpenAI can't deploy this giant model with a remotely affordable pricing" - WTF? GPT-4.5 has the same price per token as GPT-4 at release. It seems high compared to other models, but is still dirt cheap compared to human labor. And this model's increased quality means it is the only viable option for some tasks. I needed proofreading for my book: o1 and o3-mini were not up to the task, but GPT-4.5 really helps. GPT-4.5 is also a leap forward in agentic capabilities. So of course I'll pay for this; it saves me hours by enabling new use cases.
Nah.
The model is the talent. A talented model is good, but you need to know how to use it.
Out of curiosity, is it common to pay to use “AI” in 2025?
I have no desire to pay for any of these “products” even a little bit.
You're in the minority here.
Given the number of paying subscribers versus total users of ChatGPT, this isn't the case.
I mean on Hacker News.
I was tempted to pay to get around rate-limits for Claude, until I found out that paying subscribers are also severely rate-limited.