Directionally I think this is right. Most LLM usage at scale tends to be filling the gaps between two hardened interfaces. The reliability comes not from the LLM's inference and generation but from the interfaces themselves, which only allow certain configurations to work with them.
LLM output is often coerced back into something more deterministic, such as types or DB primary keys. The value of the LLM is determined by how well your existing code and tools model the data, logic, and actions of your domain.
In some ways I view LLMs today a bit like 3D printers, both in terms of hype and in terms of utility. They excel at quickly connecting parts, similar to rapid prototyping with 3D-printed parts. For reliability and scale you want either the LLM or an engineer to replace the printed/inferred connector with something durable and deterministic (metal/code) that is cheap and fast to run at scale.
Additionally, there was a moment during the 3D printer Gartner hype cycle when the notion was that we would all just print substantial amounts of consumer goods, when in reality the high-utility use cases are much narrower. There is a corollary here to LLM usage. While LLMs are extremely useful, we cannot rely on LLMs to generate or infer our entire operational reality, or even engage meaningfully with it, without some sort of pre-existing digital modeling as an anchor.
The hype cycle for drones and VR was similar -- at the peak, you had people claiming drones would take over package delivery and everyone would spend their day in VR. The reality is that the applicability is narrower.
Interesting take but too bearish on LLMs in my opinion.
LLMs have already found large-scale usage (deep research, translation) which makes them more ubiquitous today than 3D printers ever will or could have been.
> Directionally I think this is right.
We have a term at work we use called, "directionally accurate", when it's not entirely accurate but headed in the right direction.
this is a really good take
Something I've realized about LLM tool use is that it means that if you can reduce a problem to something that can be solved by an LLM in a sandbox using tools in a loop, you can brute force that problem.
The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.
That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.
My assembly Mandelbrot experiment was the thing that made this click for me: https://simonwillison.net/2025/Jul/2/mandelbrot-in-x86-assem...
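A minimal sketch of that "tools in a loop" shape, where `call_llm` and the tool set are hypothetical stand-ins rather than any specific vendor API:

```python
# Hypothetical agent loop: the model repeatedly picks a tool, the sandbox runs it,
# and the result is fed back until the model declares success or the budget runs out.
def run_agent(task: str, tools: dict, call_llm, max_steps: int = 20) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_llm(history, list(tools))           # model chooses a tool or finishes
        if action["type"] == "finish":
            return action["answer"]
        result = tools[action["tool"]](**action["args"])  # executed inside the sandbox
        history.append({"role": "tool", "name": action["tool"], "content": str(result)})
    raise RuntimeError("step budget exhausted before the success criteria were met")
```

The hard parts are exactly the ones named above: choosing the tools, shaping the sandbox, and writing a success check the loop can trust.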
> The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide, and how to define the success criteria for the model.
Your test case seems like a quintessential example where you're missing that last step.
Since it is unlikely that you understand the math behind fractals or x86 assembly (apologies if I'm wrong on this), your only means of verifying the accuracy of your solution is a superficial visual inspection, e.g. "Does it look like the Mandelbrot set?"
Ideally, your evaluation criteria would be expressed as a continuous function, but at the very least, it should take the form of a sufficiently diverse quantifiable set of discrete inputs and their expected outputs.
That’s super cool, I’m glad you shared this!
I’ve been thinking about using LLMs for brute forcing problems too.
Like LLMs kinda suck at TypeScript generics. They're surprisingly bad at them. Probably because it's easy to write generics that look correct, but are then screwy in many scenarios. Which is also why generics are hard for humans.
If you could have any LLM actually use TSC, it could run tests, make sure things are inferring correctly, etc. It could just keep trying until it works. I'm not sure this is a way to produce understandable or maintainable generics, but it would be pretty neat.
Also, while typing this I realized that Cursor can see TypeScript errors. All I need are some utility testing types, and I could have Cursor write the tests and then brute force the problem!
If I ever actually do this I’ll update this comment lol
Makes sense.
I treat an LLM the same way I'd treat myself as it relates to context and goals when working with code.
"If I need to do __________ what do I need to know/see?"
I find that traditional tools, as per the OP, have become ever more powerful and useful in the age of LLMs (especially grep).
Furthermore, LLMs are quite good at working with shell tools and functionalities (heredoc, grep, sed, etc.).
Giving LLMs the right context -- eg in the form of predefined "cognitive tools", as explored with a ton of rigor here^1 -- seems like the way forward, at least to this casual observer.
1. https://github.com/davidkimai/Context-Engineering/blob/main/...
(the repo is a WIP book, I've only scratched the surface but it seems pretty brilliant to me)
> LLM in a sandbox using tools in a loop, you can brute force that problem
Does this require using big models through their APIs and spending a lot of tokens?
Or can this be done either with local models (probably very slow), or with subscriptions like Claude Code with Pro (without hitting the rate/usage limits)?
I saw the Mandelbrot experiment, it was very cool, but still a rather small project, not really comparable to a complex/bigger/older code base for a platform used in production
I've been using a VM for a sandbox, just to make sure it won't delete my files if it goes insane.
With some host data directories mounted read-only inside the VM.
This creates some friction, though. A tool that runs the AI agent in a VM but then copies its output to the host machine after some checks would help, so that it feels like you are running it natively on the host.
One of my biggest ongoing challenges has been to get the LLM to use the tool(s) that are appropriate for the job. It feels like teaching your kids to, say, do laundry, when you just want to tell them to step aside and let you do it.
> try completing a GitHub task with the GitHub MCP, then repeat it with the gh CLI tool. You'll almost certainly find the latter uses context far more efficiently and you get to your intended results quicker.
This is spot on. I have a "devops" folder with a CLAUDE.md with bash commands for common tasks (e.g. find prod / staging logs with this integration ID).
When I complete a novel task (e.g. count all the rows that were synced from stripe to duckdb) I tell Claude to update CLAUDE.md with the example. The next time I ask a similar question, Claude one-shots it.
These are the first few lines of the CLAUDE.md:
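This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Purpose
This devops folder is dedicated to Google Cloud Platform (GCP) operations, focusing on:
- Google Cloud Composer (Airflow) DAG management and monitoring
- Google Cloud Logging queries and analysis
- Kubernetes cluster management (GKE)
- Cloud Run service debugging
## Common DevOps Commands
### Google Cloud Composer
```bash
# View Composer environment details
gcloud composer environments describe meltano --location us-central1 --project definite-some-id
# List DAGs in the environment
gcloud composer environments storage dags list --environment meltano --location us-central1 --project definite-some-id
# View DAG runs
gcloud composer environments run meltano --location us-central1 dags list
# Check Airflow logs
gcloud logging read 'resource.type="cloud_composer_environment" AND resource.labels.environment_name="meltano"' --project definite-some-id --limit 50
```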
I feel like I'm taking crazy pills sometimes. You have a file with a set of snippets and you prefer to ask the AI to hopefully run them instead of just running them yourself?
Just as a related aside, you could literally make that bottom section into a super simple stdio MCP Server and attach that to Claude Code. Each of your operations could be a tool and have a well-defined schema for parameters. Then you are giving the LLM a more structured and defined way to access your custom commands. I'm pretty positive there are even pre-made MCP Servers that are designed for just this activity.
Edit: First result when looking for such an MCP Server: https://github.com/inercia/MCPShell
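For what it's worth, a minimal sketch of that idea, assuming the FastMCP helper from the official `mcp` Python SDK; the tool name and the hard-coded environment/project values are just lifted from the CLAUDE.md example above:

```python
# Hypothetical stdio MCP server wrapping one of the gcloud snippets as a tool.
import subprocess

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("devops")

@mcp.tool()
def describe_composer_env(location: str = "us-central1") -> str:
    """View Composer environment details for the meltano environment."""
    result = subprocess.run(
        ["gcloud", "composer", "environments", "describe", "meltano",
         "--location", location, "--project", "definite-some-id"],
        capture_output=True, text=True,
    )
    return result.stdout or result.stderr

if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport, which Claude Code can attach to
```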
I use a similar file, but just for myself (I've never used an LLM "agent"). I live in Emacs, but this is the only thing I use org-mode for: it lets me fold/unfold the sections, and I can press C-c C-c over any of the code snippets to execute it. Some of them are shell code, some of them are Emacs Lisp code which generates shell code, etc.
I do something similar, but the problem is that claude.md keeps on growing.
To tackle this, I converted a custom prompt into an application, but there is an interesting trade-off. The application is deterministic. It cannot deal with unknown situations. In contrast to CC, which is way slower, but can try alternative ways of dealing with an unknown situation.
I ended up adding an instruction to the custom command to run the application and fix the application code (TDD) if there is a problem. Self-healing software… who would have thought.
You're letting the LLM execute privileged API calls against your production/test/staging environment, just hoping it won't corrupt something, like truncate logs, files, databases etc?
Or are you asking it to provide example commands that you can sanity check?
I'd be curious to see some more concrete examples.
I have the feeling this is not really MCP specifically vs. other ways; it is pretty simple: at the current state of AI, having a human in the loop is much better. LLMs are great at certain tasks, but they often get trapped in local minima. If you do the back and forth via the web interface of LLMs -- ask it to write a program, look at it, provide hints to improve it, test it, ... -- you get much better results, and you don't cut yourself out of the loop only to find a 10k-line mess of code that could be 400 lines of clear code. That's the current state of affairs, but of course many will try very hard to replace programmers, which is currently not possible. What is possible is to accelerate the work of a programmer several times (but they must be good both at programming and at LLM usage), or to take a smart person with relatively low skill in some technology and, thanks to the LLM, make them productive in that field without the long training otherwise needed. And many other things. But "agentic coding" right now does not work well. This will change, but right now the real gain is to use the LLM as a colleague.
It is not MCP: it is autonomous agents that don't get feedback from smart humans.
So I run my own business (product), I code everything, and I use claude-code. I also wear all the other hats and so I'd be happy to let Claude handle all of the coding if / when it can. I can confirm we're certainly not there yet.
It's definitely useful, but you have to read everything. I'm working in a type-safe functional compiled language too. I'd be scared to try this flow in a less "correctness enforced" language.
That being said, I do find that it works well. It's not living up to the hype, but most of that hype was obvious nonsense. It continues to surprise me with its grasp of concepts and is definitely saving me some time, and more importantly making some larger tasks more approachable since I can split my time better.
My absolute favorite use of MCP so far is Bruce Hauman's clojure-mcp. In short, it gives the LLM (a) a bash tool, (b) a persistent Clojure REPL, and (c) structural editing tools.
The effect is that it's far more efficient at editing Clojure code than any purely string-diff-based approach, and if you write a good test suite it can rapidly iterate back and forth just editing files, reloading them, and then re-running the test suite at the REPL -- just like I would. It's pretty incredible to watch.
I was just going to comment about clojure-mcp!! It’s far and away the coolest use of mcp I’ve seen so far.
It can straight up debug your code, eval individual expressions, document return types of functions. It’s amazing.
It actually makes me think that languages with strong REPLs are a better fit for LLMs than those without. Seeing clojure-mcp do its thing is the most impressive AI feat I've seen since I saw GPT-3 in action for the first time.
https://github.com/bhauman/clojure-mcp
I think the GitHub CLI example isn't entirely fair to MCP. Yes, GitHub's CLI is extensively documented online, so of course LLMs will excel at generating code for well-known tools. But MCP shines in different scenarios.
Consider internal company tools or niche APIs with minimal online documentation. Sure, you could dump all the documentation into context for code generation, but that often requires more context than interacting with an MCP tool. More importantly, generated code for unfamiliar APIs is prone to errors, so you'd need robust testing and retry mechanisms built into the process.
With MCP, if the tools are properly designed and receive correct inputs, they work reliably. The LLM doesn't need to figure out API intricacies, authentication flows, or handle edge cases - that's already handled by the MCP server.
So I agree MCP for GitHub is probably overkill but there are many legitimate use cases where pre-built MCP tools make more sense than asking an LLM to reverse-engineer poorly documented or proprietary systems from scratch.
> Sure, you could dump all the documentation into context for code generation, but that often requires more context than interacting with an MCP tool.
MCP works exactly that way: you dump documentation into the context. That's how the LLM knows how to call your tool. Even for custom stuff, I noticed that giving the LLM things to work with that it already knows (e.g. Python, JavaScript, bash) beats having it use MCP tool calling, and in some ways it wastes less context.
YMMV, but I found the limit of tools available to be <15 with Sonnet 4. That's a super low amount. Basically, the official Playwright MCP alone is enough to fully exhaust your available tool space.
That's handled by the MCP server in the sense that it doesn't do authentication, etc.; it provides a simplified view of the world.
If that's what you wanted, you could have designed your poorly documented internal API differently to begin with. There's zero advantage to MCP in the scenario you describe aside from convincing people that their original API is too hard to use.
Regarding the Playwright example: I had the same experience this week attempting to build an agent first by using the Playwright MCP server, realizing it was slow, token-inefficient, and flaky, and rewriting with direct Playwright calls.
MCP servers might be fun to get an idea for what's possible, and good for one-off mashups, but API calls are generally more efficient and stable, when you know what you want.
Here's the agent I ended up writing: https://github.com/pamelafox/personal-linkedin-agent
Demo: https://www.youtube.com/live/ue8D7Hi4nGs
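As a rough illustration of the direct-calls approach (not code from the linked agent; the URL, selector, and `auth.json` session file are made up for the sketch):

```python
# Drive the browser directly with Playwright's sync API instead of routing
# every click and read through an MCP tool call.
from playwright.sync_api import sync_playwright

def fetch_notifications(storage_state: str = "auth.json") -> list[str]:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(storage_state=storage_state)  # reuse a logged-in session
        page = context.new_page()
        page.goto("https://www.linkedin.com/notifications/")
        items = page.locator("article").all_inner_texts()  # grab the visible notification cards
        browser.close()
    return items
```

The agent then only sees the distilled result, rather than a transcript of every intermediate page state.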
This is cool! Also have found the Playwright MCP implementation to be overkill and think of it more as a reference to an opinionated subset of the Playwright API.
LinkedIn is notorious for making it hard to build automations on top of it; did you run into any roadblocks when building your personal LinkedIn agent?
We’re playing an endless cat and mouse game of capabilities between old and new right now.
Claude Code shows that the models can excel at using “old” programmatic interfaces (CLIs) to do Real Work™.
MCP is a way to dynamically provide “new” programmatic interfaces to the models.
At some point this will start to converge, or at least appear to do so, as the majority of tools a model needs will be in its pre-training set.
Then we’ll argue about MPPP (model pre-training protocol pipeline), and how to reduce knowledge pollution of all the LLM-generated tools we’re passing to the model.
Eventually we’ll publish the Merriam-Webster Model Tool Dictionary (MWMTD), surfacing all of the approved tools hidden in the pre-training set.
Then the kids will come up with Model Context Slang (MCS), in an attempt to use the models to dynamically choose unapproved tools, for much fun and enjoyment.
Ad infinitum.
This is similar to the tool call (fixed code & dynamic params) vs code generation (dynamic code & dynamic params) discussion: tools offer constraints and save tokens, code gives you flexibility. Some papers suggest that generating code is often superior, and this will likely become even more true as language models improve.
[1] https://huggingface.co/papers/2402.01030
[2] https://huggingface.co/papers/2401.00812
[3] https://huggingface.co/papers/2411.01747
I am working on a model that goes a step beyond and even makes the distinction between thinking and code execution unnecessary (it is all computation in the end); unfortunately, no link to share yet.
More appropriately: the terminal is all you need.
I have used MCP daily for a few months. I'm now down to a single MCP server: terminal (iTerm2). I have OpenAPI specs on hand if I ever need to provide them, but honestly shell commands and curl get you pretty damn far.
I never knew how far it was possible to go in bash shell with the built-in tools until I saw the LLMs use them.
> It demands too much context.
This is solved trivially by having default initial prompts. All major tools like Claude Code or Gemini CLI have ways to set them up.
> You pass all your tools to an LLM and ask it to filter it down based on the task at hand. So far, there hasn't been much better approaches proposed.
Why is a "better" approach needed if modern LLMs can properly figure it out? It's not like LLMs don't keep getting better, with larger and larger context lengths. I never had a problem with an LLM struggling to use the appropriate MCP function on its own.
> But you run into three problems: cost, speed, and general reliability
- cost: They keep getting cheaper and cheaper. It's ridiculously inexpensive for what those tools provide.
- speed: That seems extremely short-sighted. No one is sitting idle looking at Claude Code in their terminal -- that would defeat the purpose -- and you can have more than one working on unrelated topics. No matter how long it takes, the time spent is purely bonus. You don't have to stay in the loop when asking for well-defined tasks.
- reliability: Seems very prompt-correlated at the moment. I guess some people don't know what to ask, which is the main issue.
Having LLMs able to complete tedious tasks involving so many external tools at once is simply amazing, thanks to MCP. Anecdotal, but just today it did a task flawlessly involving: Notion pages, a Linear ticket, git, a GitHub PR, and GitHub CI logs. Being in the loop was just submitting one review on the PR. All the while I was busy doing something else. And for what, ~$1?
> cost: They keep getting cheaper and cheaper
no they don't[0], the cost is just still hidden from you but the freebies will end just like MoviePass and cheap Ubers
https://bsky.app/profile/edzitron.com/post/3lsw4vatg3k2b
"Cursor released a $200-a-month subscription then made their $20-a-month subscription worse (worse output, slower) - yet it seems even on Max they're rate limiting people!"
https://bsky.app/profile/edzitron.com/post/3lsw3zwgw4c2h
> This is solved trivially by having default initial prompts. All majors tools like Claude Code or Gemini CLI have ways to set them up.
That only makes it worse. The MCP tools available all add to the initial context. The more tools, the more of the context is populated by MCP tool definitions.
> This is a significant advantage that an MCP (Multi-Component Pipeline) typically cannot offer
Oh god please no, we must stop this initialism. We've gone too far.
It's the wrong acronym. I wrote this blog post on the bike and used an LLM to fix up the dictation that I did. While I did edit it heavily and rewrote a lot of things, I did not end up noticing that my LLM expanded MCP incorrectly. It's Model Context Protocol.
We're all in line to get de-rezzed by the MCP, one way or another.
Yup, I can't help but think that a lot of the bad thinking comes from trying to avoid the following fact: LLMs are only good where your output does not need to be precise and/or verifiably "perfect," which is kind of the opposite of how code has worked, or has tried to work, in the past.
Right now I use it for: DRAFTS of prose things -- and the only real killer in my opinion, autotagging thousands of old bookmarks. But again, that's just to have cool stuff to go back and peruse, not something that must be correct.
The problem I see with MCP is very simple. It's using JSON as the format, and that's nowhere near as expressive as a programming language.
Consider a Python function signature:
list_containers(show_stopped: bool = False, name_pattern: Optional[str] = None, sort: Literal["size", "name", "started_at"] = "name"). It doesn't even need docs.
Now convert this to a JSON schema, which is already 4x larger as input.
And when generating output, the LLM will generate almost 2x more tokens too, because JSON. Easier to get confused.
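To make the size difference concrete, here is the same contract written both ways; the JSON side is hand-written and only roughly the shape of an MCP tool definition, not generated from any particular SDK:

```python
from typing import Literal, Optional

# The whole contract in one signature of the kind the model has seen endlessly in pretraining.
def list_containers(
    show_stopped: bool = False,
    name_pattern: Optional[str] = None,
    sort: Literal["size", "name", "started_at"] = "name",
) -> list[dict]: ...

# Roughly the same contract as a JSON Schema tool definition: several times the
# tokens before the model has produced a single token of its own.
LIST_CONTAINERS_TOOL = {
    "name": "list_containers",
    "description": "List containers.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "show_stopped": {"type": "boolean", "default": False},
            "name_pattern": {"type": ["string", "null"]},
            "sort": {
                "type": "string",
                "enum": ["size", "name", "started_at"],
                "default": "name",
            },
        },
        "required": [],
    },
}
```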
And consider that the flow of calling Python functions and using their output to call other tools etc. is seen 1000x more often in their fine-tuning data, whereas JSON tool calling flows are rare and practically only exist in the instruction tuning phase. Then I am sure instruction tuning also contains even more complex code examples where the model has to execute complex logic.
Then there's the whole issue of composition. To my knowledge, there's no way an LLM can do this in one response:

    vehicle = call_func_1()
    if vehicle.type == "car":
        details = lookup_car(vehicle.reg_no)
    elif vehicle.type == "motorcycle":
        details = lookup_motorcycle(vehicle.reg_no)

How is JSON tool calling going to solve this?
the reason to use the llm is that you don't know ahead of time that the vehicle type is only a car or motorcycle, and the llm will also figure out a way to detail bicycles and boats and airplanes, and to consider both left and right shoes separately.
the llm can't just be given this function because it's specialized to just the two options.
you could have it do a feedback loop of rewriting the python script after running it, but what's the savings at that point? you're wasting tokens talking about cars in python when you already know it's a ski, and the llm could ask directly for the ski details without writing a script to do it in between.
Great point.
But "the" problem with MCP? IMVHO (Very humble, non-expert) the half-baked or missing security aspects are more fundamental. I'd love to hear updates about that from ppl who know what they're talking about.
Honestly, I’m getting tired of these sweeping statements about what developers are supposed to be and how it’s “the right way to use AI”. We are in uncharted territories that are changing by the day. Maybe we have to drop the self-assurance and opinionated viewpoints and tackle this like a scientific problem.
100% agreed - he mentions 3 barriers to using MCP over code: "cost, speed, and general reliability". But all 3 of these could change by 10-100x within a few years, if not months. Just recently OpenAI dropped the price of using o3 by 80%.
This is not an environment where you can establish a durable manifesto.
> So maybe we need to look at ways to find a better abstraction for what MCP is great at, and code generation. For that we might need to build better sandboxes and maybe start looking at how we can expose APIs in ways that allow an agent to do some sort of fan out / fan in for inference. Effectively we want to do as much in generated code as we can, but then use the magic of LLMs after bulk code execution to judge what we did.
Por que no los dos? I ended up writing an OSS MCP server that securely executes LLM-generated JavaScript using a C# JS interpreter (Jint), handing it a `fetch` analogue as well as `jsonpath-plus`. Also gave it a built-in secrets manager.
Give it an objective and the LLM writes its own code and uses the tool iteratively to accomplish the task (as long as you can interact with it via a REST API).
For well known APIs, it does a fine job generating REST API calls.
You can pretty much do anything with this.
https://github.com/CharlieDigital/runjs
I use Julia at work, which benefits from long-running sessions, because it compiles functions the first time they run. So I wrote a very simple MCP that lets Claude Code send code to a persistent Julia kernel using Jupyter.
It had a much bigger impact than I expected – not only does test code run much faster (and not time out), but Claude seems to be much more willing to just run functions from our codebase rather than do a bunch of bespoke bash stuff to try and make something work. It's anecdotal, but CCUsage says my token usage has dropped nearly 50% since I wrote the server.
Of course, it didn't have to be MCP – I could have used some other method to get Claude to run code from my codebase more frequently. The broader point is that it's much easier to just add a useful function to my codebase than it is to write something bespoke for Claude.
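For anyone curious, the kernel half of a server like that can be quite small; this is a rough sketch using jupyter_client, where the kernel name, the output handling, and the MCP wrapper around it are all assumptions rather than details from the comment:

```python
# Keep one warm kernel alive and funnel code into it, so compiled functions stay compiled.
from queue import Empty
from jupyter_client import KernelManager

km = KernelManager(kernel_name="julia-1.10")  # assumes an installed IJulia kernel
km.start_kernel()
kc = km.client()
kc.start_channels()
kc.wait_for_ready(timeout=60)

def run(code: str, timeout: float = 60.0) -> str:
    """Execute code in the long-lived kernel and collect its printed output."""
    kc.execute(code)
    chunks = []
    while True:
        try:
            msg = kc.get_iopub_msg(timeout=timeout)
        except Empty:
            break
        if msg["msg_type"] == "stream":
            chunks.append(msg["content"]["text"])
        elif msg["msg_type"] == "status" and msg["content"]["execution_state"] == "idle":
            break
    return "".join(chunks)

print(run("println(2 + 2)"))  # later calls reuse the warm kernel, so no recompilation
```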
I always dreamed of a tool which would know the intent, semantics, and constraints of all inputs and outputs of any piece of code and thus could combine these code pieces automatically. It was always a fuzzy idea in my head, but this piece now made it a bit more clear. While LLMs could generate those adapters between distinct pieces automatically, it's an expensive (latency, tokens) process. Having a system with which not only to type the variables, but also to type the types (intents, semantic meaning, etc.) would be helpful but likely not sufficient. There has been so much work on ontologies, semantic networks, logical inference, etc., but all of it is spread all over the place. I'd like to have something like this integrated into a programming language and see what it feels like.
You can combine MCPs within composable LLM generated code if you put in a little work. At Continual (https://continual.ai), we have many workflows that require bulk actions, e.g. iterating over all issues, files, customers, etc. We inject MCP tools into a sandboxed code interpreter and have the agent generate both direct MCP tool calls and composable scripts that leverage MCP tools depending on the task complexity. After a bunch of work it actually works quite well. We are also experimenting with continual learning via a Voyager like approach where the LLM can save tool scripts for future use, allowing lifelong learning for repeated workflows.
Tools are constraints and time/token savers. Code is expensive in terms of tokens and harder to constrain in environments that can’t be fully locked-down because network access for example is needed by the task. You need code AND tools.
Wouldn't the sweet spot for MCP be where the LLM is able to do most of the heavy lifting on its own (outputting some kind of structured or unstructured output), but needs a bit of external/dynamic data that it can't do without? The list of MCP servers/tools it can use should nail that external lookup in a (mostly) deterministic way.
This would work best if a human is the end consumer of this output, or will receive manual vetting eventually. I'm not sure I'd leave such a system running unsupervised in production ("the Automation at Scale" part mentioned by the OP).
"Perhaps as a non-software engineer, code is out of reach."
Perhaps. Perhaps not. I am a "non-software engineer" and I use shell scripts to automate tasks every day.
I think there is actually too much code within reach. It gets in the way. Every day, more code.
One reason I like the Bourne shell is that it does not change much. It just keeps working.
Looks familiar; it seems to share some ideas with this one: LLM function calls don't scale, code orchestration is simpler, more effective.
Source: https://jngiam.bearblog.dev/mcp-large-data/ HN: https://news.ycombinator.com/item?id=44053744
The problem with this is that you have to give your LLM basically unbounded access to everything you have access to, which is a recipe for pain.
I wonder if having 2 LLM's communicate will eventually be more like humans talking. With all the same problems.
MCP is literally the same as giving an LLM a set of man page summaries and a very limited shell over HTTP. It’s just in a different syntax (JSON instead of man macros and CLI args).
It would be better for MCP to deliver function definitions and let the LLM write little scripts in a simple language.
If it can be done on the command line, I'll just give the Agent permission to run commands.
But if I want the Agent to do something that can't be done with commands, i.e. go into Google Docs and organize all my documents, then an MCP server would make sense.
I think the Playwright MCP is a really good example of the overall problem that the author brings up.
However, I couldn't really understand if he's saying that the Playwright MCP is good to use for your own app, or whether he means that for your own app you should just tell the LLM directly to export Playwright code.
I hit the same roadblock with MCP. If you work with data, the LLM becomes a very expensive pipe with an added risk of hallucinations. It's better to simply connect it to a Python environment enriched with the integrations you need.
I typically do this too rather than use MCP. Have the LLM write the tool, along with tests to get it working, then use the tool.
I frankly use tools mostly as an auth layer for things where raw access is too big a footgun without a permissions step. So I give the agent the choice of asking for permission to do things via the shell, or going nuts without user interaction via a tool that enforces reasonable limitations.
Otherwise you can, e.g., just give it a folder of preapproved scripts to run and explain their usage in a prompt.
Anyone else switch their LLM subscription every month? I'm back on ChatGPT for O3 use, but expect that Grok4 will be next.
I would like to see MCP integrate the notion of hypermedia controls.
Seems like that would be a potential way to get self-organizing integrations.
Off topic: That font/layout/contrast on the page is very pleasing and inviting.
Actually, this is the way the human brain works. It's what we now know as System 1 (automatic, unconscious) and System 2 (effortful, conscious), as described in the book Thinking, Fast and Slow.
Makes sense, and if realized, Deno is in an excellent position to be one of the leading, if not the main, sandbox runtimes for agents.
On the idea of replacing oneself with a shell script: I think there's nothing stopping people from replacing their use of an LLM with an LLM-generated "shell script" (and it should probably be encouraged).
Using an LLM to count how many Rs are in the word strawberry is silly. Using it to write a script to reliably determine how many <LETTER> are in <WORD> is not so silly.
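The kind of script meant here can be tiny; a sketch with a hypothetical helper name:

```python
def count_letter(word: str, letter: str) -> int:
    """Deterministically count occurrences of a letter, case-insensitively."""
    return word.lower().count(letter.lower())

assert count_letter("strawberry", "r") == 3  # no sampling, no hallucination, runs for free
```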
The same goes for many repeated tasks you'd have an LLM naively perform.
I think that is essentially what the article is getting at, but it's got very little to do with MCP. Perhaps the author has more familiarity with "slop" MCP tools than I do.
what does "MCP" stand for?
It's finally happening. The acceleration of the full AI disillusionment:
- LLMs will do everything.
- Shit, they won't. I'll do some traditional programming to put it on a leash.
- More traditional programming.
- Wait, this traditional programming thing is quite good.
- I barely use LLMs now.
- Man, that LLM stuff was a bad trip.
See you all on the other side!
Unpopular Opinion: I hate Bash. Hate it. And hate the ecosystem of Unix CLIs that are from the 80s and have the most obtuse, inscrutable APIs ever designed. Also this ecosystem doesn’t work on Windows — which, as a game dev, is my primary environment. And no, WSL does not count.
I don’t think the world needs yet another shell scripting language. They’re all pretty mediocre at best. But maybe this is an opportunity to do something interesting.
The Python environment is a clusterfuck, which uv is rapidly bringing into something somewhat sane. Python isn't the ultimate language. But I'd definitely be more interested in "replace yourself with a uv Python script" over "replace yourself with a shell script". It would be nice to use this as an opportunity to do better than Bash.
I realize this is unpopular. But unpopular doesn’t mean wrong.
tl;dr of one of today’s AI posts: all you need is code generation
It’s 2025 and this is the epitome of progress.
On the positive side, code generation can be solid if you also have, or can generate, easy-to-read validation or tests for the generated code. Easy for you to read, of course.
Isn't it a bit like saying: a saw is all you need (for carpenters)?
I mean, you _probably_ could make most furniture with only a saw, but why?