I had 30 broken Playwright tests and no way to tell which ones actually mattered. The problem wasn’t “fix the tests” — it was that there’s no coverage tool for test infrastructure trustworthiness. I had to build the ruler before I could measure anything.
So I wrote a file that defined a composite metric (four weighted components → one score), an improvement loop, and constraints. Pointed Claude at it. Went to bed. Woke up to 12 commits and a score that had gone from 47 to 83.
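For a sense of scale: the whole scoring rig can be tiny. Here's a hedged sketch of what a bash + jq composite scorer might look like — the component names, values, and weights below are invented for illustration, not the ones from my actual GOAL.md:

```shell
#!/usr/bin/env bash
# Hypothetical composite scorer: four weighted components -> one 0-100 number.
# In practice each "value" would come from a real check (pass rate, flake
# rate, coverage, runtime); here they are hard-coded to show the arithmetic.
# Integer weights summing to 100 keep the math exact in jq.
cat > /tmp/components.json <<'EOF'
{
  "pass_rate": {"value": 90, "weight": 40},
  "stability": {"value": 70, "weight": 30},
  "coverage":  {"value": 80, "weight": 20},
  "speed":     {"value": 60, "weight": 10}
}
EOF

# Weighted sum of the components, normalized back to a 0-100 scale.
jq '[.[] | .value * .weight] | add / 100' /tmp/components.json  # prints 79
```

The point isn't the arithmetic; it's that the agent gets exactly one number it can try to move.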
The file became GOAL.md. The insight that surprised me: most software doesn’t have a natural scalar metric like val_bpb. You have to construct it. Documentation quality, API trustworthiness, test infrastructure confidence — these things have no pytest --cov equivalent. But once you build the ruler, the autoresearch loop works on them too.
The part I’m most uncertain about: the “dual score” pattern. When the agent is building its own measuring tools, it can game the metric by weakening the instrument. So the docs-quality example has two scores — one for the docs, one for the linter itself. The agent has to improve the telescope before it can use it. I think this is load-bearing but I’d love to hear if others have found different solutions to the same problem.
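To make the dual-score idea concrete, here's a hypothetical GOAL.md excerpt — the section names, script path, and penalty numbers are invented for illustration; the real docs-quality example lives in the repo:

```markdown
## Scores

### Score 1: Docs quality (0-100)
Output of `scripts/lint_docs.sh`. Higher is better.

### Score 2: Linter trustworthiness (0-100)
Starts at 100. Subtract 20 for every lint rule weakened or deleted
since the last run; subtract 10 for every failure the linter
silently skips.

## Rule
An improvement only counts if Score 1 goes up AND Score 2 does not go down.
```

The rule at the bottom is what stops the agent from "improving" the docs by dulling the instrument that measures them.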
Easiest way to try it: paste this into Claude Code, Cursor, or any coding agent and point it at one of your repos:
Read github.com/jmilinovich/goal-md, including the template and examples. Then write me a GOAL.md for this repo and start working on it.
Happy to hear what breaks. The scoring script is bash + jq so it’s not exactly production-grade, and the examples are biased toward the kinds of projects I work on. More examples from different domains would make the pattern sharper.
Thanks for sharing how this came to be - really sets the post apart.
Appreciate that! This fast feedback loop is what I like most about personal software.
Seems like the same utility as autoresearch, but much simpler. Does autoresearch work better?
Autoresearch is awesome for stuff that has a really clear loss function, but most problems don’t. So if you’re trying to improve product quality or write great docs, you can use GOAL.md.