From Autoresearch to Autoimprove: Generalizing the Agentic Experiment Loop

Two unrelated projects discovered the same pattern: point an AI agent at code, give it a score to chase, and let it run experiments until morning. I generalized it into a tool that works for any domain.

🤖 Co-authored with AI

Last week, two things landed on my radar within hours of each other.

Andrej Karpathy published autoresearch, a setup where an AI agent modifies a GPT training script, trains for five minutes, checks the validation loss, and either keeps the change or rolls it back. Then it does it again. And again. You go to sleep. You wake up to 100 experiments, a log of what worked and what didn’t, and a better model.

The same week, Shopify CEO Tobi Lütke opened a PR against Liquid, their 20-year-old Ruby template engine. He’d run roughly 120 automated experiments using the same loop: modify code, run tests, benchmark, keep or discard. The result was a 53% speedup and 61% fewer object allocations. On a codebase hundreds of contributors have already optimized over two decades.

I kept staring at both of these. They were built independently, in different languages, for completely different problems. But the loop was identical.

autoimprove loop

The loop

Propose a change
  -> git commit
  -> run tests (fail? reset, try something else)
  -> run benchmark (extract a score)
  -> score improved? keep the commit : git reset
  -> log the experiment
  -> repeat

That’s it. The human writes the instructions. The agent writes the code. Git is the undo mechanism. The score is the compass. Everything else is detail.
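The whole loop fits in a screenful of code. Here’s a minimal sketch of a driver in Python, assuming hypothetical helper names (`sh`, `improved`, `step`) and that the agent has already edited the files before each call; the actual skill’s internals may differ.

```python
import re
import subprocess

def sh(cmd):
    """Run a shell command; return (exit code, combined output)."""
    p = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return p.returncode, p.stdout + p.stderr

def improved(score, best, goal="lower"):
    """The keep/discard rule: did the new score beat the best so far?"""
    return score < best if goal == "lower" else score > best

def step(best, test_cmd, bench_cmd, score_re, goal="lower", log=print):
    """One experiment. Git is the undo mechanism throughout."""
    sh("git add -A && git commit -m experiment")
    if sh(test_cmd)[0] != 0:
        sh("git reset --hard HEAD~1")        # tests fail: reset, try something else
        log("reset: tests failed")
        return best
    m = re.search(score_re, sh(bench_cmd)[1])
    score = float(m.group(1)) if m else best
    if m and improved(score, best, goal):
        log(f"kept: {best} -> {score}")      # score improved: keep the commit
        return score
    sh("git reset --hard HEAD~1")            # no improvement: discard
    log(f"reset: {score} not better than {best}")
    return best
```

Everything domain-specific lives in the arguments: which commands to run, which regex to score with, which direction is better.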

Karpathy called the instructions “research org code.” You’re not programming in Python or Ruby anymore. You’re programming in Markdown, describing the process the agent should follow. The meta-optimization problem is writing better instructions, not better code.

Generalizing the pattern

Both implementations were tightly coupled to their domains. Karpathy’s was for LLM training. Tobi’s was for Ruby performance. But strip away the domain specifics and you’re left with five pieces: a file to change, a command to run, a number to improve, a time budget, and a keep/discard rule backed by git.

I built autoimprove to make this work for anything. It’s a Claude Code skill (also agent-agnostic via an exported protocol). You can write an improve.md yourself, or just run /autoimprove in your project and it figures out what’s needed: detects the repo type, scaffolds the config, builds the eval harness, generates safety tests, establishes a baseline, and starts the loop. One command.

If you want full control, the improve.md looks like this:

# autoimprove: faster-api

## Change
scope: the checkout handler and its database queries

## Check
test: go test ./...
run: go test -bench=. -benchmem
score: ([\d.]+)\s+ns/op
goal: lower
timeout: 3m

## Stop
budget: 4h
stale: 15

## Instructions
Reduce allocations in hot paths.
Try buffer reuse, fast-path patterns, byte-level operations.

The improve.md is both the config and the prompt. The structured headers tell the agent what to change, how to measure, and when to stop. Everything after ## Instructions is free-form domain guidance.
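The score line is just a regex with one capture group, applied to the Check command’s output. A sketch of that extraction, using a sample line of `go test -bench=. -benchmem` output (the benchmark name and numbers are made up):

```python
import re

# Sample line from `go test -bench=. -benchmem` output.
bench_output = "BenchmarkCheckout-8   500000   2345 ns/op   128 B/op   3 allocs/op"

def extract_score(output, pattern):
    """Pull the score from the Check command's output.
    The pattern's first capture group is the number to optimize."""
    m = re.search(pattern, output)
    return float(m.group(1)) if m else None

extract_score(bench_output, r"([\d.]+)\s+ns/op")  # -> 2345.0
```

Swap the pattern and you’re optimizing image size, eval accuracy, or build time instead; the loop doesn’t care what the number means.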

What I learned running it for real

I pointed autoimprove at a RAG search engine I’d built. Hybrid search over 44,000 chunks, combining BM25 keyword search with cosine similarity embeddings. Twenty golden queries with expected results. A combined score of guest hit rate, MRR, keyword coverage, and snippet matching.

Fourteen experiments over two rounds. Baseline score: 0.42. Final score: 0.46. A 9.3% improvement.

The wins and failures tell different stories.

Reciprocal Rank Fusion replaced my weighted score merge (0.3 * BM25 + 0.7 * cosine). This was the biggest single improvement at +3.6%. The old approach required normalizing BM25 scores and cosine similarity to the same 0-1 range, which is fragile because min-max normalization depends on the worst result in each batch. RRF sidesteps this by using ranks instead of scores. A result ranked #1 in keyword search and #3 in semantic search gets a consistent RRF contribution regardless of raw score magnitudes.
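RRF scores each document as the sum of 1/(k + rank) across rankers, with k = 60 as the conventional constant. A minimal sketch (not the repo’s actual code):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: combine ranked lists using ranks, not raw
    scores, so no cross-ranker score normalization is needed."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked high by both rankers wins, regardless of score magnitudes.
rrf([["a", "b", "c"], ["b", "c", "a"]])  # -> ["b", "a", "c"]
```

The large k dampens the difference between adjacent ranks, so a #1-vs-#2 disagreement between rankers doesn’t dominate the fusion.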

OR-mode keyword search changed the FTS5 query from implicit AND (every token must match) to explicit OR (any token counts). A chunk about “growth loops” now surfaces for the query “how to run growth experiments” even if it never mentions the word “experiments.” Broader recall, with RRF naturally suppressing the noise since chunks matching both keyword AND semantic rank highest.
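A sketch of the kind of query rewrite this implies (the tokenization details are my assumption, not the repo’s exact code):

```python
import re

def to_or_query(text):
    """Rewrite free text as an explicit-OR FTS5 MATCH expression, so any
    token can match instead of every token (FTS5's implicit AND)."""
    tokens = re.findall(r"\w+", text.lower())
    return " OR ".join(tokens)

to_or_query("how to run growth experiments")
# -> "how OR to OR run OR growth OR experiments"
```

A chunk matching any one term is now a candidate, and the rank fusion decides which candidates actually deserve to surface.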

BM25 column weights were a small change with outsized impact. The FTS5 table had five columns: chunk_id, guest_name, speaker, timestamp, text. Without weights, a keyword hit in chunk_id (meaningless) scored the same as a hit in the actual text content. Adding bm25(chunks_fts, 0.0, 2.0, 1.0, 0.0, 10.0) to weight text at 10x fixed this. MRR jumped from 0.196 to 0.229.

Fetch limit 4x had failed in round 1 (too noisy with AND-mode keywords) but succeeded in round 2 after OR-mode and BM25 weights improved the quality of keyword candidates. Same idea, different context, opposite result. Experiments interact in ways you can’t predict upfront.

The failures were just as useful. Guest name boosting (-4.8%) hurt because most experts don’t have query keywords in their names (“Madhavan Ramanujam” doesn’t contain “pricing”). Query-text overlap boosting (-3.4%) favored chunks using the query’s exact vocabulary, but the most relevant content often uses different words (“willingness to pay” for a “pricing” query). Both failed for the same reason: they added a lexical signal on top of a system that already had a good one (BM25) and a better one (embeddings).

The eval harness is the real work

Building the golden set and evaluation script took longer than running all 14 experiments.

The loop is mechanical. An agent can do it all day. But creating “a command that produces a meaningful number” is a design problem. What counts as a good result? Which queries represent real use cases? How do you weight precision against recall against MRR?

For Docker image size or build time, this problem doesn’t exist. The benchmark is the eval. But for anything involving quality, someone has to decide what “good” looks like and encode it.

This was the biggest gap in the first version of autoimprove. It assumed you already had an eval. Most people don’t. So now /autoimprove handles it automatically: when it detects there’s no eval harness, it checks if the domain needs a golden set (RAG, prompts, and ML do; perf, Docker, and CI don’t) and walks you through building one. It runs the system with sample inputs, asks you to label the outputs, and assembles a golden set from your judgments. The labeling still requires you, because domain knowledge can’t be automated. But the scaffolding around it can.
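A golden-set harness can be as small as a list of (query, expected result) pairs and a scorer. A sketch of the MRR component, with hypothetical names; the real harness combines this with hit rate, keyword coverage, and snippet matching:

```python
def mrr(golden, search_fn, k=10):
    """Mean Reciprocal Rank over a golden set of (query, expected_id)
    pairs: 1/rank of the expected result, 0 if it's not in the top k."""
    total = 0.0
    for query, expected in golden:
        results = search_fn(query)[:k]
        if expected in results:
            total += 1.0 / (results.index(expected) + 1)
    return total / len(golden)
```

Once this exists, “did the experiment help?” stops being a judgment call and becomes a number the loop can act on.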

The goal tells you what to test

I didn’t expect test generation to be useful. It was the most useful part.

The idea is simple: your optimization goal predicts what the agent will try, and that predicts what it will break. Chasing speed? The agent will skip edge cases and remove nil checks. Chasing accuracy? It’ll overfit or leak test data. Shrinking an image? It’ll remove things that were actually runtime dependencies.

For the RAG search engine, the bootstrap step generated tests for special-character handling in queries. Those tests immediately caught a crash: FTS5 interprets hyphens as the NOT operator, so “product-market fit” was being parsed as “product NOT market fit” and crashing with a SQLite error. The agent would never have found this bug by tuning search scores. The test found it before the first experiment ran.
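Quoting each token as an FTS5 string literal is one way to neutralize operator characters; a sketch of that kind of fix (the generated test and the actual fix in the repo may differ):

```python
def quote_fts5(query):
    """Quote every token as an FTS5 string literal so characters like
    '-' (NOT) and ':' (column filter) lose their operator meaning.
    Embedded double quotes are escaped by doubling them."""
    return " ".join('"' + t.replace('"', '""') + '"' for t in query.split())

quote_fts5("product-market fit")  # -> '"product-market" "fit"'
```

The quoted form tokenizes “product-market” as an ordinary phrase instead of a syntax error, which is exactly the behavior the generated test pins down.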

The tests become immutable during the optimization loop. The agent can’t “improve” the score by weakening the guardrails.

Not AutoML

AutoML searches a predefined grid. Learning rate 0.01 or 0.001. Max depth 5 or 10. Estimator A or B. The space is bounded.

Autoimprove is unbounded. The agent can delete 40 lines and replace them with 5. It can replace an entire algorithm with a different one (RRF replacing weighted merge wasn’t a hyperparameter tweak, it was a structural change). No grid search would find that, because nobody put it in the grid.

The ceiling for AutoML is the best combination you parameterized. The ceiling for autoimprove is whatever the agent can think of.

Any agent, any language

The skill runs in Claude Code, but the format works anywhere. Export a program.md and point any agent at it:

codex -p "follow program.md"
gemini -p "follow program.md"

Human programs in Markdown. Agent programs in whatever language the codebase uses.

What’s next

The pattern works for single-file or small-scope targets: a search algorithm, a training loop, a Dockerfile, a prompt. Multi-scope optimization, where the agent changes multiple files simultaneously, is harder. Experiments become more expensive and the dependency graph matters.

The more interesting direction is parallel optimization. Multiple agents, each running autoimprove on different parts of the same system, merging improvements through git. The experiment tracking already supports branching. Nobody’s tried it at scale yet.

What I keep coming back to: the meta-skill isn’t writing code. It’s writing the instructions for the process that writes the code. Karpathy calls it “research org code.” I think he’s right that this is where the leverage moves.

The full walkthrough

autoimprove on GitHub | Full-screen presentation