From autoresearch to autoimprove

Generalizing the agentic experiment loop

Inspired by Karpathy's autoresearch and Shopify Liquid PR #2056

The pattern

Two projects independently discovered the same loop:

Karpathy (autoresearch)

Agent modifies train.py, trains for 5 min, checks val_bpb, keeps or discards. ~100 experiments overnight.

Tobi / Shopify (Liquid)

Agent modifies Ruby source, runs tests + benchmarks, keeps or discards. ~120 experiments. Result: 53% faster.

The human writes the process. The agent writes the code.

The loop

LOOP:
  1. Agent proposes a change
  2. git commit (verified)
  3. Run tests → fail? → git reset, next
  4. Run benchmark → extract score
  5. Check guards → violated? → git reset, next
  6. Score improved? → keep : git reset
  7. Log experiment → check stop conditions
  8. Repeat
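The keep/discard decision at the heart of the loop (steps 3–6) can be sketched as a pure function. This is an illustrative sketch, not the tool's actual API — the function name and defaults are assumptions:

```python
def decide(tests_pass, score, baseline, guards_ok, goal="lower", keep_if_equal=False):
    """Keep/discard decision for one experiment (steps 3-6 of the loop)."""
    if not tests_pass:      # step 3: correctness gate
        return "discard"
    if not guards_ok:       # step 5: secondary metrics must not regress
        return "discard"
    if score == baseline:   # ties are configurable via keep_if_equal
        return "keep" if keep_if_equal else "discard"
    improved = score < baseline if goal == "lower" else score > baseline
    return "keep" if improved else "discard"   # step 6
```

A kept experiment stays committed; anything else triggers the git reset in steps 3, 5, or 6.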

One file: improve.md

Part config, part prompt. Describes what to change, how to measure, and when to stop.

# autoimprove: make-it-faster
## Change
scope: the checkout handler and its database queries
exclude: test/, vendor/
## Check
test: go test ./...
run: go test -bench=. -benchmem
score: ns/op:\s+([\d.]+)
goal: lower
guard: allocs/op: ([\d.]+) < 500
keep_if_equal: true
timeout: 3m
## Stop
budget: 4h
stale: 15
## Instructions
Reduce allocations in hot paths. Try buffer reuse, fast-path patterns.
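One way to read such a file — a minimal sketch, not the tool's actual parser — is to split on `## ` section headers and collect the `key: value` pairs under each:

```python
def parse_improve_md(text):
    """Parse an improve.md-style file into {section: {key: value}}.
    Minimal sketch: skips the title line and free-text instructions."""
    sections, current = {}, None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip().lower()
            sections[current] = {}
        elif current is not None and ": " in line:
            key, _, value = line.partition(": ")  # split at the first ": "
            sections[current][key.strip()] = value.strip()
    return sections

cfg = parse_improve_md(
    "# autoimprove: demo\n## Check\ntest: go test ./...\ngoal: lower\n## Stop\nbudget: 4h\n"
)
```

Splitting at the first `": "` keeps regex values like `ns/op:\s+([\d.]+)` intact.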

Natural language scope

You describe what to optimize. The agent resolves it to specific files.

## Change
scope: the template parsing engine
exclude: test/, benchmark/
Resolved scope "the template parsing engine" to:
  - lib/liquid/parser.rb
  - lib/liquid/lexer.rb
  - lib/liquid/variable.rb
These are the ONLY files that will be modified. Confirm? [y/n]

exclude prevents the agent from grading its own homework.

Three-layer check

Tests

Correctness gate. Must pass for any experiment to be kept. Generated by goal-aware bootstrap.

Score

The metric to optimize. Extracted from stdout via convention, regex, or jq.

Guards

Secondary metrics that must not regress. Prevents improving speed by breaking reliability.

test: go test ./...                  # gate
run: go test -bench=. -benchmem     # score
score: ns/op:\s+([\d.]+)
guard: allocs/op: ([\d.]+) < 500    # guard
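Applying those patterns is plain regex extraction over the run command's stdout. The sample output below is synthetic, shaped to match the config's patterns rather than copied from a real benchmark run:

```python
import re

def extract(stdout, score_pattern, guard):
    """Pull the score and evaluate one guard from benchmark output.
    guard is a (pattern, op, limit) triple."""
    score = float(re.search(score_pattern, stdout).group(1))
    g_pattern, op, limit = guard
    g_value = float(re.search(g_pattern, stdout).group(1))
    guard_ok = g_value < limit if op == "<" else g_value > limit
    return score, guard_ok

# Synthetic output shaped to match the config's regexes
out = "BenchmarkCheckout  ns/op: 412.5  allocs/op: 87\n"
score, ok = extract(out, r"ns/op:\s+([\d.]+)", (r"allocs/op: ([\d.]+)", "<", 500))
```

Here the score is 412.5 and the guard passes (87 < 500), so the experiment proceeds to the improvement check.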

Goal-aware bootstrap

The optimization goal predicts what the agent will break.

| Goal | Agent will try to... | Tests guard against... |
| --- | --- | --- |
| Faster | Skip work, remove checks | Edge cases, unicode, nil, concurrency |
| Smaller | Remove things, swap deps | Features still work, runtime deps present |
| More accurate | Overfit, leak data | Data leakage, reproducibility, valid outputs |
| Better RAG | Game retrieval, stuff context | Format consistency, hallucination, empty results |
| Lower cost | Downsize, cut redundancy | Load handling, failover, durability |
/autoimprove bootstrap --generate

Auto-guided setup

One command. The agent detects what's missing and walks you through it.

/autoimprove

Checking readiness...
  1. improve.md          Not found → scaffold from repo type
  2. Scope resolution     Resolved to 3 files. Confirm?
  3. Eval harness         RAG detected → build golden set
  4. Test suite           No tests → generate goal-aware tests
  5. Git state            Clean
  6. Baseline             Score: 0.42, errors: 0%

Ready. Starting optimization loop.

10 domain templates

| Type | What it optimizes | Typical metric |
| --- | --- | --- |
| perf | Code performance | ns/op, req/sec, allocations |
| ml | ML training | val_bpb, loss |
| automl | Tabular ML | AUC-ROC, F1 |
| rag | RAG pipeline | answer relevancy, faithfulness |
| docker | Container size | image bytes |
| k8s | Cluster health | running pod count |
| prompt | Prompt quality | F1, accuracy |
| sql | Query performance | execution time |
| frontend | Bundle size | bundle bytes |
| ci | Build speed | build time |
/autoimprove init --type rag

Example: performance

# autoimprove: faster-checkout-api
## Change
scope: the checkout handler and its database queries
exclude: test/, vendor/
## Check
test: go test ./...
run: hey -n 1000 http://localhost:8080/checkout
score: Requests/sec:\s+([\d.]+)
goal: higher
guard: latency_p99: ([\d.]+) < 500
timeout: 3m
## Stop
budget: 4h | stale: 15
## Instructions
Try query batching, connection pooling, response caching.
Don't change the API contract or add dependencies.

Example: RAG pipeline

# autoimprove: better-rag-answers
## Change
scope: the RAG pipeline — chunking, retrieval, generation
exclude: data/, eval/
## Check
test: python -m pytest tests/test_pipeline.py -x
run: python eval/run_eval.py
score: answer_relevancy: ([\d.]+)
goal: higher
guard: error_rate: ([\d.]+) < 0.1
keep_if_equal: true
timeout: 5m
## Stop
budget: 6h | target: 0.92
## Instructions
Try: chunk size tuning, hybrid search, cross-encoder reranking,
query expansion, chain-of-thought generation.

Example: tabular ML

# autoimprove: better-churn-model
## Change
scope: the training pipeline
exclude: data/, evaluate.py
## Check
test: python -m pytest tests/ -x
run: python train.py && python evaluate.py
score: auc_roc: ([\d.]+)
goal: higher
guard: f1_score: ([\d.]+) > 0.6
timeout: 3m
## Stop
budget: 4h | target: 0.95
## Instructions
Try: ratio features, rolling aggregates, target encoding,
XGBoost vs LightGBM vs CatBoost, model stacking.

Goes beyond AutoML: the agent can engineer features, rewrite preprocessing, and try novel model combinations.

Real-world test: round 1

Applied to a RAG search engine: hybrid search over 44K chunks, 301 documents, 20-query golden set.

Baseline: 0.42 → after round 1: 0.44 (+4% improvement, 6 experiments)
| # | Experiment | Result |
| --- | --- | --- |
| 1 | Fix keyword query sanitization (special chars crashed) | kept +0.35% |
| 2 | Adjust hybrid weights (0.4/0.6) + fetch limit | discarded |
| 3 | Replace weighted merge with Reciprocal Rank Fusion | kept +3.6% |
| 4 | Boost results appearing in both retrieval lists | discarded (0%) |
| 5 | Limit max 2 results per source for diversity | kept (equal) |
| 6 | Lower RRF constant (k=30) + 5x fetch | discarded -2.1% |

Real-world test: round 2

Applied the improved protocol (per-experiment commits, guards, keep_if_equal, supersedes).

After round 1: 0.44 → after round 2: 0.46 (+9.3% total gain, 14 total experiments)
| # | Experiment | Result |
| --- | --- | --- |
| 7 | Boost results matching query in source name | discarded -4.8% |
| 8 | OR-mode keyword search (broader recall) | kept +1.7% |
| 9 | Fetch limit 5x | discarded |
| 10 | Better dedup key (120 chars vs 50) | kept (equal) |
| 11 | BM25 column weights (content=10x, name=2x) | kept +0.6% |
| 12 | Max 1 result per source | discarded |
| 13 | Query-text overlap boost after RRF | discarded -3.4% |
| 14 | Fetch limit 4x (supersedes #9) | kept +2.7% |

What the changes actually do

OR-mode keyword search

Changed "all words must match" to "any word matches." A chunk about "growth loops" now surfaces for "how to run growth experiments" even without the word "experiments." RRF handles the noise: results matching both keyword AND semantic score highest.
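As an illustration — assuming a Postgres-style tsquery syntax; the project's actual query builder isn't shown — the change amounts to swapping the join operator:

```python
import re

def build_tsquery(query, mode="or"):
    """Join query tokens with | (any word matches) or & (all words must match),
    in Postgres tsquery syntax."""
    tokens = re.findall(r"\w+", query.lower())
    op = " | " if mode == "or" else " & "
    return op.join(tokens)

# OR-mode: a chunk mentioning only "growth" still matches
q = build_tsquery("growth experiments")
```

`build_tsquery("growth experiments")` yields `"growth | experiments"` instead of the stricter `"growth & experiments"`.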

BM25 column weights

Keyword matches in content (weight 10) now score 5x higher than matches in metadata (weight 2). Previously a keyword hit in the "chunk_id" column scored the same as a hit in the actual text.
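SQLite's FTS5 exposes the same idea directly; the project's actual search engine isn't named, so treat this as an analogous sketch. `bm25()` takes one weight per column, and returns negative scores where more negative means a better match:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(name, content)")
con.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("growth-playbook", "notes on pricing and churn"),  # match in metadata only
     ("misc-note", "experiments with growth loops"),     # match in actual text
     ("pricing", "pricing tiers and plans"),             # filler rows so the term
     ("churn", "churn cohort analysis"),                 # is not in every document
     ("onboarding", "onboarding flow checklist")],
)
# Weight content 10x and name 2x; bm25() scores are negative,
# so ascending ORDER BY puts the best match first.
rows = con.execute(
    "SELECT name FROM docs WHERE docs MATCH 'growth' "
    "ORDER BY bm25(docs, 2.0, 10.0)"
).fetchall()
```

With the weights applied, the document matching in its text ranks above the one matching only in its name.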

Fetch limit 4x

With better keyword search (OR-mode + BM25 weights), more candidates are relevant. 4x fetch gives RRF a larger pool without drowning in noise. Hit rate on the golden set went from 35% to 40%.

RRF (from round 1)

Replaced weighted score merge with rank-based fusion. Avoids the normalization problem of comparing BM25 scores with cosine similarity. Single biggest improvement (+3.6%).
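RRF itself is a few lines: each document scores 1/(k + rank) in every list it appears in, so only positions matter, never raw scores. This is the standard formulation (k=60 is the common default; the round-1 experiment above tried k=30):

```python
def rrf(ranked_lists, k=60):
    """Fuse ranked result lists by summing 1/(k + rank) per document."""
    scores = {}
    for results in ranked_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword  = ["doc_a", "doc_b", "doc_c"]
semantic = ["doc_b", "doc_d", "doc_a"]
fused = rrf([keyword, semantic])
```

`doc_b` wins because it ranks high in both lists — exactly the "matching both keyword AND semantic" behavior described above, with no score normalization required.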

What we learned

The eval harness is the hard part

Building the golden set and eval script took longer than running 14 experiments. Most codebases don't have a measurable score out of the box.

Tests catch real bugs

Goal-aware bootstrap caught a crash on hyphenated queries before optimization started. The agent never would have found it by tuning scores.

Per-experiment commits matter

Round 1 skipped this and lost rollback ability. Round 2 committed every experiment. Clean git reset on every discard.

Experiments interact

A 5x fetch limit was discarded twice (experiments #6 and #9), but 4x succeeded in round 2 because OR-mode keyword search changed the quality of candidates. Context matters.

Agent-agnostic

The skill runs in Claude Code. The protocol runs anywhere.

/autoimprove                                  # Claude Code (interactive)
claude -p "run /autoimprove on improve.md"    # headless overnight
/autoimprove --export                         # generates program.md
codex -p "follow program.md"                  # any agent can follow it
gemini -p "follow program.md"

Not AutoML

| | AutoML | autoimprove |
| --- | --- | --- |
| Search space | Predefined grid | Open-ended |
| Changes | Numeric knobs | Rewrite code, try new algorithms |
| Strategy | Bayesian optimization | AI reasoning |
| Scope | ML hyperparameters | Any domain with a measurable score |
| Ceiling | Best from your grid | Unbounded |

The pattern works anywhere you have: a file to change, a command to run, and a number to improve.

Get started

/autoimprove    # that's it — setup is auto-guided

github.com/zanetworker/autoimprove