Pouria Mojabi · AI Strategy & Startup Advisor · mojabi.io
🔬 AI Mar 29, 2026

I Let AI Run 900 Experiments While I Slept

Andrej Karpathy released autoresearch — an autonomous AI research loop. I adapted the pattern for prompt optimization. The AI ran 900+ experiments, tested 162 unique scenarios, and found things humans missed.

What Autoresearch Actually Is

Karpathy's autoresearch isn't a tool. It's a pattern. An AI agent that:

  1. Runs an experiment
  2. Evaluates the result
  3. Generates a hypothesis for improvement
  4. Designs the next experiment
  5. Repeats

No human in the loop. Just the AI, iterating.
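The whole pattern is a dozen lines of control flow. Here's a toy sketch in Python: `run_experiment`, `evaluate`, and `propose_next` are stand-ins (in production they'd call your LLM, apply your rubric, and generate the next hypothesis), and the numeric hill-climb exists only to make the sketch runnable.

```python
TARGET = 0.7  # hidden optimum the toy loop should discover

def run_experiment(config):
    """Stand-in: in production, run the prompt config against test scenarios."""
    return {"param": config["param"]}

def evaluate(result):
    """Stand-in evaluator: higher is better, peaks at TARGET."""
    return -abs(result["param"] - TARGET)

def propose_next(config, best_config):
    """Stand-in hypothesis step: nudge toward the best-known config,
    plus a small constant exploration push."""
    step = 0.5 * (best_config["param"] - config["param"])
    return {"param": config["param"] + step + 0.01}

def autoresearch(initial_config, budget=900):
    """Autonomous loop: experiment -> evaluate -> hypothesize -> repeat,
    with no human in the loop. Returns the best survivor."""
    config = initial_config
    best_config = config
    best_score = evaluate(run_experiment(config))
    for _ in range(budget):
        result = run_experiment(config)            # 1. run an experiment
        score = evaluate(result)                   # 2. evaluate the result
        if score > best_score:                     # keep the best survivor
            best_config, best_score = config, score
        config = propose_next(config, best_config) # 3-4. hypothesis + next design
    return best_config, best_score

best, score = autoresearch({"param": 0.0}, budget=900)
```

Swap the three stand-ins for real calls and you have the entire pattern; everything interesting lives in how well `evaluate` and `propose_next` are designed.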

I adapted this for optimizing an AI system's decision-making prompt — the core instruction set that determines how the system behaves in production.

The Setup

The system makes thousands of decisions daily. Each decision follows a complex prompt that evolved over months of manual testing. I wanted to know: could AI find improvements we missed?

I gave the autoresearch agent the production prompt, an automated evaluator built on our quality criteria, and free rein to iterate unattended.

Then I went to sleep.

What It Found

900+ experiments. 162 unique scenarios tested per run. Here's what surfaced:

Safety Issues We Missed

The AI found edge cases where the system showed harmful content to vulnerable users. Not theoretical risks — real scenarios with real user profiles. The kind of thing that would've been a PR disaster.

We had built safety guardrails. The autoresearch loop found the gaps between them.

Structural Failures (53% of the Time)

More than half the outputs violated the intended format. Not subtly. The system was supposed to return structured data. Instead it was returning prose, partial JSON, sometimes just error messages formatted as success.

We'd been testing happy paths. The AI stress-tested everything.
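Checks like this are cheap to automate. Below is a sketch of the kind of format classifier that catches all three failure modes we saw; the required fields are hypothetical, not the actual production schema.

```python
import json

REQUIRED_FIELDS = {"decision", "confidence"}  # hypothetical schema

def check_format(raw_output):
    """Classify a model output as valid structured data or one of the
    failure modes observed: prose, partial JSON, or an error payload
    formatted like a success."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        # Truncated JSON still starts with a brace; free text doesn't.
        return "partial_json" if raw_output.lstrip().startswith("{") else "prose"
    if not isinstance(data, dict):
        return "wrong_shape"
    if "error" in data:
        return "error_as_success"
    if not REQUIRED_FIELDS <= data.keys():
        return "missing_fields"
    return "valid"
```

Run every output through a classifier like this inside the loop and "53% structural failures" stops being a surprise you discover in production.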

The Quality Jump: 9% → 27%

On the first run, only 9% of outputs met our "perfect" criteria — correct format, safe content, contextually appropriate, zero hallucination.

After 900 experiments and prompt refinements, that number jumped to 27%. Three times better. Overnight.
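The "perfect" bar is conjunctive: an output counts only if it passes every criterion. A minimal sketch of that scoring, with placeholder checks standing in for the real classifiers:

```python
def is_perfect(output, checks):
    """An output is 'perfect' only if it passes every single criterion."""
    return all(check(output) for check in checks)

def perfect_rate(outputs, checks):
    """Fraction of outputs meeting all criteria: the 9% -> 27% metric."""
    if not outputs:
        return 0.0
    return sum(is_perfect(o, checks) for o in outputs) / len(outputs)

# Placeholder checks keyed on pre-labeled flags; in practice each is an
# automated classifier run over the raw output.
checks = [
    lambda o: o.get("format_ok", False),        # correct structured format
    lambda o: o.get("safe", False),             # safe content
    lambda o: o.get("in_context", False),       # contextually appropriate
    lambda o: not o.get("hallucinated", True),  # zero hallucination
]
```

Because the criteria multiply, a conjunctive metric stays brutally low until every failure mode is fixed at once, which is exactly why it's a better optimization target than any single check.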

6 Backend Bugs Nobody Noticed

The most unexpected finding: the AI exposed 6 backend bugs the engineering team hadn't caught. Edge cases that only surfaced under certain prompt configurations. The autoresearch loop didn't just optimize the prompt — it stress-tested the entire system.

Why This Pattern Matters

Most founders building AI products do this:

  1. Write a prompt
  2. Test it manually on 5-10 examples
  3. Ship it
  4. Wait for user complaints

What they should do:

  1. Write a prompt
  2. Let AI run 900 experiments while you sleep
  3. Ship the version that survived
  4. Repeat weekly

The bottleneck isn't the AI. It's how much human time you're willing to spend testing. Autoresearch removes that bottleneck.

The Takeaway

Autonomous AI research loops are a game changer for any system that uses LLM prompts.

You don't need to be a research lab. You don't need a big ML team. You need a prompt worth optimizing, an automated way to score its outputs, and a loop that keeps running without you.

The rest happens while you sleep.

Karpathy released the pattern. I used it for production. You should too.

