Andrej Karpathy released autoresearch — an autonomous AI research loop. I adapted the pattern for prompt optimization. The AI ran 900+ experiments, tested 162 unique scenarios, and found things humans missed.
Karpathy's autoresearch isn't a tool. It's a pattern. An AI agent that:
No human in the loop. Just the AI, iterating.
I adapted this for optimizing an AI system's decision-making prompt — the core instruction set that determines how the system behaves in production.
The system makes thousands of decisions daily. Each decision follows a complex prompt that evolved over months of manual testing. I wanted to know: could AI find improvements we missed?
I gave the autoresearch agent:
Then I went to sleep.
900+ experiments. 162 unique scenarios tested per run. Here's what surfaced:
The AI found edge cases where the system showed harmful content to vulnerable users. Not theoretical risks — real scenarios with real user profiles. The kind of thing that would've been a PR disaster.
We had built safety guardrails. The autoresearch loop found the gaps between them.
More than half the outputs violated the intended format. Not subtly. The system was supposed to return structured data. Instead it was returning prose, partial JSON, sometimes just error messages formatted as success.
We'd been testing happy paths. The AI stress-tested everything.
On the first run, only 9% of outputs met our "perfect" criteria — correct format, safe content, contextually appropriate, zero hallucination.
After 900 experiments and prompt refinements, that number jumped to 27%. Three times better. Overnight.
The most unexpected finding: the AI exposed 6 backend bugs the engineering team hadn't caught. Edge cases that only surfaced under certain prompt configurations. The autoresearch loop didn't just optimize the prompt — it stress-tested the entire system.
Most founders building AI products do this:
What they should do:
The bottleneck isn't the AI. It's how much human time you're willing to spend testing. Autoresearch removes that bottleneck.
Autonomous AI research loops are a game changer for any system that uses LLM prompts.
You don't need to be a research lab. You don't need a big ML team. You need:
The rest happens while you sleep.
Karpathy released the pattern. I used it for production. You should too.