Pouria Mojabi · AI Strategy & Startup Advisor · mojabi.io
🔬 AI Mar 29, 2026

I Let AI Run 900 Experiments While I Slept

Andrej Karpathy released autoresearch — an autonomous AI research loop. I adapted the pattern for prompt optimization. The AI ran 900+ experiments, tested 162 unique scenarios, and found things humans missed.

What Autoresearch Actually Is

Karpathy's autoresearch isn't a tool. It's a pattern. An AI agent that:

  1. Runs an experiment
  2. Evaluates the result
  3. Generates a hypothesis for improvement
  4. Designs the next experiment
  5. Repeats

No human in the loop. Just the AI, iterating.
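The whole pattern is a dozen lines of control flow. Here's a toy sketch in Python: `run_experiment`, `evaluate`, and `propose_next` are stand-ins (in production they'd call your LLM, apply your rubric, and generate the next hypothesis), and the numeric hill-climb exists only to make the sketch runnable.

```python
TARGET = 0.7  # hidden optimum the toy loop should discover

def run_experiment(config):
    """Stand-in: in production, run the prompt config against test scenarios."""
    return {"param": config["param"]}

def evaluate(result):
    """Stand-in evaluator: higher is better, peaks at TARGET."""
    return -abs(result["param"] - TARGET)

def propose_next(config, best_config):
    """Stand-in hypothesis step: nudge toward the best-known config,
    plus a small constant exploration push."""
    step = 0.5 * (best_config["param"] - config["param"])
    return {"param": config["param"] + step + 0.01}

def autoresearch(initial_config, budget=900):
    """Autonomous loop: experiment -> evaluate -> hypothesize -> repeat,
    with no human in the loop. Returns the best survivor."""
    config = initial_config
    best_config = config
    best_score = evaluate(run_experiment(config))
    for _ in range(budget):
        result = run_experiment(config)            # 1. run an experiment
        score = evaluate(result)                   # 2. evaluate the result
        if score > best_score:                     # keep the best survivor
            best_config, best_score = config, score
        config = propose_next(config, best_config) # 3-4. hypothesis + next design
    return best_config, best_score

best, score = autoresearch({"param": 0.0}, budget=900)
```

Swap the three stand-ins for real calls and you have the entire pattern; everything interesting lives in how well `evaluate` and `propose_next` are designed.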

I adapted this for optimizing an AI system's decision-making prompt — the core instruction set that determines how the system behaves in production.

The Setup

The system makes thousands of decisions daily. Each decision follows a complex prompt that evolved over months of manual testing. I wanted to know: could AI find improvements we missed?

I gave the autoresearch agent the production prompt, an automated evaluator built on our quality criteria, and free rein to iterate unattended.

Then I went to sleep.

What It Found

900+ experiments. 162 unique scenarios tested per run. Here's what surfaced:

Safety Issues We Missed

The AI found edge cases where the system showed harmful content to vulnerable users. Not theoretical risks — real scenarios with real user profiles. The kind of thing that would've been a PR disaster.

We had built safety guardrails. The autoresearch loop found the gaps between them.

Structural Failures (53% of the Time)

More than half the outputs violated the intended format. Not subtly. The system was supposed to return structured data. Instead it was returning prose, partial JSON, sometimes just error messages formatted as success.

We'd been testing happy paths. The AI stress-tested everything.
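Checks like this are cheap to automate. Below is a sketch of the kind of format classifier that catches all three failure modes we saw; the required fields are hypothetical, not the actual production schema.

```python
import json

REQUIRED_FIELDS = {"decision", "confidence"}  # hypothetical schema

def check_format(raw_output):
    """Classify a model output as valid structured data or one of the
    failure modes observed: prose, partial JSON, or an error payload
    formatted like a success."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        # Truncated JSON still starts with a brace; free text doesn't.
        return "partial_json" if raw_output.lstrip().startswith("{") else "prose"
    if not isinstance(data, dict):
        return "wrong_shape"
    if "error" in data:
        return "error_as_success"
    if not REQUIRED_FIELDS <= data.keys():
        return "missing_fields"
    return "valid"
```

Run every output through a classifier like this inside the loop and "53% structural failures" stops being a surprise you discover in production.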

The Quality Jump: 9% → 27%

On the first run, only 9% of outputs met our "perfect" criteria — correct format, safe content, contextually appropriate, zero hallucination.

After 900 experiments and prompt refinements, that number jumped to 27%. Three times better. Overnight.
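The "perfect" bar is conjunctive: an output counts only if it passes every criterion. A minimal sketch of that scoring, with placeholder checks standing in for the real classifiers:

```python
def is_perfect(output, checks):
    """An output is 'perfect' only if it passes every single criterion."""
    return all(check(output) for check in checks)

def perfect_rate(outputs, checks):
    """Fraction of outputs meeting all criteria: the 9% -> 27% metric."""
    if not outputs:
        return 0.0
    return sum(is_perfect(o, checks) for o in outputs) / len(outputs)

# Placeholder checks keyed on pre-labeled flags; in practice each is an
# automated classifier run over the raw output.
checks = [
    lambda o: o.get("format_ok", False),        # correct structured format
    lambda o: o.get("safe", False),             # safe content
    lambda o: o.get("in_context", False),       # contextually appropriate
    lambda o: not o.get("hallucinated", True),  # zero hallucination
]
```

Because the criteria multiply, a conjunctive metric stays brutally low until every failure mode is fixed at once, which is exactly why it's a better optimization target than any single check.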

6 Backend Bugs Nobody Noticed

The most unexpected finding: the AI exposed 6 backend bugs the engineering team hadn't caught. Edge cases that only surfaced under certain prompt configurations. The autoresearch loop didn't just optimize the prompt — it stress-tested the entire system.

Why This Pattern Matters

Most founders building AI products do this:

  1. Write a prompt
  2. Test it manually on 5-10 examples
  3. Ship it
  4. Wait for user complaints

What they should do:

  1. Write a prompt
  2. Let AI run 900 experiments while you sleep
  3. Ship the version that survived
  4. Repeat weekly

The bottleneck isn't the AI. It's how much human time you're willing to spend testing. Autoresearch removes that bottleneck.

The Takeaway

Autonomous AI research loops are a game changer for any system that uses LLM prompts.

You don't need to be a research lab. You don't need a big ML team. You need a prompt worth optimizing, an automated way to score its outputs, and a loop that keeps running without you.

The rest happens while you sleep.

Karpathy released the pattern. I used it for production. You should too.

