How a Silent Cron Loop Burned Through My Entire AI Token Budget in 20 Minutes

The Setup

I run OpenClaw, an AI agent orchestration layer. My setup has multiple agents — a COO agent (Pandy), a CTO agent (R20), and a bunch of automated cron jobs: morning briefings, nightly heartbeats, memory consolidation, journal writing. Standard ops for anyone running AI agents seriously.

Each agent runs on a different model. Pandy runs on Claude Opus 4.6. R20 runs on GPT-5.4. The cron jobs fire into isolated sessions so they don't step on each other.

At least, that's how it was supposed to work.

The Bug

Here's what actually happened:

Several cron jobs had a stale sessionKey pointing to my main interactive session — agent:main:main. This meant that when cron jobs fired, the gateway checked the main session's model state before launching the isolated run.

The main session had seen traffic from multiple agents running different models: Opus 4.5, Opus 4.6, Sonnet 4.5, Sonnet 4.6, GPT-5.4. The gateway's model-resolution logic saw a mismatch between the configured model and the session's last-used model, and triggered a LiveSessionModelSwitchError.

Instead of failing gracefully, it retried. And retried. And retried.

For 20 minutes straight, the gateway cycled through model fallbacks every 25-90 seconds:


opus-4-5 → opus-4-6
sonnet-4-5 → opus-4-6
sonnet-4-6 → opus-4-6
gpt-5.4 → opus-4-6

Each retry sent the full system prompt to the API. That prompt carries ~48KB of workspace files plus tool definitions and skill registrations, which comes to 60k+ input tokens per request. At Opus pricing, ~20 retry cycles × 60k+ input tokens each burned through my entire daily budget in minutes.

No output. No useful work. Just the same prompt hitting the API wall over and over.
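The arithmetic behind that burn can be written out as a rough sketch. The cycle count and tokens-per-retry come from the numbers above; the per-million price is a hypothetical placeholder, not a quoted rate, so substitute your provider's current input price before trusting the dollar figure.

```python
# Back-of-the-envelope input-token burn for the retry loop described above.
RETRY_CYCLES = 20            # "~20 retry cycles" from the incident
TOKENS_PER_RETRY = 60_000    # "60k+ input tokens each", treated as a lower bound

# HYPOTHETICAL placeholder, not a real price: replace with your provider's rate.
ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 10.0


def wasted_input_tokens(cycles: int, tokens_per_retry: int) -> int:
    """Total input tokens sent across all retries (no output was produced)."""
    return cycles * tokens_per_retry


def wasted_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Convert an input-token count to dollars at a given per-million rate."""
    return tokens / 1_000_000 * usd_per_million


total = wasted_input_tokens(RETRY_CYCLES, TOKENS_PER_RETRY)
print(f"{total:,} input tokens")  # 1,200,000 input tokens
cost = wasted_cost_usd(total, ASSUMED_USD_PER_MILLION_INPUT_TOKENS)
print(f"${cost:.2f} at the assumed rate")
```

Since 60k is a floor, the real total was at least 1.2 million input tokens with nothing to show for it.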

What Made It Worse

Three things compounded the damage:

No circuit breaker. The retry loop had no max-attempts limit. It would have run until the rate limiter or my wallet said stop.

Silent failure. No alert fired. No notification. The gateway logs showed it happening in real time, but nothing surfaced it to me. I only noticed when Anthropic's usage dashboard showed the spike.

Fat system prompt. My system prompt had grown organically: MEMORY.md (10KB), SOUL.md (6.5KB), HEARTBEAT.md (4.7KB), TOOLS.md (4.7KB), PLAYBOOKS.md (11.6KB), plus QUEUE.md (4.6KB). Those six files alone come to roughly 42KB of the ~48KB workspace total, and that's before tool definitions and skill registrations. Every retry sent all of it.

The Fix

Immediate:

Killed the offending cron jobs
Recreated them without the stale sessionKey binding
Verified that all remaining crons use sessionTarget: "isolated" with no main-session coupling
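The verification step can be automated with a small audit. This is a minimal sketch that assumes each cron job is available as a dict with `name`, `sessionTarget`, and `sessionKey` fields. That record layout and the `find_coupled_jobs` helper are hypothetical and not OpenClaw's actual schema or API; only the `sessionKey`, `sessionTarget: "isolated"`, and `agent:main:main` values come from the incident.

```python
from typing import Any

MAIN_SESSION_KEY = "agent:main:main"  # the main interactive session named above


def find_coupled_jobs(jobs: list[dict[str, Any]]) -> list[str]:
    """Return names of isolated cron jobs that still carry the main session key."""
    offenders = []
    for job in jobs:
        isolated = job.get("sessionTarget") == "isolated"
        bound_to_main = job.get("sessionKey") == MAIN_SESSION_KEY
        if isolated and bound_to_main:
            offenders.append(job.get("name", "<unnamed>"))
    return offenders


# Hypothetical job records; the field layout is assumed for illustration.
jobs = [
    {"name": "morning-briefing", "sessionTarget": "isolated"},
    {"name": "journal", "sessionTarget": "isolated", "sessionKey": "agent:main:main"},
]
print(find_coupled_jobs(jobs))  # ['journal']
```

Running a check like this on every config change would catch the stale binding before the next cron fires.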

Structural:

Cron jobs that target isolated sessions should NEVER reference a sessionKey for the main interactive session
Multi-agent setups need model-scope isolation — R20's crons shouldn't know Pandy's model exists
System prompt needs a size budget. 48KB is bloated. Target: 20KB max.
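A size budget only helps if something enforces it. Here is a rough sketch of a check that could run in CI or before each deploy. The file names and the 20KB target come from this post; the workspace path and the helper names `prompt_size_report` and `over_budget` are assumptions for illustration.

```python
from pathlib import Path

BUDGET_BYTES = 20 * 1024  # the 20KB target from the post


def prompt_size_report(files: list[Path]) -> tuple[int, list[tuple[str, int]]]:
    """Return total bytes and per-file sizes, largest first. Missing files are skipped."""
    sizes = [(f.name, f.stat().st_size) for f in files if f.is_file()]
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sum(size for _, size in sizes), sizes


def over_budget(total_bytes: int, budget_bytes: int = BUDGET_BYTES) -> bool:
    """True when the combined prompt files exceed the budget."""
    return total_bytes > budget_bytes


if __name__ == "__main__":
    workspace = Path(".")  # assumed workspace location
    names = ["MEMORY.md", "SOUL.md", "HEARTBEAT.md", "TOOLS.md", "PLAYBOOKS.md", "QUEUE.md"]
    total, sizes = prompt_size_report([workspace / n for n in names])
    for name, size in sizes:
        print(f"{name:14} {size / 1024:6.1f} KB")
    status = "OVER" if over_budget(total) else "within"
    print(f"total {total / 1024:.1f} KB, {status} the {BUDGET_BYTES // 1024} KB budget")
```

Sorting largest-first makes it obvious which file to trim when the check fails.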

Missing safeguards that should exist:

Max retry count on model-switch errors (3 attempts, then fail hard)
Cost anomaly alerting — if input tokens spike 10x in 5 minutes, pause and notify
Cron health dashboard showing consecutive errors (the journal cron had 4 consecutive failures before this happened — that should have been a red flag)
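The first safeguard is the simplest to sketch: cap the attempts and fail loudly. The code below is a minimal illustration, not the gateway's real retry logic. `LiveSessionModelSwitchError` is redefined here as a stand-in for the gateway error, and `run_with_retry_cap`, `RetryBudgetExhausted`, and the `launch` callable are hypothetical names.

```python
class LiveSessionModelSwitchError(Exception):
    """Stand-in for the gateway error named in the post."""


class RetryBudgetExhausted(Exception):
    """Raised when the attempt cap is hit; callers should alert, not retry."""


def run_with_retry_cap(launch, max_attempts: int = 3):
    """Call launch() up to max_attempts times, then fail hard."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return launch()
        except LiveSessionModelSwitchError as err:
            last_error = err
            print(f"attempt {attempt}/{max_attempts} failed: {err}")
    raise RetryBudgetExhausted(
        f"gave up after {max_attempts} model-switch errors"
    ) from last_error


def always_mismatched():
    # Simulates the incident: every launch hits the same model mismatch.
    raise LiveSessionModelSwitchError("configured model != session model")


try:
    run_with_retry_cap(always_mismatched)
except RetryBudgetExhausted as err:
    print(err)  # gave up after 3 model-switch errors
```

With a cap of 3, the same failure would have cost three prompt sends instead of roughly twenty.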

The Lesson

AI agent infrastructure is production infrastructure. It needs the same operational discipline as any other system:

Circuit breakers on retry loops
Cost guardrails with anomaly detection
Health monitoring that surfaces degradation before it becomes catastrophic
Prompt hygiene — treat your system prompt like a codebase, not a junk drawer
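The cost guardrail can be sketched as a sliding-window check that compares the last 5 minutes of input tokens against the 5 minutes before. The 10x factor and 5-minute window come from the safeguard described earlier. The `TokenSpikeDetector` class and its `min_baseline_tokens` floor are assumptions added for illustration, not an existing gateway feature.

```python
from collections import deque

WINDOW_SECONDS = 300   # 5 minutes, per the post
SPIKE_FACTOR = 10      # "10x", per the post


class TokenSpikeDetector:
    """Flags when the last window's input tokens exceed 10x the prior window's."""

    def __init__(self, min_baseline_tokens: int = 10_000):
        # Assumed floor so an idle prior window does not make
        # every single request look like a spike.
        self.min_baseline = min_baseline_tokens
        self.events = deque()  # (timestamp_seconds, input_tokens)

    def record(self, timestamp: float, input_tokens: int) -> bool:
        """Record one API call; return True if spending should be paused."""
        self.events.append((timestamp, input_tokens))
        # Drop events older than two windows; they no longer matter.
        while self.events and self.events[0][0] <= timestamp - 2 * WINDOW_SECONDS:
            self.events.popleft()
        cutoff = timestamp - WINDOW_SECONDS
        current = sum(t for ts, t in self.events if ts > cutoff)
        previous = sum(t for ts, t in self.events if ts <= cutoff)
        baseline = max(previous, self.min_baseline)
        return current > SPIKE_FACTOR * baseline


detector = TokenSpikeDetector()
detector.record(0, 5_000)            # normal traffic
print(detector.record(400, 60_000))  # False: 60k is under 10 x the 10k floor
print(detector.record(460, 60_000))  # True: 120k in the current window
```

In this simulation the detector trips on the second 60k-token retry, which is the point where a pause-and-notify hook would fire.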

The irony is that I build AI products for a living. I've scaled systems to millions of users. And a misconfigured cron job ate my token budget while I was at dinner.

If you're running AI agents in production — especially multi-agent setups with mixed models — audit your session bindings today. The bug that gets you won't be the complex one. It'll be a stale reference in a config nobody looked at.
