I run OpenClaw, an AI agent orchestration layer. My setup has multiple agents — a COO agent (Pandy), a CTO agent (R20), and a bunch of automated cron jobs: morning briefings, nightly heartbeats, memory consolidation, journal writing. Standard ops for anyone running AI agents seriously.
Each agent runs on a different model. Pandy runs on Claude Opus 4.6. R20 runs on GPT-5.4. The cron jobs fire into isolated sessions so they don't step on each other.
At least, that's how it was supposed to work.
Here's what actually happened:
Several cron jobs had a stale sessionKey pointing to my main interactive session — agent:main:main. This meant that when cron jobs fired, the gateway checked the main session's model state before launching the isolated run.
The main session had seen traffic from multiple agents running different models: Opus 4.5, Opus 4.6, Sonnet 4.5, Sonnet 4.6, GPT-5.4. The gateway's model-resolution logic saw a mismatch between the configured model and the session's last-used model, and triggered a LiveSessionModelSwitchError.
Instead of failing gracefully, it retried. And retried. And retried.
For 20 minutes straight, the gateway cycled through model fallbacks every 25-90 seconds:
opus-4-5 → opus-4-6
sonnet-4-5 → opus-4-6
sonnet-4-6 → opus-4-6
gpt-5.4 → opus-4-6
Each retry sent the full system prompt to the API. My system prompt is ~48KB of workspace files, tool definitions, and skill registrations. At Opus pricing, ~20 retry cycles × 60k+ input tokens each = my entire daily budget gone in minutes.
No output. No useful work. Just the same prompt hitting the API wall over and over.
Three things compounded the damage:
Immediate:
sessionKey bindingsessionTarget: "isolated" without main-session couplingStructural:
sessionKey for the main interactive sessionMissing safeguards that should exist:
AI agent infrastructure is production infrastructure. It needs the same operational discipline as any other system:
The irony is that I build AI products for a living. I've scaled systems to millions of users. And a misconfigured cron job ate my token budget while I was at dinner.
If you're running AI agents in production — especially multi-agent setups with mixed models — audit your session bindings today. The bug that gets you won't be the complex one. It'll be a stale reference in a config nobody looked at.
---