AI Agent Context Window Management: Fix LLM Context Drift

The System Was Broken

We run 4 AI agents autonomously — each wakes up every 6 hours on a heartbeat. Yesterday I found out the whole system was broken and nobody was actually doing their job.

The Mistake: Cramming Everything Into Heartbeats

The mistake: we crammed everything into the heartbeat file. Email scanning, CRM outreach, QA testing, reporting — 150 lines of instructions per agent. Looked thorough on paper.

What actually happened: 9 heartbeats fired in one session. Only 2 re-read the instruction file. The rest coasted on 174K tokens of prior conversation context — pattern-matching off procedural memory instead of following the playbook. Our ops agent spent 5 straight heartbeats re-reporting the same unresolved blocker. Zero emails scanned. Zero reports sent. Then she found a crypto wallet task buried in yesterday's notes and chased it for an hour. The instruction file said "non-negotiable." Didn't matter.

Second problem: concurrency was set to 2. All 4 agents tried to run simultaneously, only 2 got slots. Two agents hadn't executed a single heartbeat in over 24 hours. Starved out — and we didn't notice.

The Core Insight

LLMs in long-running sessions drift. They stop reading instructions and start vibing off context. The longer the session, the worse the compliance. Your beautiful instruction file becomes wallpaper after enough conversation turns.

The Architectural Fix

Heartbeats are now ~40 lines: read shared state, triage, log one learning, report, stop. No real work. No standing orders. All actual work moved to cron jobs with fresh sessions — zero prior context, no drift.

Three Rules for Building Autonomous Agents

Keep heartbeats dumb — the moment you add real work, the agent will eventually skip it.
Fresh sessions for real work — context pollution is the silent killer.
Set concurrency higher than you think — if agents can starve each other, they will.

The memory-level version of this problem is procedural amnesia — your agents forgetting how to do things they've already done. Two failure modes, one system.