Token Budget Management in Long-Running AI Agents: Preventing Context Exhaustion
The context window is the single most critical resource constraint in any production AI agent system. Unlike memory, disk, or CPU — where exhaustion triggers explicit errors — context window exhaustion is silent and insidious. The model doesn't crash. It hallucinates. It forgets earlier tool outputs. It starts fabricating file contents it never read. Building agents that survive beyond trivial task sizes requires treating the token budget with the same rigor you'd apply to heap memory or database connections.
1. The Context Exhaustion Crisis
Your code review agent works beautifully in demos. It analyzes a 10-file pull request, identifies a null pointer risk in the service layer, flags a missing index on a foreign key, and produces a crisp summary. Then a real engineer submits a 200-file refactoring PR to rename a base package across the entire codebase. The agent starts reading files. Forty files in, the conversation history is already 60,000 tokens. Eighty files in, the model is approaching its limit. The agent keeps running — it doesn't know it's in trouble — but its output degrades catastrophically.
The failure mode is compounded by the fact that most LLM APIs don't return a warning when you're approaching the context limit. They silently truncate the oldest tokens or — in the case of models with hard limits — return a 400 context_length_exceeded error that breaks the agent's tool-use loop entirely. Neither behavior is acceptable in an autonomous agent that may be running an hour-long task. The solution is to manage the token budget proactively, inside your agent orchestration layer, before the model ever sees an oversized context.
2. Understanding Token Economics
Tokens are not bytes. A typical English word is approximately 1.3 tokens in GPT-4-class tokenizers. Source code is denser — Java and Python average 1.5–2.5 tokens per word because of camelCase identifiers, symbols, and indentation. JSON responses from API tools are particularly expensive: a moderately sized JSON blob of 500 bytes might consume 200–300 tokens depending on key names and nesting depth.
Context windows vary significantly across frontier models, and bigger is not unconditionally better:
# Context window sizes and approximate costs (as of Q1 2026)
# Input / Output tokens per 1M tokens
GPT-4o 128,000 tokens $2.50 in / $10.00 out
GPT-4o-mini 128,000 tokens $0.15 in / $0.60 out
Claude 3.5 Sonnet 200,000 tokens $3.00 in / $15.00 out
Claude 3 Haiku 200,000 tokens $0.25 in / $1.25 out
Gemini 2.0 Flash 1,000,000 tokens $0.075 in / $0.30 out
Gemini 2.0 Pro 1,000,000 tokens $1.25 in / $5.00 out
# A 100,000-token context on GPT-4o costs $0.25 per call just in input tokens.
# An agent loop with 20 iterations on that same context costs $5.00 in input alone.
The key insight around KV cache economics is often missed: when you reuse the same prefix across multiple LLM calls (system prompt + stable conversation history), providers like Anthropic and OpenAI cache the KV attention states for that prefix and bill cache hits at a heavily discounted rate (roughly 10% of the base input price for Anthropic cache reads, and 50% for OpenAI's automatic prompt caching). This means your context compression strategy should try to keep stable content at the top of the context and variable content — new tool results, latest user messages — at the bottom, maximizing cache hit rates and dramatically reducing per-call costs for long-running agents.
A 200,000-token context on Claude 3.5 Sonnet sounds unlimited until you calculate the economics of a 50-iteration agentic loop: 50 × $0.60 per call = $30 per task run, before even counting output tokens. For an enterprise agent processing hundreds of tasks per day, token budget management is not just a reliability concern — it's a fundamental cost control mechanism.
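The arithmetic above generalizes into a simple estimator worth keeping in your cost model. A minimal sketch (the function name and the default cache-read multiplier are assumptions, not any provider API; 0.1 matches Anthropic's cache-read pricing):

```python
def agent_loop_input_cost(
    context_tokens: int,
    iterations: int,
    price_per_million_in: float,
    cached_fraction: float = 0.0,
    cache_read_multiplier: float = 0.1,
) -> float:
    """Estimate input-token cost of an agent loop, in dollars.

    cached_fraction: share of the context billed at the cache-read rate
    (0.0 = no prompt caching). cache_read_multiplier: cache-read price as a
    fraction of the base input price.
    """
    per_call = context_tokens / 1_000_000 * price_per_million_in
    effective = per_call * ((1 - cached_fraction)
                            + cached_fraction * cache_read_multiplier)
    return effective * iterations

# The 50-iteration Claude 3.5 Sonnet example from the text:
print(agent_loop_input_cost(200_000, 50, 3.00))  # 30.0
# Same loop if 90% of the context is served from the prompt cache:
print(agent_loop_input_cost(200_000, 50, 3.00, cached_fraction=0.9))
```

Plugging in a high cached fraction shows why the cache-friendly context layout described above is worth engineering for: the same 50-iteration loop drops from $30 to under $6 in input costs.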
3. Token Budget Architecture
A robust token budget architecture pre-allocates the context window into distinct zones with hard limits, rather than letting content grow organically until it hits the model's limit. Think of it like memory segments: system, stack, heap — each serving a different purpose with different lifecycle characteristics.
# Context window allocation for a 128k-token model (GPT-4o example)
# Total budget: 128,000 tokens
SYSTEM_PROMPT_RESERVE = 8_000 # 6.25% - agent persona, tools schema, rules
TASK_DESCRIPTION = 4_000 # 3.1% - current task context
CONVERSATION_HISTORY = 40_000 # 31.2% - rolling window of past turns
TOOL_RESPONSES_BUFFER = 50_000 # 39.1% - tool call inputs and outputs
GENERATION_RESERVE = 16_000 # 12.5% - reserved for model output tokens
SAFETY_MARGIN = 10_000 # 7.8% - never fill above 118k total
# Watermark thresholds
WARN_WATERMARK = 0.70 # 89,600 tokens - start compressing history
CRITICAL_WATERMARK = 0.90 # 115,200 tokens - aggressive compression
HARD_LIMIT = 0.95 # 121,600 tokens - pause and checkpoint
The most important design decision is the generation reserve. Many agent frameworks fill the entire context window with input and let the model generate into whatever space remains. This is dangerous: if a tool response unexpectedly consumes an extra 5,000 tokens, your model may only have 2,000 tokens left for output — not enough to produce a structured tool call response, causing malformed JSON and loop failures. Always reserve a fixed block for generation, and treat that block as unallocatable for input.
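Enforcing the reserve takes only a few lines in the orchestration layer. A minimal sketch, assuming the 128k window and 16k generation reserve from the allocation above (the function names are illustrative):

```python
MODEL_CONTEXT = 128_000
GENERATION_RESERVE = 16_000

def max_input_tokens() -> int:
    # Input may never grow into the generation reserve.
    return MODEL_CONTEXT - GENERATION_RESERVE

def safe_completion_kwargs(input_token_count: int) -> dict:
    """Validate the input size and pin the output cap before an API call."""
    if input_token_count > max_input_tokens():
        raise ValueError(
            f"Input of {input_token_count} tokens would leave less than "
            f"{GENERATION_RESERVE} tokens for generation; compress first."
        )
    # Cap output explicitly so the model cannot overrun the window either.
    return {"max_tokens": GENERATION_RESERVE}
```

Raising before the call, rather than letting the provider truncate or error mid-loop, keeps the failure inside code you control.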
Token counting before sending is a non-negotiable prerequisite. Use the provider's own tooling: tiktoken for OpenAI models, Anthropic's token counting endpoint for Claude (client-side tokenizer packages for Claude exist but are community-maintained), and the count_tokens method in google-generativeai for Gemini. Never estimate based on character count — tokenization is model-specific and can differ by 30-40% for code-heavy content.
import tiktoken

def count_tokens_openai(messages: list[dict], model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    total = 3  # reply priming tokens (approximate, per OpenAI's cookbook guidance)
    for msg in messages:
        total += 3  # per-message overhead for role/content framing
        for key, value in msg.items():
            total += len(enc.encode(str(value)))
            if key == "name":
                total += 1  # name field adds 1 token
    return total

def count_tokens_for_tool_response(tool_result: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    # Tool responses carry ~6 tokens of overhead for role/tool_call_id framing
    return len(enc.encode(tool_result)) + 6
4. Context Compression Strategies
When the context budget hits the warn watermark, compression must begin. The strategies differ by content type and retrieval requirements:
Conversation history summarization: When the rolling conversation window exceeds its allocation, replace the oldest N turns with a compact summary generated by a smaller, cheaper model (GPT-4o-mini or Claude Haiku). The summary should preserve key decisions made, facts established, and errors encountered — not a verbatim replay. A 10-turn conversation consuming 8,000 tokens typically summarizes to 400–600 tokens with no meaningful information loss for the agent's current task.
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_old_turns(
    turns: list[dict],
    n_turns_to_compress: int,
    summarizer_model: str = "gpt-4o-mini",
) -> str:
    old_turns = turns[:n_turns_to_compress]
    conversation_text = "\n".join(
        f"{t['role'].upper()}: {t['content']}" for t in old_turns
    )
    response = openai_client.chat.completions.create(
        model=summarizer_model,
        messages=[
            {
                "role": "system",
                "content": (
                    "Summarize this agent conversation in under 300 words. "
                    "Preserve: decisions made, files read, errors encountered, "
                    "key facts established. Discard: verbose reasoning, "
                    "redundant tool outputs, exploratory dead ends."
                ),
            },
            {"role": "user", "content": conversation_text},
        ],
        max_tokens=400,
    )
    return response.choices[0].message.content
Selective memory retrieval: Instead of maintaining a full conversation history, persist all tool results and intermediate findings in a vector store (Pinecone, pgvector, or Chroma). Before each LLM call, semantically retrieve only the top-K most relevant past results for the current step. This transforms the context from a linear accumulator into a relevance-gated working memory. An agent that has read 100 files no longer carries all 100 files in context — it carries only the 5–10 most relevant to the current question being answered.
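The retrieval gate can be sketched without any external service. In the sketch below, embed() is a toy bag-of-words stand-in for a real embedding model, and WorkingMemory stands in for Pinecone/pgvector/Chroma; only the shape of the persist/retrieve flow carries over to production:

```python
import math

def embed(text: str) -> dict[str, float]:
    # Toy bag-of-words "embedding"; a real system uses an embedding model.
    vec: dict[str, float] = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class WorkingMemory:
    """Relevance-gated working memory: persist everything, retrieve top-K."""

    def __init__(self) -> None:
        self._entries: list[tuple[dict[str, float], str]] = []

    def persist(self, tool_result: str) -> None:
        self._entries.append((embed(tool_result), tool_result))

    def retrieve(self, query: str, top_k: int = 5) -> list[str]:
        qv = embed(query)
        ranked = sorted(self._entries,
                        key=lambda e: cosine(qv, e[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]

memory = WorkingMemory()
memory.persist("read_file OrderService.java: repository injected via constructor")
memory.persist("read_file UserController.java: missing null check on request body")
memory.persist("list_files src/main/java: 200 files")
print(memory.retrieve("null check in controller", top_k=1))
```

Before each LLM call, the agent queries with its current sub-question and injects only the returned entries, so context size scales with relevance rather than with total work done.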
Tool response compression: API tool responses are often the biggest single consumers of context tokens. A list_files tool returning a 2,000-file directory tree as JSON consumes 15,000+ tokens when only 10 files are relevant. Apply response compression at the tool boundary:
def compress_tool_response(
    tool_name: str,
    raw_response: str,
    max_tokens: int = 2000,
) -> str:
    token_count = count_tokens_openai([{"role": "user", "content": raw_response}])
    if token_count <= max_tokens:
        return raw_response
    if tool_name == "read_file":
        # Truncate file content with a clear marker; always keep the last 20 lines
        lines = raw_response.split("\n")
        half = len(lines) // 2
        return (
            "\n".join(lines[:half])
            + f"\n... [TRUNCATED: ~{token_count - max_tokens} tokens omitted] ...\n"
            + "\n".join(lines[-20:])
        )
    elif tool_name in ("search_results", "api_response"):
        # Keep the first max_tokens worth of content
        enc = tiktoken.encoding_for_model("gpt-4o")
        tokens = enc.encode(raw_response)
        truncated = enc.decode(tokens[:max_tokens])
        return truncated + f"\n[TRUNCATED: showing {max_tokens}/{token_count} tokens]"
    return raw_response
Message windowing: When all else fails, implement a sliding window that keeps only the last N turns of the conversation plus the original task description. This is the most aggressive strategy and should be a last resort, as it loses intermediate reasoning steps. Use it only in the critical watermark zone (90%+ context usage) while simultaneously persisting the dropped turns to external storage for potential re-injection.
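A windowing pass along these lines, returning the dropped turns so the caller can persist them (the function name and return shape are illustrative):

```python
def window_messages(
    messages: list[dict], keep_last: int = 8
) -> tuple[list[dict], list[dict]]:
    """Sliding window: keep system prompt + original task + last N turns.

    Returns (kept, dropped); dropped turns should be persisted to external
    storage for potential re-injection later.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    task, body = rest[:1], rest[1:]  # first non-system message = task description
    dropped, tail = body[:-keep_last], body[-keep_last:]
    return system + task + tail, dropped
```

Keeping the original task description pinned is the difference between an agent that forgets recent detail and one that forgets what it was asked to do.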
5. Implementation Patterns
A production-grade TokenBudgetManager sits between your agent's orchestration loop and the LLM API call. It intercepts every call, measures the current context size, applies the appropriate compression strategy, and emits metrics for observability:
from dataclasses import dataclass
from enum import Enum
import tiktoken

class BudgetStatus(Enum):
    HEALTHY = "healthy"      # < 70% used
    WARNING = "warning"      # 70-90% used — begin compression
    CRITICAL = "critical"    # 90-95% used — aggressive compression
    EXHAUSTED = "exhausted"  # >= 95% used — must checkpoint/pause

@dataclass
class TokenBudget:
    model: str
    total_context: int
    system_reserve: int = 8_000
    generation_reserve: int = 16_000
    safety_margin: int = 10_000

    @property
    def usable_tokens(self) -> int:
        return (self.total_context - self.system_reserve
                - self.generation_reserve - self.safety_margin)

class TokenBudgetManager:
    def __init__(self, budget: TokenBudget):
        self.budget = budget
        self.encoder = tiktoken.encoding_for_model(budget.model)

    def count(self, messages: list[dict]) -> int:
        total = 3  # reply priming tokens (approximate)
        for msg in messages:
            total += 3  # per-message framing overhead (approximate)
            for value in msg.values():
                total += len(self.encoder.encode(str(value)))
        return total

    def status(self, messages: list[dict]) -> BudgetStatus:
        ratio = self.count(messages) / self.budget.usable_tokens
        if ratio < 0.70:
            return BudgetStatus.HEALTHY
        if ratio < 0.90:
            return BudgetStatus.WARNING
        if ratio < 0.95:
            return BudgetStatus.CRITICAL
        return BudgetStatus.EXHAUSTED

    def prepare_messages(
        self,
        messages: list[dict],
        vector_store=None,
    ) -> list[dict]:
        status = self.status(messages)
        if status == BudgetStatus.WARNING:
            messages = self._summarize_history(messages, turns_to_compress=5)
        elif status == BudgetStatus.CRITICAL:
            messages = self._summarize_history(messages, turns_to_compress=10)
            messages = self._compress_tool_responses(messages)
        elif status == BudgetStatus.EXHAUSTED:
            messages = self._checkpoint_and_window(messages, keep_last=8)
        return messages

    def _summarize_history(self, messages, turns_to_compress):
        # Compress the oldest user/assistant turns into a summary message,
        # using summarize_old_turns() from section 4.
        system_msgs = [m for m in messages if m["role"] == "system"]
        non_system = [m for m in messages if m["role"] != "system"]
        to_compress = non_system[:turns_to_compress * 2]
        remaining = non_system[turns_to_compress * 2:]
        summary = summarize_old_turns(to_compress, len(to_compress), "gpt-4o-mini")
        summary_msg = {
            "role": "user",
            "content": f"[CONVERSATION SUMMARY - earlier context compressed]\n{summary}",
        }
        return system_msgs + [summary_msg] + remaining

    def _compress_tool_responses(self, messages):
        compressed = []
        for msg in messages:
            if msg.get("role") == "tool" and len(msg.get("content", "")) > 3000:
                msg = {**msg, "content": msg["content"][:3000] + "\n[TRUNCATED]"}
            compressed.append(msg)
        return compressed

    def _checkpoint_and_window(self, messages, keep_last):
        # Persist the full history externally, then keep only system + last N turns
        system_msgs = [m for m in messages if m["role"] == "system"]
        recent_turns = [m for m in messages if m["role"] != "system"][-keep_last:]
        return system_msgs + recent_turns
The checkpoint and resume pattern is essential for tasks that genuinely exceed any context window. Before entering the EXHAUSTED zone, serialize the full agent state — task description, completed steps, pending steps, all tool results — to durable storage (S3, Redis, or a task database). Resume by reconstructing a fresh context with only the essential state: system prompt, task definition, a structured summary of completed work, and the next pending action. This pattern enables agents to work on arbitrarily long tasks without being bounded by the context window at all.
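One way to sketch the checkpoint format and the resume-time context reconstruction (the field names and message layout are assumptions, not a fixed schema):

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class AgentCheckpoint:
    """Serializable agent state for checkpoint/resume across context resets."""
    task_description: str
    completed_steps: list[str] = field(default_factory=list)
    pending_steps: list[str] = field(default_factory=list)
    tool_results: dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "AgentCheckpoint":
        return cls(**json.loads(raw))

def resume_messages(cp: AgentCheckpoint, system_prompt: str) -> list[dict]:
    """Rebuild a fresh, minimal context from a checkpoint."""
    summary = "Completed so far:\n" + "\n".join(
        f"- {s}" for s in cp.completed_steps
    )
    next_action = cp.pending_steps[0] if cp.pending_steps else "produce final summary"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user",
         "content": f"{cp.task_description}\n\n{summary}\n\nNext: {next_action}"},
    ]
```

The JSON blob is what goes to S3/Redis; resume_messages shows the key property of the pattern: the fresh context carries a structured digest of completed work, not the raw history that exhausted the previous window.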
6. Failure Scenarios and Recovery
Truncated reasoning chains: The most common failure mode in near-exhausted contexts is that the model's chain-of-thought reasoning gets compressed by the context limit. The model starts generating conclusions without the intermediate steps, producing confident but ungrounded outputs. You can detect this by monitoring the ratio of reasoning tokens to answer tokens in the model's output — a sharp drop in reasoning-to-answer ratio is a reliable signal of context saturation.
Hallucinated tool calls: When the model loses track of which tools it has already called and what they returned, it begins generating tool calls for information it already has (or thinks it has, from hallucinated memory). Implement an idempotent tool call tracker that deduplicates tool calls by signature and returns cached results, preventing the agent from issuing the same expensive API call five times in one loop.
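A minimal sketch of such a tracker (the signature scheme and the run_tool callback are illustrative; arguments are assumed JSON-serializable):

```python
import json

class ToolCallTracker:
    """Deduplicate tool calls by (name, arguments) signature; cache results."""

    def __init__(self) -> None:
        self._cache: dict[str, str] = {}

    def signature(self, name: str, arguments: dict) -> str:
        # sort_keys makes the signature stable across argument ordering
        return f"{name}:{json.dumps(arguments, sort_keys=True)}"

    def execute(self, name: str, arguments: dict, run_tool) -> str:
        sig = self.signature(name, arguments)
        if sig in self._cache:
            return self._cache[sig]  # duplicate call: serve the cached result
        result = run_tool(name, arguments)
        self._cache[sig] = result
        return result
```

Beyond cost savings, the cached replay keeps the agent's loop moving: the model gets the answer it asked for without a second round-trip to a slow or rate-limited API.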
# Detecting context saturation via output quality metrics
def detect_context_saturation(response: str, tool_calls: list) -> bool:
    # Signal 1: very short reasoning before tool calls (model skipping CoT).
    # count_tokens_before_first_tool_call() is an application-specific helper.
    reasoning_tokens = count_tokens_before_first_tool_call(response, tool_calls)
    if reasoning_tokens < 50 and len(tool_calls) > 0:
        return True
    # Signal 2: repeated tool calls for the same resource
    call_signatures = [f"{c.name}:{sorted(c.arguments.items())}" for c in tool_calls]
    if len(call_signatures) != len(set(call_signatures)):
        return True
    # Signal 3: model references information not in the current context
    # (requires external fact-checking, simplified here)
    return False
Graceful degradation patterns: Rather than failing hard when the context is exhausted, implement tiered degradation. At WARNING level, compress history but continue normally. At CRITICAL level, switch to a model with a larger context window for the current call only (e.g., promote from GPT-4o-mini to Gemini 2.0 Flash with 1M context). At EXHAUSTED level, pause the agent, checkpoint state, notify the orchestrator, and either resume in a fresh context or escalate to a human operator.
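The tiers can be expressed as a small dispatch function. A sketch using plain status strings (the action names and fallback model choices are illustrative, not fixed APIs):

```python
def degrade(status: str, default_model: str = "gpt-4o-mini") -> dict:
    """Map a budget status to a degradation action and model choice."""
    if status == "healthy":
        return {"action": "continue", "model": default_model}
    if status == "warning":
        # Compress history but otherwise continue normally
        return {"action": "compress_history", "model": default_model}
    if status == "critical":
        # One-off promotion to a large-context model for this call only
        return {"action": "compress_aggressively", "model": "gemini-2.0-flash"}
    # exhausted: pause, checkpoint, and hand control back to the orchestrator
    return {"action": "checkpoint_and_pause", "model": None}
```

Keeping the policy in one place like this also makes it observable: log each returned action and you get a direct time series of how often the agent is operating degraded.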
7. Cost Optimization
Smart model routing within an agent loop can reduce costs by 60-80% without sacrificing quality. Not every step in an agent loop requires the frontier model. Classify steps by complexity: simple retrieval steps (reading files, listing directories, pattern matching) can run on GPT-4o-mini at 1/16th the cost. Complex reasoning steps (synthesizing findings, writing code, making architecture decisions) warrant the full model. Implement a step classifier that routes based on task type:
STEP_TYPE_MODEL_MAP = {
    "file_read":            "gpt-4o-mini",        # $0.15/1M in
    "search":               "gpt-4o-mini",        # $0.15/1M in
    "data_extraction":      "gpt-4o-mini",        # $0.15/1M in
    "code_generation":      "gpt-4o",             # $2.50/1M in
    "architecture_review":  "claude-3-5-sonnet",  # $3.00/1M in
    "security_analysis":    "gpt-4o",             # $2.50/1M in
    "final_synthesis":      "gpt-4o",             # $2.50/1M in
}

def route_model_for_step(step_type: str, context_tokens: int) -> str:
    # Escalate to a large-context model if needed, regardless of step type
    if context_tokens > 100_000:
        return "gemini-2.0-flash"  # 1M context at $0.075/1M
    return STEP_TYPE_MODEL_MAP.get(step_type, "gpt-4o-mini")
Prompt caching is the highest-leverage cost optimization for agents with stable system prompts. Anthropic's prompt caching (available on Claude 3.5 Sonnet and Haiku) bills cache reads at roughly 10% of the base input price for subsequent calls within a 5-minute window, for any cached prefix above the minimum length (1,024 tokens on Sonnet). For an agent with a 5,000-token system prompt running 30 steps, prompt caching alone eliminates most of the system prompt's input cost (offset slightly by a one-time surcharge on the initial cache write). Mark the system prompt content block with cache_control: {"type": "ephemeral"} in the request body, and structure your messages so the system prompt and task description always appear before variable content.
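For the Anthropic Messages API, the request shape looks roughly like this (the system prompt text and message content are placeholders; cache_control marks the end of the stable, cacheable prefix):

```python
# Anthropic Messages API request body with prompt caching enabled.
# The stable prefix (system prompt) carries cache_control; variable
# content (tool results, latest user turn) comes after it.
request = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 4096,
    "system": [
        {
            "type": "text",
            "text": "You are a code review agent. ... (5,000-token persona, "
                    "tool schemas, and rules go here)",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "Review the attached pull request ..."},
    ],
}
print(request["system"][0]["cache_control"])
```

Every call in the loop that reuses this exact system block within the cache window is billed at the cache-read rate for that prefix, which is why the budget architecture above keeps the system reserve at the very top of the context.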
Measuring cost per agent run should be treated as a first-class metric alongside latency and success rate. Instrument every LLM call with token counts, model used, cache hit/miss status, and step type. Aggregate to per-task cost, and set cost budgets (hard limits) per task type. An agent that achieves 95% task success at $0.08 per run is a product you can scale. The same agent at $4.50 per run is a prototype.
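A sketch of that per-call instrumentation with a hard cost budget (the field names and the 10% cache-read multiplier are assumptions to adapt per provider):

```python
from dataclasses import dataclass, field

@dataclass
class CallRecord:
    step_type: str
    model: str
    input_tokens: int
    output_tokens: int
    cache_hit: bool
    input_price: float   # $ per 1M input tokens
    output_price: float  # $ per 1M output tokens

    @property
    def cost(self) -> float:
        # Assume cache hits bill input at 10% of the base rate
        in_rate = self.input_price * (0.1 if self.cache_hit else 1.0)
        return (self.input_tokens * in_rate
                + self.output_tokens * self.output_price) / 1_000_000

@dataclass
class TaskCostTracker:
    budget_usd: float
    records: list[CallRecord] = field(default_factory=list)

    @property
    def total_cost(self) -> float:
        return sum(r.cost for r in self.records)

    def record(self, rec: CallRecord) -> None:
        self.records.append(rec)
        if self.total_cost > self.budget_usd:
            raise RuntimeError(
                f"Task exceeded cost budget of ${self.budget_usd:.2f}"
            )
```

Wiring record() into every LLM call gives you per-task cost aggregation for free, and the hard budget turns a runaway loop into an explicit, attributable failure instead of a surprise invoice.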
"A context window is not infinite RAM. It is a working memory with a hard limit, and the model's coherence degrades long before you hit that limit. The engineering discipline is to treat every token as a resource to be budgeted, allocated, and reclaimed — not as free space to be filled until something breaks."
— Reflecting production patterns from enterprise LLM deployments
Key Takeaways
- Context exhaustion is silent — models hallucinate before they crash; production agents must budget proactively rather than react to errors.
- Pre-allocate context zones — divide the window into system, history, tool buffer, and generation reserve; never fill more than 95% of total capacity.
- Count tokens before every call — use model-specific tokenizers (tiktoken, anthropic tokenizer) and monitor watermarks at 70% and 90% to trigger compression.
- Compress at the tool boundary — truncate oversized tool responses before they enter the context; this is the highest-yield compression intervention.
- Checkpoint and resume for long tasks — serialize agent state to durable storage at critical watermark and reconstruct a fresh context; context windows are not a ceiling on task length.
- Smart model routing cuts costs 60-80% — use cheaper models for retrieval steps, frontier models only for synthesis; implement prompt caching for stable system prompts.
Conclusion
Token budget management is the infrastructure layer that separates demo-grade AI agents from production-grade ones. The code review agent that works on 10 files and fails silently on 200 is not a buggy AI — it is an unmanaged resource consumer hitting its natural limit. By implementing budget zones, watermark-triggered compression, checkpoint/resume patterns, and smart model routing, you make the agent's behavior predictable and bounded regardless of task size.
The mental model shift is critical: stop thinking of the context window as available space to fill, and start treating every token as a resource to allocate, compress, or reclaim. An agent architecture built on this foundation scales from 10-file PRs to million-token codebases with consistent quality, predictable costs, and observable behavior — the three properties every production system demands.
Last updated: March 2026 — Written by Md Sanwar Hossain