Md Sanwar Hossain
Software Engineer · Java · Spring Boot · Microservices

Agentic AI Production Series · March 19, 2026 · 17 min read

Token Budget Management in Long-Running AI Agents: Preventing Context Exhaustion

The context window is the single most critical resource constraint in any production AI agent system. Unlike memory, disk, or CPU — where exhaustion triggers explicit errors — context window exhaustion is silent and insidious. The model doesn't crash. It hallucinates. It forgets earlier tool outputs. It starts fabricating file contents it never read. Building agents that survive beyond trivial task sizes requires treating the token budget with the same rigor you'd apply to heap memory or database connections.

Table of Contents

  1. The Context Exhaustion Crisis
  2. Understanding Token Economics
  3. Token Budget Architecture
  4. Context Compression Strategies
  5. Implementation Patterns
  6. Failure Scenarios and Recovery
  7. Cost Optimization
  8. Key Takeaways
  9. Conclusion

1. The Context Exhaustion Crisis

Your code review agent works beautifully in demos. It analyzes a 10-file pull request, identifies a null pointer risk in the service layer, flags a missing index on a foreign key, and produces a crisp summary. Then a real engineer submits a 200-file refactoring PR to rename a base package across the entire codebase. The agent starts reading files. Forty files in, the conversation history is already 60,000 tokens. Eighty files in, the model is approaching its limit. The agent keeps running — it doesn't know it's in trouble — but its output degrades catastrophically.

Real scenario: a code review agent processing a 200-file PR initially performs well, correctly identifying issues in the first 60 files. By file 100, the tool call responses from earlier file reads are still in context, consuming ~80,000 tokens. By file 140, the model begins hallucinating: generating comments about code patterns it never actually read, confabulating function signatures from file names alone, and confidently flagging non-existent bugs. The agent completes the review and reports success. The developer trusts the output. Two real bugs get missed. One hallucinated bug triggers a 45-minute investigation.

The failure mode is compounded by the fact that most LLM APIs don't return a warning when you're approaching the context limit. They silently truncate the oldest tokens or — in the case of models with hard limits — return a 400 context_length_exceeded error that breaks the agent's tool-use loop entirely. Neither behavior is acceptable in an autonomous agent that may be running an hour-long task. The solution is to manage the token budget proactively, inside your agent orchestration layer, before the model ever sees an oversized context.

2. Understanding Token Economics

Tokens are not bytes. A typical English word is approximately 1.3 tokens in GPT-4-class tokenizers. Source code is denser — Java and Python average 1.5–2.5 tokens per word because of camelCase identifiers, symbols, and indentation. JSON responses from API tools are particularly expensive: a moderately sized JSON blob of 500 bytes might consume 200–300 tokens depending on key names and nesting depth.

Context windows vary significantly across frontier models, and bigger is not unconditionally better:

# Context window sizes and approximate costs (as of Q1 2026)
# Input / Output tokens per 1M tokens

GPT-4o               128,000 tokens    $2.50 in   / $10.00 out
GPT-4o-mini          128,000 tokens    $0.15 in   / $0.60 out
Claude 3.5 Sonnet    200,000 tokens    $3.00 in   / $15.00 out
Claude 3 Haiku       200,000 tokens    $0.25 in   / $1.25 out
Gemini 2.0 Flash   1,000,000 tokens    $0.075 in  / $0.30 out
Gemini 2.0 Pro     1,000,000 tokens    $1.25 in   / $5.00 out

# A 100,000-token context on GPT-4o costs $0.25 per call just in input tokens.
# An agent loop with 20 iterations on that same context costs $5.00 in input alone.

The key insight around KV cache economics is often missed: when you reuse the same prefix across multiple LLM calls (system prompt + stable conversation history), providers like Anthropic and OpenAI cache the attention KV states for that prefix and charge cache hits at a steep discount (roughly 10% of the normal input price on Anthropic, 50% on OpenAI). This means your context compression strategy should keep stable content at the top of the context and variable content (new tool results, latest user messages) at the bottom, maximizing cache hit rates and dramatically reducing per-call costs for long-running agents.

A 200,000-token context on Claude 3.5 Sonnet sounds unlimited until you calculate the economics of a 50-iteration agentic loop: 50 × $0.60 per call = $30 per task run, before even counting output tokens. For an enterprise agent processing hundreds of tasks per day, token budget management is not just a reliability concern — it's a fundamental cost control mechanism.
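The arithmetic above generalizes into a small pre-flight estimator. Prices are taken from the table earlier in this section; re-check them against current provider price sheets before relying on the numbers:

```python
# Input/output $ per 1M tokens, from the pricing table above (verify before use)
PRICES_PER_1M = {
    "gpt-4o":            (2.50, 10.00),
    "gpt-4o-mini":       (0.15,  0.60),
    "claude-3-5-sonnet": (3.00, 15.00),
}

def agent_loop_cost(model: str, iterations: int,
                    avg_input_tokens: int, avg_output_tokens: int = 0) -> float:
    """Estimated USD cost of an agent loop, before any caching discounts."""
    in_price, out_price = PRICES_PER_1M[model]
    input_cost  = iterations * avg_input_tokens  * in_price  / 1_000_000
    output_cost = iterations * avg_output_tokens * out_price / 1_000_000
    return round(input_cost + output_cost, 2)

# The 50-iteration Sonnet loop from above, input tokens only:
print(agent_loop_cost("claude-3-5-sonnet", 50, 200_000))  # 30.0
```

Running this at task-planning time, before the loop starts, lets you reject or re-route tasks whose projected cost exceeds your per-run budget.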

3. Token Budget Architecture

A robust token budget architecture pre-allocates the context window into distinct zones with hard limits, rather than letting content grow organically until it hits the model's limit. Think of it like memory segments: system, stack, heap — each serving a different purpose with different lifecycle characteristics.

# Context window allocation for a 128k-token model (GPT-4o example)
# Total budget: 128,000 tokens

SYSTEM_PROMPT_RESERVE    =  8_000   # 6.25% - agent persona, tools schema, rules
TASK_DESCRIPTION         =  4_000   # 3.1%  - current task context
CONVERSATION_HISTORY     = 40_000   # 31.2% - rolling window of past turns
TOOL_RESPONSES_BUFFER    = 50_000   # 39.1% - tool call inputs and outputs
GENERATION_RESERVE       = 16_000   # 12.5% - reserved for model output tokens
SAFETY_MARGIN            = 10_000   # 7.8%  - never fill above 118k total

# Watermark thresholds
WARN_WATERMARK           = 0.70    # 89,600 tokens - start compressing history
CRITICAL_WATERMARK       = 0.90    # 115,200 tokens - aggressive compression
HARD_LIMIT               = 0.95    # 121,600 tokens - pause and checkpoint

The most important design decision is the generation reserve. Many agent frameworks fill the entire context window with input and let the model generate into whatever space remains. This is dangerous: if a tool response unexpectedly consumes an extra 5,000 tokens, your model may only have 2,000 tokens left for output — not enough to produce a structured tool call response, causing malformed JSON and loop failures. Always reserve a fixed block for generation, and treat that block as unallocatable for input.
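A minimal sketch of that rule: pin max_tokens to the generation reserve and refuse to send oversized input, rather than letting output squeeze into whatever space remains. The function name and limits here are illustrative:

```python
GENERATION_RESERVE = 16_000  # matches the allocation table above

def plan_call(total_context: int, input_tokens: int) -> dict:
    """Pin max_tokens to the generation reserve; reject oversized input."""
    input_budget = total_context - GENERATION_RESERVE
    if input_tokens > input_budget:
        raise OverflowError(
            f"input of {input_tokens} tokens exceeds the {input_budget}-token "
            "input budget; compress the context before calling the model")
    # max_tokens is the reserve itself, never "whatever space is left"
    return {"max_tokens": GENERATION_RESERVE}
```

The point is that the failure surfaces in your orchestration layer, where you can compress and retry, instead of as a malformed tool call from a starved model.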

Token counting before sending is a non-negotiable prerequisite. Use the provider's tokenizer library: tiktoken for OpenAI models, anthropic-tokenizer (or Anthropic's token counting endpoint) for Claude, and google-generativeai's built-in counting for Gemini. Never estimate based on character count — tokenization is model-specific and can differ by 30-40% for code-heavy content.

import tiktoken

def count_tokens_openai(messages: list[dict], model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        # Per-message overhead: 4 tokens for role/content framing
        total += 4
        for key, value in msg.items():
            total += len(enc.encode(str(value)))
            if key == "name":
                total += 1  # name field adds 1 token
    total += 2  # reply priming tokens
    return total

def count_tokens_for_tool_response(tool_result: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    # Tool responses have ~6 tokens overhead for role/tool_call_id framing
    return len(enc.encode(tool_result)) + 6

4. Context Compression Strategies

When the context budget hits the warn watermark, compression must begin. The strategies differ by content type and retrieval requirements:

Conversation history summarization: When the rolling conversation window exceeds its allocation, replace the oldest N turns with a compact summary generated by a smaller, cheaper model (GPT-4o-mini or Claude Haiku). The summary should preserve key decisions made, facts established, and errors encountered — not a verbatim replay. A 10-turn conversation consuming 8,000 tokens typically summarizes to 400–600 tokens with no meaningful information loss for the agent's current task.

from openai import OpenAI

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_old_turns(
    turns: list[dict],
    n_turns_to_compress: int,
    summarizer_model: str = "gpt-4o-mini"
) -> str:
    old_turns = turns[:n_turns_to_compress]
    conversation_text = "\n".join(
        f"{t['role'].upper()}: {t['content']}" for t in old_turns
    )
    response = openai_client.chat.completions.create(
        model=summarizer_model,
        messages=[{
            "role": "system",
            "content": (
                "Summarize this agent conversation in under 300 words. "
                "Preserve: decisions made, files read, errors encountered, "
                "key facts established. Discard: verbose reasoning, "
                "redundant tool outputs, exploratory dead ends."
            )
        }, {
            "role": "user",
            "content": conversation_text
        }],
        max_tokens=400
    )
    return response.choices[0].message.content

Selective memory retrieval: Instead of maintaining a full conversation history, persist all tool results and intermediate findings in a vector store (Pinecone, pgvector, or Chroma). Before each LLM call, semantically retrieve only the top-K most relevant past results for the current step. This transforms the context from a linear accumulator into a relevance-gated working memory. An agent that has read 100 files no longer carries all 100 files in context — it carries only the 5–10 most relevant to the current question being answered.
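The pattern can be sketched without any vector database at all. Here a deliberately naive keyword-overlap score stands in for embedding similarity; a real deployment would persist to Pinecone, pgvector, or Chroma and rank by cosine similarity:

```python
class WorkingMemory:
    """Relevance-gated working memory: persist everything, retrieve top-K.

    Keyword overlap is a naive stand-in for embedding similarity;
    swap in a real vector store for production use."""

    def __init__(self, top_k: int = 5):
        self.records: list[tuple[str, str]] = []  # (description, tool_result)
        self.top_k = top_k

    def persist(self, description: str, tool_result: str) -> None:
        self.records.append((description, tool_result))

    def retrieve(self, query: str) -> list[str]:
        # Score each record by word overlap with the query, take the top K
        q = set(query.lower().split())
        scored = sorted(
            self.records,
            key=lambda rec: len(q & set(rec[0].lower().split())),
            reverse=True,
        )
        return [result for _, result in scored[: self.top_k]]
```

The agent's loop then calls `retrieve()` with the current step's question before each LLM call, injecting only what scored highest rather than the full accumulated history.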

Tool response compression: API tool responses are often the biggest single consumers of context tokens. A list_files tool returning a 2,000-file directory tree as JSON consumes 15,000+ tokens when only 10 files are relevant. Apply response compression at the tool boundary:

def compress_tool_response(
    tool_name: str,
    raw_response: str,
    max_tokens: int = 2000
) -> str:
    token_count = count_tokens_openai([{"role": "user", "content": raw_response}])
    if token_count <= max_tokens:
        return raw_response

    if tool_name == "read_file":
        # Truncate the middle: keep the head of the file and the last 20 lines,
        # with an explicit marker so the model knows content was omitted
        lines = raw_response.split("\n")
        half = len(lines) // 2
        return (
            "\n".join(lines[:half])
            + f"\n... [TRUNCATED: ~{token_count - max_tokens} tokens omitted] ...\n"
            + "\n".join(lines[-20:])  # always keep the last 20 lines
        )
    elif tool_name in ("search_results", "api_response"):
        # Keep first max_tokens worth of content
        enc = tiktoken.encoding_for_model("gpt-4o")
        tokens = enc.encode(raw_response)
        truncated = enc.decode(tokens[:max_tokens])
        return truncated + f"\n[TRUNCATED: showing {max_tokens}/{token_count} tokens]"
    return raw_response

Message windowing: When all else fails, implement a sliding window that keeps only the last N turns of the conversation plus the original task description. This is the most aggressive strategy and should be a last resort, as it loses intermediate reasoning steps. Use it only in the critical watermark zone (90%+ context usage) while simultaneously persisting the dropped turns to external storage for potential re-injection.

5. Implementation Patterns

A production-grade TokenBudgetManager sits between your agent's orchestration loop and the LLM API call. It intercepts every call, measures the current context size, applies the appropriate compression strategy, and emits metrics for observability:

from dataclasses import dataclass, field
from enum import Enum
import tiktoken

class BudgetStatus(Enum):
    HEALTHY   = "healthy"    # < 70% used
    WARNING   = "warning"    # 70-90% used — begin compression
    CRITICAL  = "critical"   # 90-95% used — aggressive compression
    EXHAUSTED = "exhausted"  # >= 95% used — must checkpoint/pause

@dataclass
class TokenBudget:
    model: str
    total_context: int
    system_reserve: int = 8_000
    generation_reserve: int = 16_000
    safety_margin: int = 10_000

    @property
    def usable_tokens(self) -> int:
        return self.total_context - self.system_reserve \
               - self.generation_reserve - self.safety_margin

class TokenBudgetManager:
    def __init__(self, budget: TokenBudget):
        self.budget = budget
        self.encoder = tiktoken.encoding_for_model(budget.model)
        self._compression_callbacks = []

    def count(self, messages: list[dict]) -> int:
        total = 2  # priming
        for msg in messages:
            total += 4  # per-message overhead
            for value in msg.values():
                total += len(self.encoder.encode(str(value)))
        return total

    def status(self, messages: list[dict]) -> BudgetStatus:
        used = self.count(messages)
        ratio = used / self.budget.usable_tokens
        if ratio < 0.70:  return BudgetStatus.HEALTHY
        if ratio < 0.90:  return BudgetStatus.WARNING
        if ratio < 0.95:  return BudgetStatus.CRITICAL
        return BudgetStatus.EXHAUSTED

    def prepare_messages(
        self,
        messages: list[dict],
        vector_store=None
    ) -> list[dict]:
        status = self.status(messages)

        if status == BudgetStatus.WARNING:
            messages = self._summarize_history(messages, turns_to_compress=5)
        elif status == BudgetStatus.CRITICAL:
            messages = self._summarize_history(messages, turns_to_compress=10)
            messages = self._compress_tool_responses(messages)
        elif status == BudgetStatus.EXHAUSTED:
            messages = self._checkpoint_and_window(messages, keep_last=8)

        return messages

    def _summarize_history(self, messages, turns_to_compress):
        # Compress oldest assistant+user turns into a summary message
        non_system = [m for m in messages if m["role"] != "system"]
        to_compress = non_system[:turns_to_compress * 2]
        summary = summarize_old_turns(
            to_compress, len(to_compress), "gpt-4o-mini")
        system_msgs = [m for m in messages if m["role"] == "system"]
        remaining   = non_system[turns_to_compress * 2:]
        summary_msg = {
            "role": "user",
            "content": f"[CONVERSATION SUMMARY - earlier context compressed]\n{summary}"
        }
        return system_msgs + [summary_msg] + remaining

    def _compress_tool_responses(self, messages):
        compressed = []
        for msg in messages:
            if msg.get("role") == "tool" and len(msg.get("content", "")) > 3000:
                msg = {**msg, "content": msg["content"][:3000] + "\n[TRUNCATED]"}
            compressed.append(msg)
        return compressed

    def _checkpoint_and_window(self, messages, keep_last):
        # Persist full history externally, keep only last N turns + system
        system_msgs  = [m for m in messages if m["role"] == "system"]
        recent_turns = [m for m in messages if m["role"] != "system"][-keep_last:]
        return system_msgs + recent_turns

The checkpoint and resume pattern is essential for tasks that genuinely exceed any context window. Before entering the EXHAUSTED zone, serialize the full agent state — task description, completed steps, pending steps, all tool results — to durable storage (S3, Redis, or a task database). Resume by reconstructing a fresh context with only the essential state: system prompt, task definition, a structured summary of completed work, and the next pending action. This pattern enables agents to work on arbitrarily long tasks without being bounded by the context window at all.
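A sketch of the checkpoint shape under those assumptions; the field names are illustrative, and `resume_context` rebuilds the minimal fresh context described above:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class AgentCheckpoint:
    task: str
    completed_steps: list[str] = field(default_factory=list)
    pending_steps: list[str] = field(default_factory=list)
    findings_summary: str = ""

    def dump(self) -> str:
        return json.dumps(asdict(self))  # persist this string to S3/Redis/DB

    @classmethod
    def load(cls, raw: str) -> "AgentCheckpoint":
        return cls(**json.loads(raw))

def resume_context(system_prompt: str, cp: AgentCheckpoint) -> list[dict]:
    """Rebuild a fresh, minimal context from a checkpoint."""
    next_step = cp.pending_steps[0] if cp.pending_steps else "finalize and report"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": (
            f"Task: {cp.task}\n"
            f"Completed so far: {'; '.join(cp.completed_steps) or 'nothing yet'}\n"
            f"Findings: {cp.findings_summary}\n"
            f"Next step: {next_step}")},
    ]
```

The resumed context carries only a structured digest of prior work, so a task that spanned ten full context windows restarts in a few thousand tokens.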

6. Failure Scenarios and Recovery

Truncated reasoning chains: The most common failure mode in near-exhausted contexts is that the model's chain-of-thought reasoning gets compressed by the context limit. The model starts generating conclusions without the intermediate steps, producing confident but ungrounded outputs. You can detect this by monitoring the ratio of reasoning tokens to answer tokens in the model's output — a sharp drop in reasoning-to-answer ratio is a reliable signal of context saturation.

Hallucinated tool calls: When the model loses track of which tools it has already called and what they returned, it begins generating tool calls for information it already has (or thinks it has, from hallucinated memory). Implement an idempotent tool call tracker that deduplicates tool calls by signature and returns cached results, preventing the agent from issuing the same expensive API call five times in one loop.
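A minimal sketch of such a tracker; the signature scheme (tool name plus JSON-canonicalized arguments) is one reasonable choice, not a standard:

```python
import json

class ToolCallCache:
    """Deduplicate tool calls by (name, canonicalized-args) signature, so a
    saturated model re-requesting the same data gets the cached result
    instead of triggering the expensive call again."""

    def __init__(self):
        self._cache: dict[str, str] = {}

    def signature(self, name: str, args: dict) -> str:
        return f"{name}:{json.dumps(args, sort_keys=True)}"

    def execute(self, name: str, args: dict, runner) -> tuple[str, bool]:
        """Returns (result, was_cache_hit); `runner` performs the real call."""
        sig = self.signature(name, args)
        if sig in self._cache:
            return self._cache[sig], True  # cache hit: no real call made
        result = runner(name, args)
        self._cache[sig] = result
        return result, False
```

A spike in the cache-hit rate is itself a useful saturation signal: a healthy agent rarely re-requests data it already has.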

# Detecting context saturation via output quality metrics
def detect_context_saturation(response: str, tool_calls: list) -> bool:
    # Signal 1: very short reasoning before tool calls (model skipping CoT).
    # count_tokens_before_first_tool_call is a helper you supply: tokenize
    # the text that precedes the first tool call in the response.
    reasoning_tokens = count_tokens_before_first_tool_call(response, tool_calls)
    if reasoning_tokens < 50 and len(tool_calls) > 0:
        return True

    # Signal 2: repeated tool calls for same resource
    call_signatures = [f"{c.name}:{sorted(c.arguments.items())}" for c in tool_calls]
    if len(call_signatures) != len(set(call_signatures)):
        return True

    # Signal 3: model references information not in current context
    # (requires external fact-checking, simplified here)
    return False

Graceful degradation patterns: Rather than failing hard when the context is exhausted, implement tiered degradation. At WARNING level, compress history but continue normally. At CRITICAL level, switch to a model with a larger context window for the current call only (e.g., promote from GPT-4o-mini to Gemini 2.0 Flash with 1M context). At EXHAUSTED level, pause the agent, checkpoint state, notify the orchestrator, and either resume in a fresh context or escalate to a human operator.
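The tiers reduce to a small decision table. The status strings mirror the BudgetStatus levels from the implementation section, and the model names are illustrative:

```python
def degrade(status: str, current_model: str) -> dict:
    """Map a budget status to a degradation action (models illustrative)."""
    if status == "healthy":
        return {"action": "continue", "model": current_model}
    if status == "warning":
        return {"action": "compress_history", "model": current_model}
    if status == "critical":
        # Promote this one call to a large-context model
        return {"action": "compress_aggressively", "model": "gemini-2.0-flash"}
    # exhausted: stop calling the model entirely
    return {"action": "checkpoint_and_pause", "model": None}
```

Keeping the mapping in one pure function makes the degradation policy testable and auditable, independent of the orchestration loop that consumes it.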

7. Cost Optimization

Smart model routing within an agent loop can reduce costs by 60-80% without sacrificing quality. Not every step in an agent loop requires the frontier model. Classify steps by complexity: simple retrieval steps (reading files, listing directories, pattern matching) can run on GPT-4o-mini at 1/16th the cost. Complex reasoning steps (synthesizing findings, writing code, making architecture decisions) warrant the full model. Implement a step classifier that routes based on task type:

STEP_TYPE_MODEL_MAP = {
    "file_read":          "gpt-4o-mini",   # $0.15/1M in
    "search":             "gpt-4o-mini",   # $0.15/1M in
    "data_extraction":    "gpt-4o-mini",   # $0.15/1M in
    "code_generation":    "gpt-4o",        # $2.50/1M in
    "architecture_review":"claude-3-5-sonnet",  # $3.00/1M in
    "security_analysis":  "gpt-4o",        # $2.50/1M in
    "final_synthesis":    "gpt-4o",        # $2.50/1M in
}

def route_model_for_step(step_type: str, context_tokens: int) -> str:
    # Escalate to large-context model if needed, regardless of step type
    if context_tokens > 100_000:
        return "gemini-2.0-flash"  # 1M context at $0.075/1M
    return STEP_TYPE_MODEL_MAP.get(step_type, "gpt-4o-mini")

Prompt caching is the highest-leverage cost optimization for agents with stable system prompts. Anthropic's prompt caching (available on Claude 3.5 Sonnet and Haiku) serves cached prefix reads at a 90% discount within a 5-minute window, subject to a minimum cacheable prefix length (1,024 tokens on Sonnet, 2,048 on Haiku). For an agent with a 5,000-token system prompt running 30 steps, caching alone can cut system-prompt input costs by over 80% (one cache write at a 25% premium, then 29 reads at 10% of the base price). Mark the system prompt's content block with cache_control: {"type": "ephemeral"} and structure your messages so the system prompt and task description always appear before variable content.
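A sketch of the request shape with caching enabled, assembled as a plain dict so it can be inspected without calling the API. Note that cache_control sits on the system content block, not in an HTTP header; the model name and max_tokens here are illustrative:

```python
def build_cached_request(system_prompt: str, task: str,
                         history: list[dict]) -> dict:
    """Anthropic Messages API payload with the stable prefix marked cacheable."""
    return {
        "model": "claude-3-5-sonnet-latest",  # illustrative model alias
        "max_tokens": 4096,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Stable prefix: cached for ~5 minutes, reads at 10% of base price
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Variable content (history, new task) stays after the cached prefix
        "messages": history + [{"role": "user", "content": task}],
    }
```

Check Anthropic's current documentation for minimum cacheable lengths and exact read/write pricing before depending on the discount in cost projections.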

Measuring cost per agent run should be treated as a first-class metric alongside latency and success rate. Instrument every LLM call with token counts, model used, cache hit/miss status, and step type. Aggregate to per-task cost, and set cost budgets (hard limits) per task type. An agent that achieves 95% task success at $0.08 per run is a product you can scale. The same agent at $4.50 per run is a prototype.
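A sketch of such instrumentation; the 90% cache-read discount and the per-1M prices are assumptions to verify against your provider's pricing:

```python
from dataclasses import dataclass, field

@dataclass
class RunCostTracker:
    """Per-task cost meter with a hard budget; prices are $ per 1M tokens."""
    in_per_1m: float
    out_per_1m: float
    budget_usd: float = 1.00  # hard cost limit for this task run
    spent: float = 0.0
    calls: list = field(default_factory=list)

    def record(self, step_type: str, input_tokens: int, output_tokens: int,
               cache_hit: bool = False) -> None:
        # Assumed 90% discount on cached input reads; verify per provider
        in_rate = self.in_per_1m * (0.1 if cache_hit else 1.0)
        cost = (input_tokens * in_rate
                + output_tokens * self.out_per_1m) / 1_000_000
        self.spent += cost
        self.calls.append((step_type, cost))
        if self.spent > self.budget_usd:
            raise RuntimeError(f"cost budget exceeded: ${self.spent:.2f}")
```

Surfacing `spent` and `calls` per run alongside latency and success rate turns cost regressions into alerts instead of end-of-month surprises.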

"A context window is not infinite RAM. It is a working memory with a hard limit, and the model's coherence degrades long before you hit that limit. The engineering discipline is to treat every token as a resource to be budgeted, allocated, and reclaimed — not as free space to be filled until something breaks."
— Reflecting production patterns from enterprise LLM deployments

8. Key Takeaways

  - Treat the context window like heap memory: pre-allocate zones (system prompt, task, history, tool buffer, generation reserve) with hard limits.
  - Count tokens with the provider's tokenizer before every call; never estimate from character counts.
  - Compress at watermarks (70% warn, 90% critical, 95% hard limit) instead of reacting to context_length_exceeded errors.
  - Always reserve a fixed generation block so structured output never gets squeezed into malformed JSON.
  - Checkpoint/resume frees tasks from any single context window; model routing and prompt caching turn token discipline into direct cost savings.

9. Conclusion

Token budget management is the infrastructure layer that separates demo-grade AI agents from production-grade ones. The code review agent that works on 10 files and fails silently on 200 is not a buggy AI — it is an unmanaged resource consumer hitting its natural limit. By implementing budget zones, watermark-triggered compression, checkpoint/resume patterns, and smart model routing, you make the agent's behavior predictable and bounded regardless of task size.

The mental model shift is critical: stop thinking of the context window as available space to fill, and start treating every token as a resource to allocate, compress, or reclaim. An agent architecture built on this foundation scales from 10-file PRs to million-token codebases with consistent quality, predictable costs, and observable behavior — the three properties every production system demands.


Last updated: March 2026 — Written by Md Sanwar Hossain