Agentic AI Design Patterns: ReAct, Chain of Thought & Self-Reflection in Production (2026)

The shift from prompt engineering to agent architecture engineering is one of the most significant transitions in applied AI. While a clever prompt can solve a single-turn problem, production agentic systems require systematic patterns that determine how an agent reasons, acts, reflects, and collaborates. This guide covers six battle-tested patterns with pseudocode, tradeoff analysis, and hard-won production lessons.

Why Patterns Matter for AI Agents

A well-prompted LLM is not an agent. An agent has a control loop: it perceives state, selects and executes actions (tool calls, sub-queries, code execution), observes results, updates its internal reasoning, and iterates until the goal is achieved or a stopping condition is met. This control loop introduces failure modes that simple prompting never encounters: infinite loops, cost explosion from recursive calls, contradictory observations from multiple tool invocations, and context window overflow from accumulating traces.

Without deliberate architectural patterns, production agents exhibit characteristic failure modes. An agent searching the web for a research question may loop infinitely between search, read, and re-search without ever producing an answer because it lacks a convergence criterion. An agent writing code may regenerate the same buggy solution 20 times because it has no mechanism to evaluate its own output against a quality standard. These are not model failures — they are architectural failures. Patterns provide the scaffolding that prevents them.

The choice of pattern is not cosmetic. ReAct costs 3–10 LLM calls per task; Tree of Thoughts costs 15–50 calls; Multi-Agent systems can cost hundreds. At GPT-4-class pricing, pattern choice directly determines whether a feature is economically viable at scale. Understanding when to use each pattern — and when to use a simpler fallback — is the core skill of agentic AI engineering.

Pattern 1: ReAct (Reason + Act)

ReAct, introduced by Yao et al. (2022), interleaves reasoning traces with tool-use actions in a single generation loop. The agent alternates between a Thought step (free-form reasoning about what to do next) and an Action step (calling a specific tool with specific arguments), then receives an Observation (the tool's output). This thought-action-observation cycle repeats until the agent produces a final answer. The key insight is that generating reasoning traces alongside actions dramatically reduces hallucination — the model grounds each reasoning step in observed evidence.

ReAct is the right pattern for tasks requiring multi-step information gathering with tool use: customer support agents querying CRM and knowledge base, coding agents searching documentation and executing code, research agents combining web search with document retrieval. It excels when the required tools are well-defined and the task has a clear terminal condition.

# ReAct agent loop (pseudocode)
def react_agent(user_query: str, tools: dict, max_steps: int = 10) -> str:
    context = [{"role": "user", "content": user_query}]
    
    for step in range(max_steps):
        # Generate next Thought + Action
        response = llm.generate(
            messages=context,
            stop=["Observation:"],   # stop before hallucinating observations
            temperature=0.0
        )
        
        thought_action = response.text
        context.append({"role": "assistant", "content": thought_action})
        
        # Parse action from response
        action = parse_action(thought_action)  # returns an Action with .tool and .params (or .value for final_answer)
        
        if action.tool == "final_answer":
            return action.value
        
        # Execute tool and observe result
        observation = tools[action.tool](**action.params)
        context.append({"role": "user", "content": f"Observation: {observation}"})
    
    # Max steps reached — force final answer
    final = llm.generate(messages=context + [{"role": "user", "content": "Provide your final answer now."}])
    return final.text

# Prompt structure for ReAct:
# "You are a research assistant. For each step, write:
#  Thought: [your reasoning]
#  Action: [tool_name(param1=val1, param2=val2)]
# Wait for Observation before the next Thought."

ReAct's primary limitation is loop entrapment: if the agent receives an ambiguous or contradictory observation, it may cycle through the same thought-action sequence indefinitely. Mitigate this with a max_steps hard limit, step-level deduplication (detect repeated action sequences and break), and a forced-answer prompt after limit exhaustion. Also note that ReAct is sensitive to tool description quality — a poorly documented tool leads to malformed action calls and wasted LLM calls on error recovery.
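The step-level deduplication mentioned above can be sketched as a small helper. The names (`is_looping`, the serialized `(tool, params)` keys) are illustrative, not from any particular framework:

```python
# Step-level deduplication for a ReAct loop: treat a repeated
# (tool, params) pair within a short window as loop entrapment and
# break out to force a final answer. Names here are illustrative.

def is_looping(action_history: list, new_action: tuple, window: int = 3) -> bool:
    """True if new_action already appeared in the last `window` steps."""
    return new_action in action_history[-window:]

# Inside the agent loop, track a hashable form of each executed action:
history = []
for tool, params in [("search", "llm agents"), ("read", "doc1"), ("search", "llm agents")]:
    action_key = (tool, params)
    if is_looping(history, action_key):
        break  # repeated action detected: stop and force a final answer
    history.append(action_key)
```

A window of 3 catches tight cycles without penalizing legitimate re-use of a tool with different arguments, since the key includes the parameters.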

Pattern 2: Chain of Thought (CoT)

Chain of Thought prompting, introduced by Wei et al. (2022), dramatically improves LLM performance on multi-step reasoning problems by instructing the model to produce intermediate reasoning steps before the final answer. The mechanism is emergent: generating intermediate steps forces the model to allocate computation to each reasoning step rather than collapsing to a direct answer, improving accuracy on arithmetic, symbolic reasoning, and logical deduction tasks.

Zero-shot CoT is as simple as appending "Let's think step by step." to the prompt. This single phrase reliably elicits chain-of-thought reasoning from GPT-4 class models. Few-shot CoT provides 3–5 example problems with worked solutions in the prompt, calibrating the reasoning style and output format for your specific domain. Self-consistency CoT generates N completions at temperature >0 and takes a majority vote — the most frequent final answer wins. This substantially improves accuracy on problems where the model occasionally slips: a single unlucky reasoning path no longer decides the answer, because the model must now err on a majority of the N paths.

# Self-consistency CoT (majority vote across N completions)
def self_consistency_cot(problem: str, n_samples: int = 7, temperature: float = 0.7) -> str:
    prompt = f"""Solve the following problem step by step, then state your final answer clearly.
Problem: {problem}
Let's think step by step:"""
    
    answers = []
    for _ in range(n_samples):
        response = llm.generate(prompt, temperature=temperature)
        final_answer = extract_final_answer(response)  # parse "Answer: X" from response
        answers.append(final_answer)
    
    # Majority vote
    from collections import Counter
    most_common = Counter(answers).most_common(1)[0][0]
    return most_common

# Temperature guidance:
# temperature=0.0  → deterministic, use for production single-path CoT
# temperature=0.7  → diverse paths, use for self-consistency CoT
# temperature=1.0  → high diversity, use for creative brainstorming with ToT

CoT is most effective for problems that decompose naturally into sequential logical steps: mathematical word problems, code debugging, legal reasoning, medical diagnosis. It underperforms on problems requiring backtracking or exploring mutually exclusive hypotheses — these are better served by Tree of Thoughts. Latency trade-off: CoT generates more tokens (reasoning trace + answer vs. direct answer), increasing cost by 2–4× but improving accuracy by 20–40% on hard reasoning benchmarks.

Pattern 3: Tree of Thoughts (ToT)

Tree of Thoughts, introduced by Yao et al. (2023), generalizes CoT by allowing the model to explore multiple reasoning paths simultaneously, evaluate intermediate states, and backtrack to more promising branches. Where CoT commits to a single linear reasoning trace, ToT maintains a tree of partial solutions, pruning unpromising branches and expanding promising ones — analogous to chess engines evaluating move trees.

ToT outperforms CoT on problems requiring hypothesis exploration, creative generation with quality evaluation, or planning with backtracking: code generation where multiple approaches must be evaluated, mathematical proof construction, game-playing agents, and complex multi-constraint optimization. The cost is significant: a fully expanded tree with branching factor 3 and depth 4 contains 3^4 = 81 leaf states, and even with beam-search pruning a typical run consumes dozens of generation and evaluation calls. This makes ToT economically viable only for high-value, low-frequency tasks.

# Simplified Tree of Thoughts (breadth-first with beam search)
def tree_of_thoughts(problem: str, beam_width: int = 3, max_depth: int = 4) -> str:
    # Initialize with N independent partial solutions
    thoughts = [generate_thought(problem) for _ in range(beam_width)]
    
    for depth in range(max_depth):
        # Evaluate all current thoughts
        scored = [(t, evaluate_thought(problem, t)) for t in thoughts]
        
        # Check for terminal solutions
        solutions = [(t, s) for t, s in scored if is_solution(t)]
        if solutions:
            return max(solutions, key=lambda x: x[1])[0]
        
        # Keep top beam_width thoughts and expand each
        top_thoughts = sorted(scored, key=lambda x: x[1], reverse=True)[:beam_width]
        next_thoughts = []
        for thought, score in top_thoughts:
            # Generate 2 continuations per thought
            continuations = [continue_thought(problem, thought) for _ in range(2)]
            next_thoughts.extend(continuations)
        
        thoughts = next_thoughts
    
    # Return best final thought if no clean solution found
    return max(thoughts, key=lambda t: evaluate_thought(problem, t))

# Evaluation prompt example:
# "Given this partial solution, rate its quality from 1-10 and explain why.
#  Problem: {problem}
#  Partial solution: {thought}
#  Rating (1-10):"
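The `evaluate_thought` helper the loop relies on is never defined above. A minimal sketch, assuming the rating prompt shown in the comments, parses a numeric score out of the model's free-form evaluation text (`parse_rating` and the exact response shape are assumptions):

```python
import re

# Sketch of the evaluate_thought helper used in the loop above. The LLM
# call itself is assumed (commented out); shown here is the deterministic
# part: parsing a 1-10 rating out of the model's evaluation text.

def parse_rating(evaluation_text: str) -> float:
    """Extract the first number after 'Rating ...:'; clamp to 1-10, default 0."""
    match = re.search(r"Rating[^:]*:\s*(\d+)", evaluation_text, re.IGNORECASE)
    if match:
        return min(10.0, max(1.0, float(match.group(1))))
    return 0.0  # unparseable evaluations rank last

# def evaluate_thought(problem, thought):
#     text = llm.generate(EVAL_PROMPT.format(problem=problem, thought=thought))
#     return parse_rating(text)

score = parse_rating("The approach is sound but incomplete.\nRating (1-10): 7")
```

Defaulting unparseable evaluations to 0 means malformed model output quietly prunes that branch rather than crashing the search, which is usually the right failure mode for ToT.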

Pattern 4: Reflexion (Self-Reflection)

Reflexion, introduced by Shinn et al. (2023), adds a metacognitive layer to agent loops: after each attempt, a dedicated reflection step evaluates the output against success criteria, generates a structured critique, and uses that critique to guide the next attempt. This mimics the human learning pattern of "try → assess → improve" and is particularly effective for tasks with verifiable correctness criteria: code generation (pass/fail tests), factual Q&A (verifiable against ground truth), and constrained writing (word count, format requirements).

The critical engineering concern in Reflexion is preventing infinite reflection loops. A poorly designed Reflexion agent that never achieves its success criterion will keep reflecting and regenerating indefinitely, accumulating massive token costs. Always implement a hard max reflection depth (typically 3–5) and a fallback policy (return best attempt so far) when the limit is reached. Also be aware that reflection quality degrades beyond depth 3: the model begins generating tautological critiques ("the solution is wrong because it produces wrong output") rather than actionable improvements.

# Reflexion loop pseudocode
def reflexion_agent(task: str, success_fn: callable, max_reflections: int = 3) -> str:
    memory = []       # accumulate critiques across reflection rounds
    best_attempt = None
    best_score = 0.0
    
    for round_num in range(max_reflections + 1):  # +1 for initial attempt
        # Generate attempt, incorporating previous reflections
        reflection_context = "\n".join(memory) if memory else "No prior attempts."
        attempt = llm.generate(f"""
Task: {task}
Prior attempts and reflections:
{reflection_context}
Generate your {'improved ' if memory else ''}solution:""")
        
        # Evaluate against verifiable success criteria
        score, feedback = success_fn(attempt)
        
        if score > best_score:
            best_score = score
            best_attempt = attempt
        
        if score >= 0.95:   # success threshold
            return attempt
        
        if round_num < max_reflections:
            # Generate structured reflection for next round
            reflection = llm.generate(f"""
Attempt: {attempt}
Evaluation feedback: {feedback}
Score: {score:.2f}

Write a concise reflection identifying:
1. What specifically went wrong
2. What to do differently next time
Keep it under 100 words.""")
            memory.append(f"Round {round_num+1} reflection: {reflection}")
    
    return best_attempt  # return best found within budget
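For the code-generation case, the `success_fn` hook can be sketched as a small test runner. The target function name `solve`, the scoring scheme, and the use of bare `exec()` rather than a proper sandbox are all illustrative:

```python
# Sketch of a success_fn for code-generation tasks: score an attempt by
# executing it against a small test suite, returning (score, feedback)
# as the Reflexion loop above expects. The function name `solve` and
# the use of bare exec() instead of a real sandbox are illustrative.

def make_code_success_fn(test_cases: list):
    """test_cases: list of (args_tuple, expected) for a function named solve."""
    def success_fn(attempt_code: str):
        namespace = {}
        try:
            exec(attempt_code, namespace)  # production code needs a sandbox
            solve = namespace["solve"]
        except Exception as e:
            return 0.0, f"Code failed to load: {e}"
        passed, failures = 0, []
        for args, expected in test_cases:
            try:
                if solve(*args) == expected:
                    passed += 1
                else:
                    failures.append(f"solve{args} != {expected}")
            except Exception as e:
                failures.append(f"solve{args} raised {e}")
        score = passed / len(test_cases)
        feedback = "All tests passed." if not failures else "; ".join(failures)
        return score, feedback
    return success_fn

check = make_code_success_fn([((2, 3), 5), ((0, 0), 0)])
score, feedback = check("def solve(a, b):\n    return a + b")
```

The textual failure messages matter as much as the score: they become the `feedback` that the reflection prompt turns into an actionable critique.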

Pattern 5: Plan and Execute

Plan-and-Execute separates the cognitive work of planning from the mechanical work of execution into distinct agent roles. A Planner agent (typically a high-capability model like GPT-4o) receives the high-level goal and decomposes it into a structured list of subtasks with dependencies. An Executor agent (can be a cheaper, specialized model) carries out individual subtasks in sequence or in parallel where dependencies allow. A Replanner agent re-evaluates remaining steps after each execution result and adjusts the plan dynamically if reality diverges from the plan.

This pattern excels at long-horizon tasks with clear decomposability: software feature development (plan → write tests → implement → verify → document), data analysis pipelines (plan → fetch data → clean → analyze → visualize → report), and multi-tool automation workflows. The explicit plan provides observability — you can log the plan, show it to users for approval, and audit what the agent intended to do. Dynamic replanning handles the inevitable divergence between plan and reality in complex multi-step tasks.

# Plan-and-Execute with dynamic replanning
def plan_execute_agent(goal: str, tools: dict) -> str:
    # Phase 1: Generate structured plan
    plan = planner_llm.generate(f"""
Goal: {goal}
Available tools: {list(tools.keys())}

Create a numbered plan of concrete steps. For each step specify:
- Action: what to do
- Tool: which tool to use (or "llm" for reasoning steps)
- Depends on: step numbers this step requires first

Output as JSON array.""")
    
    steps = parse_plan(plan)   # list of {action, tool, depends_on}
    results = {}
    
    # Phase 2: Execute steps respecting dependencies
    for i, step in enumerate(steps):
        # Gather results from already-completed dependency steps
        dependency_context = {dep: results[dep] for dep in step["depends_on"]}
        
        # Execute step
        result = executor_agent(step["action"], step["tool"], dependency_context, tools)
        results[i] = result
        
        # Phase 3: Dynamic replan after each step
        remaining_steps = steps[i+1:]
        if remaining_steps:
            revised = replanner_llm.generate(f"""
Original goal: {goal}
Completed steps: {list(results.values())}
Latest result: {result}
Remaining steps: {remaining_steps}

Should the plan be revised? If yes, output the updated remaining steps as JSON.
If no, output "NO_CHANGE".""")
            if revised != "NO_CHANGE":
                steps[i+1:] = parse_plan(revised)
    
    return synthesize_results(goal, results)
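The `parse_plan` helper assumed above can be sketched as a tolerant JSON extractor, since planner output often arrives with surrounding prose or markdown fences. Field names follow the plan prompt; the extraction regex is an assumption:

```python
import json
import re

# Sketch of the parse_plan helper assumed above: extract a JSON array of
# {action, tool, depends_on} steps from the planner's raw text, which may
# arrive with surrounding prose. The extraction regex is an assumption.

def parse_plan(raw: str) -> list:
    match = re.search(r"\[.*\]", raw, re.DOTALL)  # widest bracketed span
    if not match:
        raise ValueError("No JSON array found in planner output")
    steps = json.loads(match.group(0))
    for step in steps:
        step.setdefault("depends_on", [])  # tolerate steps with no dependencies
    return steps

plan_text = ('Here is the plan:\n'
             '[{"action": "fetch sales data", "tool": "sql"},\n'
             ' {"action": "plot revenue by month", "tool": "python", "depends_on": [0]}]')
steps = parse_plan(plan_text)
```

Normalizing `depends_on` to an empty list up front keeps the executor's dependency lookup simple and avoids per-step key checks.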

Pattern 6: Multi-Agent Orchestration

Multi-Agent Orchestration deploys a network of specialized sub-agents coordinated by a supervisor. The supervisor receives the user goal, routes subtasks to specialist agents based on capability matching, aggregates results, and determines when the overall goal is achieved. Specialist agents are purpose-built: a CodeAgent with code execution tools, a ResearchAgent with web search and document retrieval, a DataAgent with SQL and visualization tools, and a WriterAgent for synthesis and formatting.

The power of multi-agent orchestration is specialization and parallelism. Independent subtasks execute simultaneously across multiple specialist agents — a research task that would take 8 sequential ReAct steps in a single agent might complete in 3 steps with 4 parallel agents working different subtasks. Shared memory (typically a structured context store like a Redis hash or in-memory dict) allows agents to read each other's results without re-doing completed work.

# LangGraph-style multi-agent graph definition (pseudocode)
from typing import TypedDict
from langgraph.graph import StateGraph, END

# Define shared state schema
class AgentState(TypedDict):
    goal: str
    research_results: list
    code_output: str
    analysis: str
    final_report: str
    next_agent: str   # routing signal

# Define specialist agents
def supervisor_node(state: AgentState) -> AgentState:
    routing_decision = supervisor_llm.generate(f"""
    Goal: {state['goal']}
    Research done: {bool(state['research_results'])}
    Code executed: {bool(state['code_output'])}
    Analysis done: {bool(state['analysis'])}
    
    Which agent should act next?
    Options: research_agent | code_agent | analysis_agent | writer_agent | END
    """)
    state['next_agent'] = parse_routing(routing_decision)
    return state

def research_agent_node(state: AgentState) -> AgentState:
    results = research_agent.run(state['goal'])
    state['research_results'] = results
    state['next_agent'] = 'supervisor'
    return state

def code_agent_node(state: AgentState) -> AgentState:
    output = code_agent.run(state['goal'], context=state['research_results'])
    state['code_output'] = output
    state['next_agent'] = 'supervisor'
    return state

# Build graph
graph = StateGraph(AgentState)
graph.add_node("supervisor", supervisor_node)
graph.add_node("research_agent", research_agent_node)
graph.add_node("code_agent", code_agent_node)
graph.add_node("analysis_agent", analysis_agent_node)
graph.add_node("writer_agent", writer_agent_node)

# Add conditional routing edges
graph.add_conditional_edges("supervisor", lambda s: s['next_agent'], {
    "research_agent": "research_agent",
    "code_agent": "code_agent",
    "analysis_agent": "analysis_agent",
    "writer_agent": "writer_agent",
    "END": END
})
# All specialist agents route back to supervisor
for agent in ["research_agent", "code_agent", "analysis_agent", "writer_agent"]:
    graph.add_edge(agent, "supervisor")

app = graph.compile()
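The same supervisor routing can be sketched without the framework as a plain state machine, which makes the control flow explicit. The agent bodies here are stubs standing in for real LLM and tool calls:

```python
# The supervisor/specialist routing above, sketched without LangGraph as
# a plain state machine to expose the control flow. The agent bodies are
# stubs standing in for real LLM and tool calls.

def supervisor(state: dict) -> str:
    # Route to the first incomplete stage; finish when everything is done
    if not state["research_results"]:
        return "research_agent"
    if not state["code_output"]:
        return "code_agent"
    return "END"

NODES = {
    "research_agent": lambda s: {**s, "research_results": ["doc1", "doc2"]},
    "code_agent": lambda s: {**s, "code_output": "print('report')"},
}

def run(goal: str) -> dict:
    state = {"goal": goal, "research_results": [], "code_output": ""}
    while True:
        next_agent = supervisor(state)
        if next_agent == "END":
            return state
        state = NODES[next_agent](state)  # specialist updates shared state

final = run("summarize Q3 sales")
```

The graph framework adds persistence, streaming, and conditional-edge plumbing on top of this loop, but the underlying mechanic — specialists mutate shared state and always route back through the supervisor — is the same.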

Choosing the Right Pattern for Your Use Case

Pattern selection must balance task complexity, acceptable latency, cost budget, and the consequences of failure. A customer-facing chatbot that answers product questions in real-time cannot use Tree of Thoughts (50+ LLM calls per query) — use zero-shot CoT with a single tool call. An overnight analysis job with a $10 cost budget and no latency constraint can afford Reflexion with full multi-agent orchestration.

Use Case                               | Recommended Pattern       | LLM Calls
Simple Q&A / single-step reasoning     | Zero-shot CoT             | 1
Multi-tool information gathering       | ReAct                     | 3–10
High-stakes reasoning (medical, legal) | Self-consistency CoT      | 5–10
Creative planning / complex problem    | Tree of Thoughts          | 15–100
Code generation with tests             | Reflexion                 | 3–20
Long-horizon task decomposition        | Plan and Execute          | 10–50
Complex workflow with specialists      | Multi-Agent Orchestration | 50–500

Production Pitfalls & How to Avoid Them

Cost explosion from unbounded recursive patterns. Reflexion without a depth limit, ToT with a large branching factor, or multi-agent loops without terminal conditions can consume hundreds of dollars per task. Always implement hard call budgets at the orchestration layer, not just as prompting guidelines. Track cumulative token spend per user session and abort with a graceful degradation response when budget is exceeded.
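A hard call budget can be sketched as a small wrapper that every LLM call is charged against (class and method names are illustrative):

```python
# Sketch of a hard call budget enforced at the orchestration layer:
# every LLM call is charged against the budget, and the loop aborts into
# a degradation path once the limit is hit. Names are illustrative.

class BudgetExceeded(Exception):
    pass

class CallBudget:
    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.used = 0

    def charge(self, n: int = 1) -> None:
        self.used += n
        if self.used > self.max_calls:
            raise BudgetExceeded(f"{self.used} calls exceed budget of {self.max_calls}")

budget = CallBudget(max_calls=3)
completed_steps = 0
try:
    for _ in range(10):   # an agent loop that would otherwise run 10 steps
        budget.charge()   # charge before each (hypothetical) LLM call
        completed_steps += 1
except BudgetExceeded:
    pass                  # return a graceful degradation response here
```

Because the budget raises rather than returning a flag, it cannot be silently ignored by a deeply nested pattern — exactly the property you want when Reflexion or ToT is composed inside a larger loop.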

Latency budgets and user experience. A ReAct agent with 8 tool calls at 2 seconds each takes 16 seconds minimum — unacceptable for a real-time user interaction. Stream intermediate reasoning traces to the UI while execution proceeds. LangChain's streaming callbacks and LangGraph's event streaming allow you to show "thinking..." indicators with visible reasoning steps, converting a 16-second wait into a perceived 2-second interaction with progressive disclosure of work being done.

Fallback to simpler patterns on timeout. Implement tiered fallback: attempt multi-agent orchestration with a 10-second timeout; if exceeded, fallback to single-agent ReAct with a 5-second timeout; if exceeded, fall back to direct CoT with a 2-second timeout. This ensures a response is always returned, with quality degrading gracefully under load rather than timing out. Log which tier served each request — a high fallback rate signals over-architected pattern selection for the actual query distribution.
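The tiered fallback can be sketched as an ordered list of (pattern, timeout) pairs tried in sequence. The tier functions here are stubs, and the naive after-the-fact timeout check stands in for real cancellation (e.g. asyncio tasks) in production:

```python
import time

# Sketch of tiered fallback: try each pattern in order with its own
# timeout and return the first success, recording which tier served the
# request. Tier functions are stubs; the after-the-fact timeout check
# stands in for real cancellation (e.g. asyncio tasks).

def with_timeout(fn, timeout_s: float):
    start = time.monotonic()
    result = fn()
    if time.monotonic() - start > timeout_s:
        raise TimeoutError(f"exceeded {timeout_s}s")
    return result

def answer_with_fallback(query: str, tiers: list) -> tuple:
    for name, fn, timeout_s in tiers:
        try:
            return name, with_timeout(lambda: fn(query), timeout_s)
        except TimeoutError:
            continue  # degrade to the next, simpler tier
    raise RuntimeError("All tiers failed")

def slow_multi_agent(query: str) -> str:
    raise TimeoutError("simulated: multi-agent run exceeded its budget")

tiers = [
    ("multi_agent", slow_multi_agent, 10.0),
    ("react", lambda q: f"react answer to {q}", 5.0),
    ("cot", lambda q: f"cot answer to {q}", 2.0),
]
tier, answer = answer_with_fallback("status?", tiers)
```

Returning the tier name alongside the answer gives you the per-request logging the text recommends: a rising fallback rate is visible directly in the metrics.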

Context window overflow. Long ReAct traces, accumulated Reflexion reflections, and multi-agent state can collectively exceed even 128K context window models. Implement sliding window truncation (keep first system prompt + last N steps), summarize old steps before truncating, and treat context management as a first-class concern, not an afterthought.
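Sliding-window truncation can be sketched as a pure function over the message list; the placeholder summary line here stands in for a real LLM-generated summary of the dropped steps:

```python
# Sketch of sliding-window context truncation: keep the system prompt
# plus the last N steps, replacing the dropped middle with a placeholder
# line (a real version would summarize those steps with an LLM call).

def truncate_context(messages: list, keep_last: int = 4) -> list:
    system, rest = messages[:1], messages[1:]
    if len(rest) <= keep_last:
        return messages  # still fits, nothing to drop
    dropped = len(rest) - keep_last
    summary = {"role": "user",
               "content": f"[{dropped} earlier steps summarized and omitted]"}
    return system + [summary] + rest[-keep_last:]

msgs = [{"role": "system", "content": "You are an agent."}] + [
    {"role": "user", "content": f"Observation {i}"} for i in range(10)
]
trimmed = truncate_context(msgs, keep_last=4)
```

Applying this before every LLM call, rather than only on overflow errors, keeps token spend predictable and avoids the mid-task failure mode where one oversized observation blows the context.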

FAQs: Agentic AI Patterns

Q: Can I combine multiple patterns in one agent?
A: Yes, and you often should. A production agent might use Plan-and-Execute at the top level, ReAct within each execution step, and Reflexion for code generation subtasks. The patterns compose hierarchically. Just be disciplined about cost budgets — composed patterns multiply LLM call counts.

Q: Is ReAct the same as function calling?
A: No. Function calling (or tool use) is a capability of the underlying model — the mechanism by which an LLM can invoke tools. ReAct is an architectural pattern that structures how reasoning and tool use are interleaved. You can implement ReAct using function calling, or using structured text parsing, or using a dedicated tool-use framework.

Q: How do I debug a multi-agent system?
A: Trace every LLM call with a unique span ID (using OpenTelemetry or LangSmith), log the full prompt and response for each call, record the agent state before and after each node execution, and implement replay capability so you can re-run a failing trace from any checkpoint. Debugging by reading production logs is insufficient — you need structured traces.

Q: What model should I use for each agent role?
A: Use your highest-capability model (GPT-4o, Claude 3.5 Sonnet) for the planner/supervisor role where reasoning quality determines overall task success. Use smaller, faster models (GPT-4o-mini, Claude 3 Haiku) for executor roles performing well-defined subtasks. This hybrid approach reduces cost by 60–80% compared to using the flagship model for every call.

Q: How do I prevent prompt injection in multi-agent systems?
A: Sanitize all external content (web pages, database results, API responses) before injecting it into agent context. Use structured output formats (JSON schema enforcement) so injected content cannot masquerade as system instructions. Implement a content safety classifier on observations before they enter the agent reasoning loop. Never give agents unrestricted write access to production systems — always require human-in-the-loop approval for destructive or irreversible actions.

Key Takeaways

  • Pattern choice is an economic and architectural decision: CoT costs 1 call; ToT costs 50+. Match pattern complexity to task value and latency constraints.
  • ReAct prevents hallucination through grounding: Interleaving observations with reasoning steps anchors the model in verified facts rather than generated plausibilities.
  • Self-consistency is a cheap accuracy booster: Running 7 CoT paths and majority-voting costs 7× but improves accuracy by 15–40% on hard reasoning tasks — often the best ROI pattern.
  • Reflexion needs hard depth limits: Without them, cost explosion and degrading critique quality make it worse than a single attempt beyond depth 3–5.
  • Multi-agent orchestration requires tracing: A graph of 5 agents each making 10 LLM calls produces 50 calls per request. Without structured distributed tracing, debugging production failures is effectively impossible.
  • Always implement graceful degradation: Tiered fallback from complex to simple patterns prevents timeout failures and ensures users always receive a response.
