Md Sanwar Hossain - Senior Software Engineer

Senior Software Engineer · Agentic AI · Spring Boot · Microservices

Agentic AI · March 19, 2026 · 20 min read · Agentic AI in Production Series

AI Agent State Machines: Designing Reliable Multi-Step Workflow Orchestration in Production

The majority of production AI agent failures trace back to a single architectural flaw: implicit, unmodelled control flow. When an LLM loop decides what to do next based solely on its last output, you get unpredictable ghost loops, silent partial failures, and agent runs that complete successfully on paper while producing corrupted results in practice. Explicit finite state machines applied to agent orchestration solve this by making every possible state and transition visible, auditable, and testable — before anything touches a real LLM call.

Table of Contents

  1. The Ghost-Loop Problem in Production Agents
  2. Finite State Machines as Orchestration Primitives
  3. Architecture: States, Transitions, and Guards
  4. Implementation Patterns in Spring Boot
  5. Failure Scenarios and Recovery Strategies
  6. Observability: Tracing Agent State Transitions
  7. Trade-offs and When NOT to Use FSMs
  8. Key Takeaways

1. The Ghost-Loop Problem in Production Agents

Consider a code-review agent tasked with: (1) fetching a pull request diff, (2) analyzing it with an LLM, (3) posting inline comments via the GitHub API, and (4) summarizing the review. The naive implementation is a while(hasMoreSteps) loop driven by the LLM's output. When the GitHub API returns a 429 rate-limit error during step 3, what happens?

In most production incidents we've analyzed, one of three failure modes occurs: the agent silently skips the failing comment (data loss), the agent retries infinitely until the token budget is exhausted (runaway cost), or the agent proceeds to the summary step and reports "review complete" despite posting only 30% of the intended comments (silent corruption). None of these failure modes trigger an alert — they look like successful runs to the metrics system.

Real incident: A SaaS developer-tooling platform deployed a code-review agent handling 200 PRs per day. After a GitHub API rate-limit incident, 34% of PR reviews completed with partial comments posted. Developers saw "Review complete" notifications but missed critical security findings. The issue was discovered two weeks later during a post-mortem on a merged vulnerability — not by the monitoring system.

The root cause is that the agent's control flow was encoded implicitly inside LLM prompts and conditional if-else chains scattered across the codebase. There was no single source of truth for "what state is this agent run in right now?" and no mechanism for recovering to a known safe state when a tool call failed.
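For concreteness, here is a minimal Java sketch of the shape this anti-pattern usually takes. All names (`Llm`, `Tool`, `NaiveReviewAgent`) are hypothetical illustrations, not the incident code:

```java
// Hypothetical sketch of the naive agent loop: the LLM's last output alone
// decides the next step, so the control flow lives inside the prompt.
interface Llm { String nextStep(String context); }
interface Tool { String run(String step) throws Exception; }

class NaiveReviewAgent {
    private final Llm llm;
    private final Tool tool;

    NaiveReviewAgent(Llm llm, Tool tool) { this.llm = llm; this.tool = tool; }

    String review(String prDiff) {
        String context = prDiff;
        while (true) {                               // no bound, no explicit state
            String step = llm.nextStep(context);
            if (step.equals("DONE")) return context; // who decides "done"? the model.
            try {
                context = context + "\n" + tool.run(step);
            } catch (Exception e) {
                // all three failure modes start here:
                // swallow (data loss), retry (runaway cost), or fall through
                // to the next iteration (silent corruption)
            }
        }
    }
}
```

Nothing in this loop can answer "what state is the run in?" — the rest of the article is about making that question answerable.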

2. Finite State Machines as Orchestration Primitives

A finite state machine (FSM) for an AI agent run defines: a finite set of states the agent can be in (e.g., FETCHING_DIFF, ANALYZING, POSTING_COMMENTS, SUMMARIZING, COMPLETED, FAILED), a set of transitions (edges between states triggered by events), and optional guard conditions that must be true for a transition to fire. The critical property is determinism: given the current state, an input event, and the guard outcomes, there is exactly one next state.
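Concretely, the state and event alphabets for the code-review agent can be captured as two Java enums. The constant names below match the transition table in Section 3; treat this as a sketch, not a prescribed layout:

```java
// All states the agent run can be in -- nothing else is representable.
enum AgentState {
    IDLE, FETCHING_DIFF, ANALYZING, POSTING_COMMENTS,
    AWAITING_RETRY, SUMMARIZING, COMPLETED, FAILED
}

// All events that can trigger a transition between those states.
enum AgentEvent {
    START_REVIEW, DIFF_LOADED, FETCH_ERROR, ANALYSIS_READY,
    LLM_TIMEOUT, COMMENTS_POSTED, RATE_LIMITED, RETRY_ELAPSED, SUMMARY_DONE
}
```

Making both sets closed enums is itself part of the safety argument: an agent cannot wander into a state the compiler has never heard of.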

This is not a theoretical constraint — it is what enables you to answer "where exactly did this agent run fail?" from your database rather than from log archaeology. Every state transition is a persisted event. Replaying those events reconstructs the full agent execution history.

The FSM approach draws from well-established patterns: Spring State Machine, XState (JavaScript), and the Temporal workflow engine all implement variants of this model. The key insight for AI agents is that LLM calls are just transitions — they are external I/O that may succeed, fail, or return unexpected output, and the FSM defines what happens in each case explicitly.

3. Architecture: States, Transitions, and Guards

A production-grade code-review agent FSM has the following state graph:

States:
  IDLE, FETCHING_DIFF, ANALYZING, POSTING_COMMENTS,
  AWAITING_RETRY, SUMMARIZING, COMPLETED, FAILED

Happy path:
  IDLE → FETCHING_DIFF → ANALYZING → POSTING_COMMENTS → SUMMARIZING → COMPLETED
  (AWAITING_RETRY loops back into POSTING_COMMENTS; any state → FAILED on
  unrecoverable error)

Transitions (event → target state, [guard]):
  IDLE             + START_REVIEW    → FETCHING_DIFF
  FETCHING_DIFF    + DIFF_LOADED     → ANALYZING
  FETCHING_DIFF    + FETCH_ERROR     → FAILED
  ANALYZING        + ANALYSIS_READY  → POSTING_COMMENTS
  ANALYZING        + LLM_TIMEOUT     → ANALYZING        [retries < 3]
  ANALYZING        + LLM_TIMEOUT     → FAILED           [retries >= 3]
  POSTING_COMMENTS + COMMENTS_POSTED → SUMMARIZING
  POSTING_COMMENTS + RATE_LIMITED    → AWAITING_RETRY   [retries < 5]
  POSTING_COMMENTS + RATE_LIMITED    → FAILED           [retries >= 5]
  AWAITING_RETRY   + RETRY_ELAPSED   → POSTING_COMMENTS
  SUMMARIZING      + SUMMARY_DONE    → COMPLETED

The guard condition [retries < 3] is the key mechanism that converts the "infinite retry" failure mode into a bounded, deterministic behavior. The guard reads a counter from the agent's persistent context — not from in-memory state — so a JVM crash and restart doesn't reset the counter and allow infinite retries after recovery.
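The guard logic itself is a few lines. Here is a plain-Java sketch of the bounded-retry predicate, assuming the persisted agent context is exposed as a key-value map (the key name "llmRetries" is illustrative; Spring State Machine's Guard interface wraps exactly this kind of predicate):

```java
import java.util.Map;

final class RetryGuards {
    private RetryGuards() {}

    /**
     * True while another retry is allowed. The counter is read from the
     * persisted context, never from an in-memory field, so a crash and
     * restart cannot reset it.
     */
    static boolean retriesUnder(Map<String, Object> persistedContext,
                                String counterKey, int max) {
        Object v = persistedContext.get(counterKey);
        int retries = (v instanceof Integer) ? (Integer) v : 0; // absent = 0
        return retries < max;
    }
}
```

The matching increment action writes the bumped counter back to the same persisted map before the retry transition completes.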

Each state also carries an associated entry action: FETCHING_DIFF calls the GitHub diff API; ANALYZING calls the LLM with the diff as context; POSTING_COMMENTS iterates through the LLM-generated comments and posts each one. Entry actions must be idempotent, because re-entering a state after a failure re-executes them — and that re-execution has to be safe.
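The idempotency check described for POSTING_COMMENTS can be sketched as follows. CommentStore is a hypothetical abstraction over the GitHub comment thread, and the deterministic external reference ID is the mechanism that makes re-execution safe:

```java
import java.util.Map;

// Hypothetical abstraction over the GitHub comment thread.
interface CommentStore {
    boolean exists(String externalRefId);
    void post(String externalRefId, String body);
}

final class IdempotentCommentPoster {
    private final CommentStore store;

    IdempotentCommentPoster(CommentStore store) { this.store = store; }

    /** Posts each comment at most once; returns how many were newly created. */
    int postAll(Map<String, String> commentsByRefId) {
        int created = 0;
        for (Map.Entry<String, String> e : commentsByRefId.entrySet()) {
            if (!store.exists(e.getKey())) {   // skip already-posted comments
                store.post(e.getKey(), e.getValue());
                created++;
            }
        }
        return created;
    }
}
```

Re-running postAll after a crash touches only the comments that were never created, which is exactly the behavior Scenario 2 in Section 5 relies on.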

4. Implementation Patterns in Spring Boot

Spring State Machine (spring-statemachine-core) provides a production-ready FSM implementation. Here is the agent FSM configuration for our code-review agent:

@Configuration
@EnableStateMachineFactory
public class ReviewAgentStateMachineConfig
        extends StateMachineConfigurerAdapter<AgentState, AgentEvent> {

    @Override
    public void configure(StateMachineStateConfigurer<AgentState, AgentEvent> states)
            throws Exception {
        states.withStates()
            .initial(AgentState.IDLE)
            .states(EnumSet.allOf(AgentState.class))
            .end(AgentState.COMPLETED)
            .end(AgentState.FAILED);
    }

    @Override
    public void configure(StateMachineTransitionConfigurer<AgentState, AgentEvent> transitions)
            throws Exception {
        // AgentState and AgentEvent constants are statically imported
        transitions
            .withExternal().source(IDLE).target(FETCHING_DIFF)
                .event(START_REVIEW).action(fetchDiffAction())
            .and()
            .withExternal().source(FETCHING_DIFF).target(ANALYZING)
                .event(DIFF_LOADED).action(analyzeAction())
            .and()
            .withExternal().source(FETCHING_DIFF).target(FAILED)
                .event(FETCH_ERROR)
            .and()
            .withExternal().source(ANALYZING).target(ANALYZING)
                .event(LLM_TIMEOUT).guard(retriesUnder3Guard())
                .action(incrementRetryAction())
            .and()
            .withExternal().source(ANALYZING).target(FAILED)
                .event(LLM_TIMEOUT).guard(retriesAtLeast3Guard())
            .and()
            .withExternal().source(ANALYZING).target(POSTING_COMMENTS)
                .event(ANALYSIS_READY).action(postCommentsAction())
            .and()
            .withExternal().source(POSTING_COMMENTS).target(AWAITING_RETRY)
                .event(RATE_LIMITED).guard(retriesUnder5Guard())
            .and()
            .withExternal().source(POSTING_COMMENTS).target(FAILED)
                .event(RATE_LIMITED).guard(retriesAtLeast5Guard())
            .and()
            .withExternal().source(AWAITING_RETRY).target(POSTING_COMMENTS)
                .event(RETRY_ELAPSED).action(postCommentsAction())
            .and()
            .withExternal().source(POSTING_COMMENTS).target(SUMMARIZING)
                .event(COMMENTS_POSTED).action(summarizeAction())
            .and()
            .withExternal().source(SUMMARIZING).target(COMPLETED)
                .event(SUMMARY_DONE);
    }
}

The key production addition is persisting the FSM state to a database after every transition. Spring State Machine's StateMachinePersist interface enables this. Store the state machine's serialized context in an agent_runs table keyed by a run UUID. This allows any node in your cluster to resume a failed agent run from the correct state rather than restarting from scratch.

@Component
public class JdbcStateMachinePersist
        implements StateMachinePersist<AgentState, AgentEvent, String> {

    private final AgentRunRepository repo;

    public JdbcStateMachinePersist(AgentRunRepository repo) {
        this.repo = repo;
    }

    @Override
    public void write(StateMachineContext<AgentState, AgentEvent> ctx, String runId)
            throws Exception {
        byte[] serialized = serialize(ctx);
        repo.upsertContext(runId, serialized, ctx.getState().name());
        // also emit a state_transition event to the audit log
    }

    @Override
    public StateMachineContext<AgentState, AgentEvent> read(String runId)
            throws Exception {
        return deserialize(repo.findContextByRunId(runId));
    }

    // serialize/deserialize helpers elided (e.g. a Kryo-based serializer)
}

5. Failure Scenarios and Recovery Strategies

Scenario 1 — LLM timeout mid-analysis: The LLM_TIMEOUT event fires. The FSM checks the retriesUnder3Guard. If true, it transitions back to ANALYZING (re-entering the state) with an incremented retry counter and an exponential backoff delay before re-invoking the LLM. If the guard is false (3 retries exhausted), it transitions to FAILED and publishes a REVIEW_FAILED domain event that triggers a notification to the requesting developer.
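The backoff delay for the ANALYZING self-transition can be computed as a pure function. The base delay and cap below are illustrative values, and production code would usually add jitter:

```java
import java.time.Duration;

final class Backoff {
    private Backoff() {}

    /**
     * Delay before retry attempt n (1-based): base * 2^(n-1), capped.
     * Attempts are already bounded by the FSM guard (here, max 3),
     * so the shift cannot overflow in practice.
     */
    static Duration forAttempt(int attempt, Duration base, Duration cap) {
        long millis = base.toMillis() << (attempt - 1);
        return millis < cap.toMillis() ? Duration.ofMillis(millis) : cap;
    }
}
```

With a 2-second base, attempts 1-3 wait 2s, 4s, and 8s before re-entering ANALYZING; the guard then routes the fourth timeout to FAILED.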

Scenario 2 — JVM crash during POSTING_COMMENTS: The agent posts 15 of 20 comments, then the pod crashes. On restart, the FSM is restored from the database to POSTING_COMMENTS state. The action re-executes. Because the comment-posting action is idempotent (it checks whether each comment's unique external reference ID already exists in the GitHub thread before posting), only the 5 unposted comments are created. No duplicates.

Scenario 3 — Rate-limiting loop: GitHub throttles the agent. The FSM transitions to AWAITING_RETRY and schedules a retry after the interval given by the Retry-After header. The bounded guard caps rate-limit retries at five before giving up. Without the FSM, this scenario produced infinite retry loops that consumed the entire daily API quota within an hour.
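Honoring Retry-After takes only a small parsing helper. A sketch, assuming the delay-seconds form of the header and an illustrative 60-second fallback (the header may also carry an HTTP-date, which this sketch treats as malformed):

```java
import java.time.Duration;

final class RetryAfter {
    private RetryAfter() {}

    /** Parse a Retry-After header value in seconds; fall back on anything else. */
    static Duration parse(String headerValue, Duration fallback) {
        if (headerValue == null) return fallback;
        try {
            long seconds = Long.parseLong(headerValue.trim());
            return seconds >= 0 ? Duration.ofSeconds(seconds) : fallback;
        } catch (NumberFormatException e) {
            return fallback;   // HTTP-date form or garbage: use the fallback
        }
    }
}
```

The returned Duration becomes the AWAITING_RETRY timer; when it elapses, the RETRY_ELAPSED event fires and the FSM re-enters POSTING_COMMENTS.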

Key insight: Recovery strategies are first-class citizens of the state graph, not afterthoughts in catch blocks. This means every engineer working on the agent can see the complete failure-handling logic by reading the state machine configuration — no need to hunt through nested exception handlers spread across multiple service classes.

6. Observability: Tracing Agent State Transitions

Every state transition emits a structured log event with: run_id, from_state, to_state, event, guard_result, duration_ms, and any action-specific context (LLM model used, tokens consumed, tool call result). These events are shipped to your observability stack (OpenTelemetry → Jaeger/Tempo) as span events on a per-run trace.
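The per-transition event payload can be modeled as a simple value type whose fields mirror the list above. The record below is a sketch of the shape, not a prescribed schema:

```java
import java.util.Map;

// One structured event per state transition, shipped to the trace backend.
record StateTransitionEvent(
        String runId,
        String fromState,
        String toState,
        String event,
        boolean guardResult,
        long durationMs,
        Map<String, String> actionContext   // e.g. model, tokens, tool result
) {}
```

Because every field is a flat scalar or string map, the same object serializes cleanly as a log line, a span event, or a row in the audit table.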

This gives you a complete execution trace for every agent run without adding instrumentation to individual action classes. The Spring State Machine listener interface makes this a one-time cross-cutting concern:

@Component
public class AgentTracingListener
        extends StateMachineListenerAdapter<AgentState, AgentEvent> {

    private final Tracer tracer;
    private final MeterRegistry meterRegistry;

    public AgentTracingListener(Tracer tracer, MeterRegistry meterRegistry) {
        this.tracer = tracer;
        this.meterRegistry = meterRegistry;
    }

    @Override
    public void transition(Transition<AgentState, AgentEvent> transition) {
        Span currentSpan = tracer.currentSpan();
        // source is null for the machine's initial transition
        if (currentSpan != null && transition.getSource() != null) {
            String trigger = Optional.ofNullable(transition.getTrigger())
                .map(t -> t.getEvent().name()).orElse("INTERNAL");
            // one span event per transition gives a full run timeline;
            // span tags would be overwritten on each transition, so the
            // details go into the event value instead
            currentSpan.event("agent.state.transition "
                + transition.getSource().getId().name()
                + " -> " + transition.getTarget().getId().name()
                + " (" + trigger + ")");
        }
    }

    @Override
    public void stateMachineError(StateMachine<AgentState, AgentEvent> sm,
                                  Exception exception) {
        // emit alert metric: agent_run_error_total{state=...}
        meterRegistry.counter("agent.run.error",
            "state", sm.getState().getId().name()).increment();
    }
}

With this in place, your Grafana dashboard can show: average time spent in each state per agent type, error rates by transition (which specific step fails most often), retry counts over time (early warning of external API degradation), and the percentage of runs reaching COMPLETED vs FAILED — giving you an SLO-style reliability metric for every agent workflow.

7. Trade-offs and When NOT to Use FSMs

When FSMs shine: Agents with a bounded, well-defined set of steps. Long-running agents (minutes to hours) where crash recovery is required. Multi-step workflows with complex retry and compensation logic. Any agent where you need per-run audit trails for compliance.

When FSMs become a burden: Highly dynamic agents where the number of steps is determined at runtime by the LLM itself (open-ended planning agents). Pure conversational agents with no tool calls and no side effects — simple stateless request-response loops don't need FSM overhead. Prototype/research agents where iteration speed matters more than reliability.

Alternatives: For open-ended planning, Temporal workflows provide durable execution with more flexible control flow than a strict FSM. For simple sequential pipelines, LangGraph (Python) or Spring AI's tool-calling loop with explicit retry policies may be sufficient without the full FSM machinery.

Hidden cost: Every new agent feature requires updating the state graph. Teams must resist the temptation to add "just one more" state that bypasses the guard conditions. Periodic FSM graph reviews — similar to schema migration reviews — help keep the model maintainable.

8. Key Takeaways

AI agents in production are distributed systems that happen to involve an LLM. They are subject to the same classes of failures — partial writes, network timeouts, process crashes, and external API degradation — that every distributed system engineer has learned to handle with explicit state management, idempotency, and bounded retries. Applying finite state machine principles to agent orchestration is not over-engineering; it is applying decades of hard-won distributed systems knowledge to a new problem domain.

Start by drawing your agent's state graph on a whiteboard before writing any code. Every box is a state, every arrow is a transition, every diamond is a guard. If you can't draw the graph, you don't yet understand the failure modes. Once the graph is clear, the implementation — whether via Spring State Machine, Temporal, or a custom FSM — is straightforward. The architecture diagram is the specification.


Last updated: March 2026 — Written by Md Sanwar Hossain