LLM Gateway in Production: Provider Fallbacks, Semantic Caching & Cost Control with LiteLLM and Spring AI 2026
Running LLM workloads in production without a gateway is like running microservices without an API gateway — it works until it doesn't. Provider outages, runaway token costs, and sensitive PII leaking into third-party APIs are all real production incidents. This guide shows you how to build and operate an LLM gateway that handles provider fallbacks, semantic caching, rate limiting, PII redaction, and intelligent model routing using LiteLLM and Spring AI.
TL;DR — What an LLM Gateway Buys You
"An LLM gateway is a unified control plane that sits between your application and multiple LLM providers. It delivers provider fallbacks (99.9%+ availability), semantic caching (30–50% cost reduction), PII redaction (compliance), rate limiting (cost governance), and intelligent model routing (2–5× cost savings) — all without changing application code."
Table of Contents
- What Is an LLM Gateway?
- Multi-Provider Routing & Fallbacks
- Semantic Caching: 30–50% Cost Reduction
- Cost Tracking & Token Budgets
- Rate Limiting & Quota Management
- PII Redaction & Prompt Safety
- Intelligent Model Routing
- LiteLLM Proxy Setup & Java Integration
- Observability: Tracing, Metrics & Alerts
- Production Patterns & Anti-Patterns
- Conclusion & Implementation Checklist
1. What Is an LLM Gateway?
An LLM gateway is a reverse proxy purpose-built for large language model traffic. Just as an API gateway normalises HTTP requests across backend microservices, an LLM gateway normalises prompt/completion traffic across multiple AI providers — OpenAI, Anthropic, Google Gemini, Mistral, AWS Bedrock, Azure OpenAI — through a single, vendor-neutral endpoint.
Without a gateway, each team that ships an LLM feature directly hard-codes provider credentials, invents its own retry logic, and has no shared visibility into token spend. By the time you have five such teams, you have five different cost centres, five different failure modes, and zero centralised governance. The LLM gateway is the answer to that organisational and operational sprawl.
Six Core Capabilities
- Provider abstraction: Single OpenAI-compatible API endpoint that routes to any backend. Application code never changes when you add or swap providers.
- Fallback chains: If the primary provider returns a 5xx or exceeds rate limits, the gateway automatically retries the next provider in the chain without surfacing errors to the caller.
- Semantic caching: Semantically similar prompts return cached completions, eliminating redundant LLM calls. Typical cache hit rates of 20–50% in production.
- Cost tracking & budgets: Every token counted, attributed to a tenant/team, and checked against configurable budgets. Enforced throttling when budgets are exceeded.
- PII redaction: Outbound prompts are scanned for personally identifiable information. Sensitive fields are masked before the request leaves your network perimeter.
- Intelligent routing: Route simple queries to smaller, cheaper models (GPT-4o-mini, Gemini Flash) and complex queries to frontier models (GPT-4o, Claude Opus) based on automated complexity classification.
LLM Gateway vs API Gateway vs Service Mesh
| Capability | API Gateway | Service Mesh | LLM Gateway |
|---|---|---|---|
| Traffic layer | HTTP/REST/gRPC | Service-to-service (L4/L7) | LLM prompt/completion |
| Rate limiting | Requests/sec | Requests/sec | Tokens/min per model |
| Caching | Exact-match HTTP cache | None | Semantic vector cache |
| Cost awareness | None | None | Per-token billing, budgets |
| PII handling | Custom middleware | None | Built-in redaction filters |
| Provider fallback | Manual circuit breaker | Envoy outlier detection | Native multi-provider failover |
The key insight is that LLM traffic has fundamentally different cost and failure semantics than regular HTTP traffic. A single LLM call can cost $0.001–$0.15 and take 2–30 seconds. A service mesh has no awareness of either dimension. An LLM gateway is built from the ground up for this reality.
2. Multi-Provider Routing & Fallbacks
Relying on a single LLM provider is a single point of failure. OpenAI has had multiple high-profile outages that lasted hours. Anthropic, Google, and AWS Bedrock are not immune either. A production LLM gateway must implement provider fallback chains with per-model circuit breakers.
Primary/Backup Provider Chains
A fallback chain defines an ordered list of provider/model combinations. When the gateway receives a 429 (rate limit), 500, 502, or 503 from the primary provider, it immediately retries the next entry in the chain. The caller receives a successful response — or, if all providers fail, a single structured error after exhausting the chain.
- Primary: OpenAI GPT-4o — lowest latency for most workloads
- Fallback 1: Anthropic Claude 3.5 Sonnet — comparable quality, different failure domain
- Fallback 2: Google Gemini Pro — geographic diversity, good availability SLA
- Fallback 3: AWS Bedrock Claude — private VPC endpoint, no internet dependency
Circuit Breaker Per Model
A naive retry-on-failure approach is not enough. If OpenAI is experiencing a partial outage with 30% of requests failing and 10-second timeouts before failure, blindly retrying every request triples your p99 latency. A circuit breaker tracks error rate over a sliding window and trips to OPEN state when the threshold is crossed — routing all traffic to the next provider without attempting the failed one at all.
Spring AI Retry Configuration with Fallback
# application.yml — Spring AI multi-provider fallback config
spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      base-url: http://localhost:4000   # LiteLLM proxy
      chat:
        options:
          model: gpt-4o
          temperature: 0.7
          max-tokens: 2048
    anthropic:
      api-key: ${ANTHROPIC_API_KEY}     # required by the fallback AnthropicChatModel below
// LlmGatewayConfig.java
@Configuration
public class LlmGatewayConfig {
@Bean
@Primary
public ChatClient primaryChatClient(OpenAiChatModel openAiChatModel) {
return ChatClient.builder(openAiChatModel)
.defaultAdvisors(new SimpleLoggerAdvisor())
.build();
}
@Bean("fallbackChatClient")
public ChatClient fallbackChatClient(AnthropicChatModel anthropicModel) {
return ChatClient.builder(anthropicModel)
.defaultAdvisors(new SimpleLoggerAdvisor())
.build();
}
}
// LlmGatewayService.java
@Service
@Slf4j
public class LlmGatewayService {
private final ChatClient primaryClient;
private final ChatClient fallbackClient;
private final CircuitBreakerRegistry circuitBreakerRegistry;
    public LlmGatewayService(
            ChatClient primaryClient,   // unqualified: the @Primary bean is injected by default
            @Qualifier("fallbackChatClient") ChatClient fallbackClient,
            CircuitBreakerRegistry circuitBreakerRegistry) {
        this.primaryClient = primaryClient;
        this.fallbackClient = fallbackClient;
        this.circuitBreakerRegistry = circuitBreakerRegistry;
    }
    public String complete(String userMessage) {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("openai-primary");
        // Decorate the primary call; an open circuit or provider error triggers the fallback
        Supplier<String> primaryCall = CircuitBreaker.decorateSupplier(
                cb, () -> callWithClient(primaryClient, userMessage));
        try {
            return primaryCall.get();
        } catch (Exception e) {
            log.warn("Primary provider failed, falling back: {}", e.getMessage());
            return callWithClient(fallbackClient, userMessage);
        }
    }
private String callWithClient(ChatClient client, String message) {
return client.prompt()
.user(message)
.call()
.content();
}
}
The circuit breaker is configured with Resilience4j. A sliding window of 20 calls, a failure rate threshold of 50%, and a wait duration of 30 seconds in OPEN state before attempting HALF_OPEN recovery are sensible defaults for LLM providers. With LiteLLM as the gateway proxy, this fallback logic can also be centralised in the proxy itself, making the Spring AI client code even simpler.
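Those defaults map directly onto Resilience4j's Spring Boot configuration properties. A sketch (the instance name `openai-primary` matches the service code above; values are the starting points suggested in the text, not tuned constants):

```yaml
# application.yml — Resilience4j circuit breaker defaults for the primary provider
resilience4j:
  circuitbreaker:
    instances:
      openai-primary:
        sliding-window-type: COUNT_BASED
        sliding-window-size: 20                       # last 20 calls
        failure-rate-threshold: 50                    # trip at 50% failures
        wait-duration-in-open-state: 30s              # stay OPEN for 30s
        permitted-number-of-calls-in-half-open-state: 3
```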
3. Semantic Caching: 30–50% Cost Reduction
Exact-match HTTP caching is nearly useless for LLM traffic because users never phrase the same question with identical wording. Semantic caching solves this by comparing the meaning of incoming prompts to cached prompts using vector similarity. If a new prompt is semantically close enough to a cached one (above a configurable cosine similarity threshold), the cached response is returned without calling the LLM at all.
How Vector Similarity Caching Works
- Embed the incoming prompt using a fast, cheap embedding model (text-embedding-3-small, ~$0.00002/1K tokens).
- Query Redis (with RediSearch vector index) for the nearest cached prompt embedding within a similarity threshold.
- On cache hit (similarity ≥ threshold): Return cached completion immediately. Latency drops from ~1–5s to <10ms. Cost: $0.
- On cache miss: Call the LLM, store the result with the embedding in Redis with a configurable TTL.
- Invalidate based on TTL (for time-sensitive content) or manually (after knowledge base updates).
Cache Hit Ratio Math
Assume: 100,000 LLM requests/day at $0.005 average cost = $500/day.
With 35% semantic cache hit rate: 65,000 live calls × $0.005 = $325/day (35% saving).
Monthly saving: ~$5,250. Embedding cost for 100K queries: ~$2/day. Net ROI is overwhelmingly positive from day one.
Cosine Similarity Threshold Tuning
The similarity threshold is the most sensitive hyperparameter. Too high (e.g., 0.99) and you get near-zero cache hits. Too low (e.g., 0.80) and semantically different prompts return wrong answers. In practice, 0.92–0.95 works well for FAQ and customer support use cases, while 0.97–0.99 is safer for precise technical or legal contexts. Always A/B test threshold changes against a golden eval set before deploying.
Java Semantic Cache Implementation
// SemanticCacheService.java — Redis vector cache for LLM completions
@Service
@Slf4j
@RequiredArgsConstructor
public class SemanticCacheService {
private final EmbeddingModel embeddingModel; // Spring AI EmbeddingModel
private final RedisTemplate<String, String> redis;
private final ObjectMapper objectMapper;
private static final double SIMILARITY_THRESHOLD = 0.93;
private static final Duration CACHE_TTL = Duration.ofHours(24);
private static final String CACHE_KEY_PREFIX = "llm:cache:";
public Optional<String> findCachedResponse(String prompt) {
float[] queryEmbedding = embeddingModel.embed(prompt);
// In production use RediSearch KNN: FT.SEARCH llm-idx
// "*=>[KNN 1 @embedding $vec AS score]"
// Here we show a simplified scan-based approach for illustration
Set<String> keys = redis.keys(CACHE_KEY_PREFIX + "*");
if (keys == null) return Optional.empty();
for (String key : keys) {
String json = redis.opsForValue().get(key);
if (json == null) continue;
try {
CacheEntry entry = objectMapper.readValue(json, CacheEntry.class);
double similarity = cosineSimilarity(queryEmbedding, entry.embedding());
if (similarity >= SIMILARITY_THRESHOLD) {
return Optional.of(entry.completion());
}
} catch (Exception ignored) {}
}
return Optional.empty();
}
public void cacheResponse(String prompt, String completion) {
float[] embedding = embeddingModel.embed(prompt);
String key = CACHE_KEY_PREFIX + UUID.randomUUID();
CacheEntry entry = new CacheEntry(prompt, completion, embedding, Instant.now());
try {
redis.opsForValue().set(key, objectMapper.writeValueAsString(entry), CACHE_TTL);
} catch (JsonProcessingException e) {
log.warn("Failed to cache LLM response", e);
}
}
private double cosineSimilarity(float[] a, float[] b) {
double dot = 0, normA = 0, normB = 0;
for (int i = 0; i < a.length; i++) {
dot += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
public record CacheEntry(String prompt, String completion,
float[] embedding, Instant cachedAt) {}
}
In production, replace the key-scan loop with a proper RediSearch KNN query for O(log N) lookups instead of O(N). LiteLLM's built-in Redis semantic cache also provides this natively if you prefer not to build it yourself.
4. Cost Tracking & Token Budgets
Without cost tracking, LLM costs are invisible until the monthly bill arrives. By then, a runaway agent loop or an unintentionally verbose system prompt may have burned thousands of dollars. The gateway must track cost in real time and enforce tenant-level token budgets.
Per-Tenant Token Budgets
Every request carries a tenant identifier (from the API key, JWT claim, or request header). The gateway maintains a counter per tenant per time window (hourly, daily, monthly). When the budget is exhausted, requests return 429 Too Many Requests with a Retry-After header, preventing cost overruns without requiring code changes in the application.
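The core of budget enforcement is a shared counter per tenant per window. A minimal in-memory sketch of the idea (class and method names are illustrative; a production gateway would back this with Redis so all replicas share state):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Per-tenant token budget: consume tokens until the window budget is exhausted,
// then reject (the caller translates a rejection into HTTP 429 + Retry-After).
public class TokenBudget {
    private final long budgetPerWindow;
    private final Map<String, AtomicLong> usedByTenant = new ConcurrentHashMap<>();

    public TokenBudget(long budgetPerWindow) {
        this.budgetPerWindow = budgetPerWindow;
    }

    /** Returns true and records usage if the tenant still has budget; false means reject. */
    public boolean tryConsume(String tenant, long tokens) {
        AtomicLong used = usedByTenant.computeIfAbsent(tenant, t -> new AtomicLong());
        long after = used.addAndGet(tokens);
        if (after > budgetPerWindow) {
            used.addAndGet(-tokens); // roll back: the request is rejected, not billed
            return false;
        }
        return true;
    }

    /** Invoked by a scheduler at each window boundary (hourly, daily, monthly). */
    public void resetWindow() {
        usedByTenant.clear();
    }
}
```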
Cost Per Request Logging
Every LLM response includes usage.prompt_tokens and usage.completion_tokens. The gateway multiplies these by the per-model pricing table (maintained in config, updated monthly) to produce a cost figure per request. This data is written to a time-series store (InfluxDB, Prometheus, or your data warehouse) for billing dashboards and anomaly detection.
Prometheus Metrics for Cost Tracking
// LlmCostMetrics.java — Micrometer metrics for token cost tracking
@Component
@Slf4j
@RequiredArgsConstructor
public class LlmCostMetrics {
private final MeterRegistry meterRegistry;
// Counters
private Counter tokenCounter(String tenant, String model, String type) {
return Counter.builder("llm.tokens.total")
.tag("tenant", tenant)
.tag("model", model)
.tag("type", type) // "prompt" or "completion"
.register(meterRegistry);
}
    private DistributionSummary costSummary(String tenant, String model) {
        return DistributionSummary.builder("llm.cost.usd")
            .tag("tenant", tenant)
            .tag("model", model)
            .baseUnit("USD")          // cost is recorded directly in dollars
            .register(meterRegistry);
    }
public void recordUsage(String tenant, String model,
int promptTokens, int completionTokens) {
tokenCounter(tenant, model, "prompt").increment(promptTokens);
tokenCounter(tenant, model, "completion").increment(completionTokens);
double cost = calculateCost(model, promptTokens, completionTokens);
costSummary(tenant, model).record(cost);
        // Alert if a single request exceeds $0.50
        if (cost > 0.50) {
            log.warn("High-cost LLM request: tenant={} model={} cost=${}",
                tenant, model, String.format("%.4f", cost));
            Metrics.counter("llm.high_cost_requests",
                "tenant", tenant, "model", model).increment();
        }
}
private double calculateCost(String model, int prompt, int completion) {
// Prices per 1M tokens (April 2026)
return switch (model) {
case "gpt-4o" -> (prompt * 5.0 + completion * 15.0) / 1_000_000;
case "gpt-4o-mini" -> (prompt * 0.15 + completion * 0.6) / 1_000_000;
            case "claude-3-5-sonnet" -> (prompt * 3.0 + completion * 15.0) / 1_000_000;
default -> (prompt * 2.0 + completion * 8.0) / 1_000_000;
};
}
}
Pair these Prometheus metrics with a Grafana dashboard that shows daily spend by tenant and model, top-10 costliest request types, and a burn-rate graph against monthly budget. Set alerting rules at 50%, 80%, and 95% of monthly budget with escalating notification channels.
5. Rate Limiting & Quota Management
LLM providers impose rate limits in two dimensions: requests per minute (RPM) and tokens per minute (TPM). A gateway that only counts requests will still blow through a provider's TPM limit if your prompts average 2,000 tokens each. Effective rate limiting must operate on tokens, not just requests.
Rate Limiting Hierarchy
- Per API key: Each application or team gets its own key with individual TPM/RPM limits. Prevents one team from starving others.
- Per model: GPT-4o has a lower TPM limit and higher cost than GPT-4o-mini. The gateway enforces per-model limits independently so cheap model usage doesn't consume expensive model quota.
- Burst allowance: A token-bucket algorithm allows short bursts above the steady-state rate (e.g., 2× the normal rate for up to 30 seconds) to handle legitimate traffic spikes without false throttling.
- Global ceiling: An organisation-wide hard limit that prevents a misconfigured loop from reaching provider limits and triggering unexpected charges.
Token-Bucket vs Sliding Window
Token-bucket accumulates tokens at a steady refill rate and consumes them per request. It allows bursts up to the bucket capacity, which matches real LLM usage patterns well. A sliding window counts tokens consumed over the last N seconds and rejects requests that would exceed the limit; unlike a fixed-window counter, it avoids the abrupt resets at window boundaries that let clients spend double their quota in a short span. For LLM gateways, a sliding-window counter with burst headroom offers the best balance.
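The token-bucket variant is easy to sketch. This is a minimal, single-node illustration (parameter names and the 2× burst factor are illustrative; a distributed gateway would keep the bucket state in Redis):

```java
// Token bucket for TPM limiting with burst headroom: the bucket refills at the
// steady-state rate and caps at `capacity`, which allows short bursts.
public class TokenBucket {
    private final double capacity;        // burst ceiling, e.g. 2x the steady rate
    private final double refillPerMillis; // steady-state refill rate
    private double available;
    private long lastRefill;

    public TokenBucket(double tokensPerMinute, double burstFactor) {
        this.capacity = tokensPerMinute * burstFactor;
        this.refillPerMillis = tokensPerMinute / 60_000.0;
        this.available = capacity;
        this.lastRefill = System.currentTimeMillis();
    }

    /** Consume `tokens` if available; false means throttle the request. */
    public synchronized boolean tryConsume(long tokens, long nowMillis) {
        available = Math.min(capacity,
                available + (nowMillis - lastRefill) * refillPerMillis);
        lastRefill = nowMillis;
        if (tokens <= available) {
            available -= tokens;
            return true;
        }
        return false;
    }
}
```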
Spring Boot Rate Limiting Middleware
// LlmRateLimitFilter.java — Token-based rate limiter using Redis
@Component
@Order(1)
@RequiredArgsConstructor
public class LlmRateLimitFilter implements Filter {
private final RedisTemplate<String, String> redis;
private final ApiKeyService apiKeyService;
    // Fixed one-minute window for simplicity: default of 100k tokens per API key
    private static final long TOKEN_LIMIT_PER_MINUTE = 100_000;
private static final Duration WINDOW = Duration.ofMinutes(1);
@Override
public void doFilter(ServletRequest req, ServletResponse res,
FilterChain chain) throws IOException, ServletException {
HttpServletRequest request = (HttpServletRequest) req;
String apiKey = request.getHeader("X-API-Key");
if (apiKey == null) {
((HttpServletResponse) res).sendError(401, "Missing API key");
return;
}
ApiKeyConfig config = apiKeyService.getConfig(apiKey);
long estimatedTokens = estimateTokens(request);
String counterKey = "rate:" + apiKey + ":" + currentWindowBucket();
        Long current = redis.opsForValue().increment(counterKey, estimatedTokens);
        if (current != null && current == estimatedTokens) {
            redis.expire(counterKey, WINDOW); // set TTL only on the window's first increment
        }
        long limit = config != null ? config.tokensPerMinute() : TOKEN_LIMIT_PER_MINUTE;
if (current != null && current > limit) {
HttpServletResponse response = (HttpServletResponse) res;
response.setHeader("X-RateLimit-Limit", String.valueOf(limit));
response.setHeader("X-RateLimit-Remaining", "0");
response.setHeader("Retry-After", "60");
response.sendError(429, "Token rate limit exceeded");
return;
}
long remaining = limit - (current == null ? 0 : current);
((HttpServletResponse) res).setHeader("X-RateLimit-Remaining",
String.valueOf(Math.max(0, remaining)));
chain.doFilter(req, res);
}
private long estimateTokens(HttpServletRequest request) {
// Read Content-Length as proxy; actual token counting done post-parse
int contentLength = request.getContentLength();
return Math.max(100, contentLength / 4); // ~4 chars per token heuristic
}
private String currentWindowBucket() {
return String.valueOf(System.currentTimeMillis() / 60_000);
}
}
6. PII Redaction & Prompt Safety
When your application sends user-generated content to a third-party LLM API, you are transferring that data outside your network perimeter. If that content contains personally identifiable information — names, email addresses, phone numbers, social security numbers, credit card numbers — you may be in violation of GDPR, HIPAA, PCI-DSS, or CCPA. The LLM gateway is the right enforcement point because it intercepts 100% of outbound LLM requests.
Two-Stage PII Filtering Pipeline
- Before-send filter (outbound): Scans the prompt before it leaves the gateway. PII entities are replaced with typed placeholders: [EMAIL_1], [PHONE_1], [SSN_1]. The mapping (placeholder → real value) is stored locally in the request context.
- After-receive filter (inbound): Scans the completion for any PII that leaked through (e.g., the model echoed back a phone number from the prompt). Also restores placeholder values in the completion if the downstream consumer needs them.
Detection Strategy: Regex + NER
A two-layer detection approach maximises recall while minimising false positives:
- Regex patterns for high-precision structured PII: email addresses, phone numbers (E.164 + national formats), credit card numbers (Luhn-validated), SSN/SIN/NIN patterns, IP addresses, passport numbers.
- NER (Named Entity Recognition) via a lightweight on-premise model (spaCy, AWS Comprehend, or Microsoft Presidio) for unstructured PII: person names, organisation names, addresses, dates of birth. Runs as a sidecar so it adds <10ms latency.
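The regex layer of the before-send filter can be sketched as follows. Class and method names are illustrative, and the patterns are deliberately simplified (a real deployment needs Luhn validation for card numbers, full E.164 handling, and so on):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Replaces structured PII with typed placeholders ([EMAIL_1], [PHONE_1], ...)
// and keeps the placeholder -> value mapping in the request context.
public class RegexPiiRedactor {
    // Ordered so SSN-shaped strings are claimed before the phone pattern runs.
    private static final List<Map.Entry<String, Pattern>> PATTERNS = List.of(
        Map.entry("EMAIL", Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")),
        Map.entry("SSN", Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b")),
        Map.entry("PHONE", Pattern.compile(
            "(\\+?\\d{1,3}[ -]?)?\\(?\\d{3}\\)?[ -]?\\d{3}[ -]?\\d{4}"))
    );

    public record Redaction(String redactedText, Map<String, String> placeholderToValue) {}

    public static Redaction redact(String prompt) {
        Map<String, String> mapping = new LinkedHashMap<>();
        String text = prompt;
        for (Map.Entry<String, Pattern> entry : PATTERNS) {
            int counter = 1;
            Matcher m = entry.getValue().matcher(text);
            StringBuilder sb = new StringBuilder();
            while (m.find()) {
                String placeholder = "[" + entry.getKey() + "_" + counter++ + "]";
                mapping.put(placeholder, m.group());
                m.appendReplacement(sb, Matcher.quoteReplacement(placeholder));
            }
            m.appendTail(sb);
            text = sb.toString();
        }
        return new Redaction(text, mapping);
    }
}
```

The mapping never leaves the gateway; only the redacted text is sent to the provider, and the after-receive filter uses the same mapping to restore values when needed.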
Audit Trail for Compliance
Every redaction event must be logged to an immutable audit store. The audit record contains: timestamp, tenant ID, request ID, PII type detected (not the value), redaction action taken, and whether any PII was detected in the response. This audit trail is essential for GDPR Article 30 (Records of Processing Activities) and for incident response if a data breach is suspected. Store the audit log separately from application logs with a 12-month retention minimum.
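The audit record's shape follows directly from those fields. A sketch (the record name and the type-name guard are illustrative, not from any compliance framework):

```java
import java.time.Instant;
import java.util.List;

// Immutable audit record: it carries PII *types* only, never the values themselves.
public record PiiAuditEvent(
        Instant timestamp,
        String tenantId,
        String requestId,
        List<String> detectedTypes,   // e.g. ["EMAIL", "PHONE"]
        String action,                // e.g. "REDACTED", "BLOCKED", "NONE"
        boolean piiFoundInResponse) {

    /** Guard against accidentally logging a raw value instead of a type name. */
    public PiiAuditEvent {
        for (String t : detectedTypes) {
            if (!t.matches("[A-Z_]+")) {
                throw new IllegalArgumentException(
                    "detectedTypes must contain type names, not PII values");
            }
        }
    }
}
```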
7. Intelligent Model Routing
Not every query requires a frontier model. "What are your business hours?" does not need GPT-4o — it needs a fast, cheap answer. Intelligent model routing uses a lightweight classifier to assess query complexity and route it to the most cost-effective model that can handle it correctly.
Complexity-Based Routing Tiers
- Tier 1 — Simple: Single-fact lookup, FAQ, classification. Route to GPT-4o-mini or Gemini Flash. Average cost: ~$0.0001/request.
- Tier 2 — Moderate: Multi-step reasoning, short summarisation, code completion. Route to Claude Haiku or Gemini Pro. Average cost: ~$0.0008/request.
- Tier 3 — Complex: Long-form reasoning, code review, complex document analysis. Route to GPT-4o or Claude Sonnet. Average cost: ~$0.005/request.
- Tier 4 — Expert: Research synthesis, complex multi-document reasoning, specialised domain tasks. Route to GPT-4o with extended context or Claude Opus. Average cost: ~$0.020/request.
Model Routing Cost Savings Table
| Traffic Mix | Without Routing | With Routing | Saving |
|---|---|---|---|
| 50% Tier-1, 30% Tier-2, 20% Tier-3 | $500/day (100K req, all GPT-4o) | ~$129/day | ~74% |
| 70% Tier-1, 20% Tier-2, 10% Tier-3 | $500/day | ~$73/day | ~85% |
| 30% Tier-1, 40% Tier-2, 30% Tier-3 | $500/day | ~$185/day | ~63% |
| 20% Tier-1, 30% Tier-2, 40% Tier-3, 10% Tier-4 | $500/day | ~$426/day | ~15% |
These figures follow directly from the per-tier average costs above at 100K requests/day. Note that routing Tier-4 traffic to a more expensive expert model reduces the headline saving: on that slice you are trading cost for answer quality, not saving money.
The routing classifier itself can be a small fine-tuned model (BERT-class, ~110M params) or a simple feature-based heuristic (prompt token count, presence of code blocks, sentence complexity score). The latter is simpler and adds virtually zero latency. A fine-tuned classifier adds ~5ms but achieves 90%+ accuracy on complexity classification, which is necessary for use cases where routing errors cause quality regression.
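A feature-based heuristic needs only a few lines. This sketch uses the document's tiers; the thresholds and code-detection markers are illustrative starting points to tune against your own eval set:

```java
// Heuristic complexity classifier: token count, code markers, and sentence
// count map a prompt onto one of the four routing tiers.
public class ComplexityRouter {
    public enum Tier { SIMPLE, MODERATE, COMPLEX, EXPERT }

    public static Tier classify(String prompt) {
        int approxTokens = prompt.length() / 4;   // ~4 chars per token heuristic
        boolean hasCode = prompt.contains("```")
                || prompt.contains("public class") || prompt.contains("def ");
        long sentences = prompt.chars()
                .filter(c -> c == '.' || c == '?' || c == '!').count();

        if (approxTokens > 2000 || (hasCode && approxTokens > 800)) return Tier.EXPERT;
        if (hasCode || approxTokens > 500) return Tier.COMPLEX;
        if (approxTokens > 100 || sentences > 3) return Tier.MODERATE;
        return Tier.SIMPLE;
    }

    public static String modelFor(Tier tier) {
        return switch (tier) {
            case SIMPLE   -> "gpt-4o-mini";
            case MODERATE -> "claude-haiku";
            case COMPLEX  -> "gpt-4o";
            case EXPERT   -> "claude-opus";
        };
    }
}
```

Because classification happens before the LLM call, it adds effectively zero latency; swapping in a fine-tuned BERT-class classifier only changes the `classify` implementation.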
8. LiteLLM Proxy Setup & Java Integration
LiteLLM is an open-source proxy that exposes an OpenAI-compatible REST API and handles the provider-specific translation, fallback, semantic caching, and basic rate limiting under the hood. It is the fastest path to a production LLM gateway for most teams and integrates natively with Spring AI's OpenAiChatModel.
LiteLLM Config YAML
# litellm-config.yaml
model_list:
- model_name: gpt-4o # logical name used by Spring AI
litellm_params:
model: openai/gpt-4o
api_key: os.environ/OPENAI_API_KEY
timeout: 30
max_retries: 0 # gateway handles retries, not the model
model_info:
cost_per_input_token: 0.000005
cost_per_output_token: 0.000015
  - model_name: claude-3-5-sonnet   # first fallback (distinct name, targeted by fallbacks below)
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
      timeout: 30
      max_retries: 0
  - model_name: gemini-1-5-pro      # second fallback
    litellm_params:
      model: gemini/gemini-1.5-pro-latest
      api_key: os.environ/GOOGLE_API_KEY
      timeout: 30
      max_retries: 0
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
      timeout: 15
router_settings:
  routing_strategy: least-busy # alternatives: simple-shuffle, latency-based
  retry_policy:
    BadRequestError: { num_retries: 0 }
    AuthenticationError: { num_retries: 0 }
    TimeoutError: { num_retries: 2 }
    RateLimitError: { num_retries: 3 }
    ServiceUnavailableError: { num_retries: 3 }
  fallbacks:
    - { "gpt-4o": ["claude-3-5-sonnet", "gemini-1-5-pro"] }
litellm_settings:
success_callback: ["langfuse"]
failure_callback: ["langfuse"]
cache: true
  cache_params:
    type: redis-semantic          # semantic (vector) cache; plain "redis" is exact-match only
    host: redis
    port: 6379
    password: os.environ/REDIS_PASSWORD
    ttl: 86400                    # 24-hour TTL
    similarity_threshold: 0.93    # semantic cache threshold
general_settings:
master_key: os.environ/LITELLM_MASTER_KEY
alerting: ["slack"]
alerting_threshold: 120 # alert if any request > 120s
spend_logs: true
Docker Compose Setup
# docker-compose.yml — LiteLLM + Redis + PostgreSQL
services:
litellm:
image: ghcr.io/berriai/litellm:main-latest
ports:
- "4000:4000"
volumes:
- ./litellm-config.yaml:/app/config.yaml
environment:
OPENAI_API_KEY: ${OPENAI_API_KEY}
ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
GOOGLE_API_KEY: ${GOOGLE_API_KEY}
REDIS_PASSWORD: ${REDIS_PASSWORD}
LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
LANGFUSE_PUBLIC_KEY: ${LANGFUSE_PUBLIC_KEY}
LANGFUSE_SECRET_KEY: ${LANGFUSE_SECRET_KEY}
DATABASE_URL: postgresql://litellm:${POSTGRES_PASSWORD}@postgres:5432/litellm
command: --config /app/config.yaml --port 4000 --num_workers 4
depends_on:
- redis
- postgres
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:4000/health/liveliness"]
interval: 30s
timeout: 10s
retries: 3
restart: unless-stopped
redis:
image: redis/redis-stack:latest
ports:
- "6379:6379"
- "8001:8001" # RedisInsight UI
environment:
REDIS_ARGS: "--requirepass ${REDIS_PASSWORD}"
volumes:
- redis-data:/data
restart: unless-stopped
postgres:
image: postgres:16-alpine
environment:
POSTGRES_DB: litellm
POSTGRES_USER: litellm
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
volumes:
- postgres-data:/var/lib/postgresql/data
restart: unless-stopped
volumes:
redis-data:
postgres-data:
Spring AI Pointing to LiteLLM Proxy
# application.yml — Spring AI → LiteLLM proxy
spring:
ai:
openai:
api-key: ${LITELLM_VIRTUAL_KEY} # Virtual key issued by LiteLLM
base-url: http://litellm:4000 # LiteLLM proxy endpoint
chat:
options:
model: gpt-4o # Logical model name from litellm-config.yaml
temperature: 0.7
max-tokens: 2048
# The Spring AI application code is completely unchanged.
# LiteLLM handles: fallbacks, caching, cost tracking, routing, PII.
// ChatService.java — unchanged from single-provider implementation
@Service
@RequiredArgsConstructor
public class ChatService {
private final ChatClient chatClient;
public String ask(String question) {
return chatClient.prompt()
.user(question)
.call()
.content();
}
public Flux<String> streamAnswer(String question) {
return chatClient.prompt()
.user(question)
.stream()
.content();
}
}
This is the key architectural win: by routing Spring AI through LiteLLM, you gain provider fallbacks, semantic caching, cost tracking, and model routing without writing a single additional line of application code. The LiteLLM admin UI (port 4000) provides a real-time spend dashboard, request logs, and virtual key management.
9. Observability: Tracing, Metrics & Alerts
An LLM gateway without observability is a black box. You need to know: which prompts are slow, which are expensive, which are failing, and why. A complete observability stack for an LLM gateway combines distributed tracing, custom metrics, and structured logging.
LangFuse / LangSmith Integration
LangFuse (open source) and LangSmith (commercial) are purpose-built LLM observability platforms. LiteLLM integrates with both via a success/failure callback. Every LLM request generates a trace that includes: prompt tokens, completion tokens, latency, model used, cost, cache hit/miss, and the full prompt and response for debugging. In production, enable prompt/response logging only for sampled requests (e.g., 10%) to manage storage costs and PII risk.
OpenTelemetry Traces
LiteLLM exports OpenTelemetry spans for each request. These integrate with your existing distributed tracing infrastructure (Jaeger, Tempo, AWS X-Ray). Each LLM gateway span contains: provider name, model name, input/output token counts, latency breakdown (queue time, provider TTFB, streaming duration), and fallback events. Spring AI's observability support adds parent spans from the application, so you can see the full call chain from HTTP request → application logic → LLM gateway → provider.
Key Metrics to Alert On
- llm_gateway_error_rate > 1% over 5 minutes → PagerDuty alert
- llm_gateway_p99_latency > 15s → Warning; >30s → Critical
- llm_fallback_rate > 5% → Provider degradation; investigate primary provider
- llm_cache_hit_rate < 15% → Cache threshold may need tuning, or unusual traffic pattern
- llm_cost_usd_per_hour > $50 → Possible runaway loop; check for agent recursion
- llm_pii_detected_count > 100/min → Data quality issue; check upstream application
- llm_budget_utilisation > 80% for any tenant → Proactive notification to team lead
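The first few bullets translate into Prometheus alerting rules along these lines. This is a sketch: it assumes your gateway exports gauges under exactly these metric names and expresses rates as fractions rather than percentages.

```yaml
# llm-gateway-alerts.yaml — Prometheus rule file (metric names assumed from above)
groups:
  - name: llm-gateway
    rules:
      - alert: LlmGatewayHighErrorRate
        expr: llm_gateway_error_rate > 0.01
        for: 5m
        labels: { severity: page }
        annotations:
          summary: "LLM gateway error rate above 1% for 5 minutes"
      - alert: LlmFallbackRateHigh
        expr: llm_fallback_rate > 0.05
        for: 10m
        labels: { severity: warn }
        annotations:
          summary: "Over 5% of requests falling back: primary provider degraded"
      - alert: LlmHourlySpendRunaway
        expr: llm_cost_usd_per_hour > 50
        for: 5m
        labels: { severity: page }
        annotations:
          summary: "Hourly spend above $50: check for agent recursion"
```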
Grafana Dashboard for LLM Ops
A well-structured Grafana dashboard for LLM operations should have four rows: (1) Health — success rate, error rate, fallback rate by provider; (2) Latency — p50/p95/p99 by model with time-of-day heatmap; (3) Cost — hourly spend, daily burn rate vs budget, spend by tenant/model; (4) Cache — hit rate trend, cache size, eviction rate. LiteLLM exports all underlying metrics to Prometheus; the Grafana dashboard can be bootstrapped from the official LiteLLM dashboard template in Grafana Labs.
10. Production Patterns & Anti-Patterns
After operating LLM gateways at scale, a set of patterns and anti-patterns emerge. Knowing them before they become incidents saves significant operational pain.
Proven Production Patterns
Multi-Region Gateway Deployment
Deploy the LLM gateway in every region where your application runs. A gateway in us-east-1 routing requests from eu-west-1 users adds 80–120ms of unnecessary latency and creates a transatlantic SPOF. Use a global load balancer (AWS Global Accelerator, Cloudflare) to route users to the nearest gateway instance. Each regional gateway maintains its own Redis cache, so cache entries are warm in each region independently.
Canary Model Rollout
When introducing a new model version (e.g., migrating from GPT-4o-2024-11 to GPT-4o-2025-04), use the gateway's traffic splitting to send 5% of requests to the new model. Run your automated LLM evaluation suite against both versions in parallel. Gradually ramp to 100% once quality metrics show no regression. Never migrate the entire fleet to a new model version in one cut-over.
Shadow Traffic Testing
Before promoting a new routing rule or model configuration to production, mirror 10% of live traffic to the shadow configuration. Compare response quality, latency, and cost between the live and shadow configurations using automated evaluation. This catches regressions that synthetic test suites miss because they don't replicate the exact distribution of real production queries.
Common Anti-Patterns to Avoid
No Timeout on Gateway Requests
The most common production incident. Without a hard timeout on the gateway (separate from the provider's own timeout), a slow provider response occupies a thread indefinitely. During a provider degradation event, 100 concurrent requests each holding a thread for 60+ seconds will exhaust your thread pool and cause a cascading failure across all gateway traffic. Set a gateway-level timeout of 30–45s for non-streaming requests and ensure it fires before any upstream connection timeout.
No Cache TTL (or Infinite TTL)
A semantic cache without TTL serves stale answers indefinitely. If your knowledge base changes (e.g., product pricing, policy documents, code API), cached completions will return outdated information with no indication that anything is wrong. Always set TTL based on how frequently your underlying data changes. For real-time systems, TTL should be minutes. For stable reference content, 24–48 hours is safe.
Caching Responses with Dynamic Context
Semantic cache keys should be derived from the user prompt only. If your application injects dynamic context into the system prompt (user profile, current date, inventory state), the same user question will have different optimal answers for different contexts. Include stable contextual dimensions (e.g., user role, language) in the cache key, but exclude high-cardinality or real-time injected data, or disable caching entirely for those prompt templates.
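One way to apply this rule is to partition the cache namespace by the stable dimensions, so similarity lookups only ever search entries from the same role and language. A sketch (class and method names are illustrative; the hash merely gives each entry a deterministic ID):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Cache keys are built from the user prompt plus stable context dimensions only.
// Volatile context (current date, inventory state) is deliberately excluded.
public class CacheKeyBuilder {
    /** Stable dimensions partition the cache; KNN lookups search one namespace. */
    public static String namespace(String userRole, String language) {
        return "llm:cache:" + userRole + ":" + language + ":";
    }

    public static String keyFor(String userRole, String language, String prompt) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(prompt.getBytes(StandardCharsets.UTF_8));
            return namespace(userRole, language) + HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```

Prompt templates that inject real-time data should bypass this path entirely and call the provider directly.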
11. Conclusion & Implementation Checklist
An LLM gateway is not optional for any production AI system serving real users. Provider outages, unchecked token costs, and PII leaks are all "when, not if" events at scale. The good news is that open-source solutions like LiteLLM make it possible to deploy a production-grade gateway in a single afternoon, and Spring AI's OpenAI-compatible client means your Java application code requires zero changes.
The compounding effect is significant: semantic caching (30–50% cost reduction) + intelligent routing (50–80% cost reduction on routed traffic) + provider fallbacks (99.9%+ availability) together transform what is often an unpredictable, fragile, and expensive component into a reliable, cost-predictable platform service.
Production LLM Gateway Checklist
- ☐ Deploy LiteLLM proxy (or equivalent) as a dedicated gateway service with at least 2 replicas
- ☐ Configure provider fallback chain covering at least 2 independent providers
- ☐ Add circuit breakers (Resilience4j) with per-model error rate thresholds
- ☐ Enable Redis semantic caching with cosine similarity threshold tuned per use case
- ☐ Set cache TTL based on knowledge base update frequency (never use no-TTL)
- ☐ Implement per-tenant token budgets with 429 enforcement and Retry-After headers
- ☐ Deploy PII redaction filter (regex + NER) on all outbound LLM requests
- ☐ Write PII audit log to immutable store with 12-month minimum retention
- ☐ Enable cost tracking per request with Prometheus metrics and Grafana dashboard
- ☐ Set spend alerts at 50%, 80%, and 95% of monthly budget per tenant
- ☐ Configure intelligent model routing with complexity classifier (or heuristic)
- ☐ Set gateway-level timeouts (30–45s) independently of provider timeouts
Start with steps 1–3 (gateway + fallbacks + circuit breakers) for immediate availability improvement. Add semantic caching (step 4–5) next for the biggest cost win. PII redaction and budget enforcement (steps 6–10) are compliance-critical for any user-facing application. Model routing and tuning (steps 11–12) are the optimisation layer that you add once the foundation is solid.