Engineering

Microservices Patterns for Modern AI Applications

Breaking a Monolith Into 47 Services, Then Spending 18 Months Putting It Back Together

It's a common story in the industry. An initial microservices architecture becomes a distributed monolith: 47 services that all need to be deployed together, can't function independently, and turn every feature release into a coordination nightmare. Latency triples. Debugging takes hours instead of minutes. Teams spend more time on infrastructure than on features.

The lesson isn't that microservices are bad. It's that microservices without discipline are worse than a well-structured monolith. After 18 months of consolidation and redesign, platforms arrive at 12 services that actually work. Here are the key lessons.

When Microservices Make Sense (And When They Don't)

The Microservices Decision Framework

Before splitting anything, ask:

1. Do you have independent scaling requirements?
If your AI inference service needs 100 GPUs while your user authentication needs 2 CPUs, separate services make sense. If everything scales together, you're adding complexity without benefit.

2. Do different teams own different domains?
Microservices enable team autonomy. If a single team owns everything, you're creating communication overhead with no organizational benefit.

3. Do you need independent deployability?
If you genuinely need to deploy the payment service without touching the notification service, microservices help. If you always deploy everything together anyway, you have a distributed monolith.

4. Are your domain boundaries clear and stable?
Splitting on uncertain boundaries means constant refactoring as you learn your domain. Start with a modular monolith; extract services as boundaries stabilize.

What We Should Have Done

Phase 1 (Month 0-6): Modular Monolith

  • Single deployable with clearly separated modules

  • Each module with its own database schema

  • Internal APIs between modules, but single transaction context

  • Team ownership at the module level

Phase 2 (Month 6-12): Strategic Extraction

  • Identify the module with the clearest boundary and most independent scaling need

  • Extract to a service, keep everything else monolithic

  • Learn operational patterns with low blast radius

Phase 3 (Ongoing): Evolutionary Architecture

  • Extract additional services only when specific benefits are clear

  • Consolidate services that didn't earn their complexity

Essential Microservices Patterns for AI Applications

Pattern 1: API Gateway with AI-Specific Routing

AI workloads have unique routing needs: model selection, fallback handling, and quota enforcement must happen before requests reach backend services.

Architecture:

┌──────────────────────────────────────────────┐
│                 API Gateway                  │
│ ┌────────────┬────────────┬────────────────┐ │
│ │   Auth &   │    Rate    │    AI Model    │ │
│ │   Quota    │  Limiting  │     Router     │ │
│ └────────────┴────────────┴────────────────┘ │
└──────────────────────┬───────────────────────┘
                       │
       ┌───────────────┼───────────────┐
       │               │               │
       ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│    Chat     │ │    Image    │ │ Embeddings  │
│   Service   │ │   Service   │ │   Service   │
└─────────────┘ └─────────────┘ └─────────────┘

Key gateway responsibilities:

  • Validate API keys and check credit balances before forwarding

  • Route to appropriate service based on request type

  • Implement circuit breakers for downstream services

  • Collect cross-cutting metrics (latency, error rates, token usage)

Our implementation: We use Kong Gateway with custom plugins for AI-specific logic. The model router plugin examines the request body, determines the optimal backend based on the requested model and current system health, and injects routing headers.
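
Kong plugins are written in Lua, but the router's decision logic can be sketched in Python. The backend registry, fallback table, and header names below are hypothetical stand-ins for illustration, not our actual configuration:

```python
import random

# Hypothetical backend registry; in production this would be fed by
# active health checks, not a static dict.
BACKENDS = {
    "gpt-4o": [
        {"url": "http://chat-svc-a:8080", "healthy": True},
        {"url": "http://chat-svc-b:8080", "healthy": False},
    ],
    "dall-e-3": [{"url": "http://image-svc:8080", "healthy": True}],
}

# Hypothetical fallback chain: degrade to a cheaper model when the
# requested one has no healthy backend.
FALLBACKS = {"gpt-4o": "gpt-4o-mini"}

def route(model: str) -> dict:
    """Pick a healthy backend for the requested model and return the
    routing headers to inject; fall back when none is available."""
    healthy = [b for b in BACKENDS.get(model, []) if b["healthy"]]
    if healthy:
        backend = random.choice(healthy)
        return {"X-Upstream": backend["url"], "X-Model": model}
    fallback = FALLBACKS.get(model)
    if fallback is not None:
        return route(fallback)
    raise LookupError(f"no healthy backend for {model}")
```

The real plugin also consults current load, not just health, but the shape is the same: inspect the request, consult state, inject headers.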

Pattern 2: Saga Pattern for Multi-Model Workflows

Complex AI workflows often involve multiple models in sequence. What happens when step 3 of 5 fails?

Saga orchestration approach:

Workflow: Document Analysis Pipeline
Step 1: Extract text (OCR Service) → On failure: return error
Step 2: Summarize (LLM Service) → On failure: return partial result
Step 3: Classify (ML Service) → On failure: use default classification
Step 4: Store results (Storage Service) → On failure: queue for retry
Step 5: Notify (Notification Service) → On failure: log, don't block

Compensation strategies:

  • Forward recovery: Retry with exponential backoff

  • Fallback: Use a cheaper/faster alternative model

  • Partial completion: Return results from completed steps

  • Full rollback: Undo all steps (rarely needed for read-heavy AI workloads)

Implementation tip: Use a workflow engine (Temporal, Cadence, or even a simple state machine) rather than hardcoding saga logic. Explicit state management makes debugging and recovery vastly easier.
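
As a sketch of that advice, here is a minimal state-machine saga executor in Python. The `Step` type and the failure policies mirror the compensation strategies above; this is an illustration of the pattern, not production Temporal or Cadence code:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], Any]
    on_failure: str          # "abort" | "partial" | "default" | "retry_queue"
    default: Any = None      # used when on_failure == "default"

def run_saga(steps: list[Step], ctx: dict) -> dict:
    """Execute steps in order, applying each step's compensation
    strategy on failure instead of hardcoded branching."""
    results: dict = {}
    for step in steps:
        try:
            results[step.name] = step.run(ctx)
        except Exception:
            if step.on_failure == "abort":
                return {"status": "failed", "failed_step": step.name,
                        "results": results}
            if step.on_failure == "partial":
                return {"status": "partial", "results": results}
            if step.on_failure == "default":
                results[step.name] = step.default
            elif step.on_failure == "retry_queue":
                ctx.setdefault("retry_queue", []).append(step.name)
    return {"status": "completed", "results": results}
```

A workflow engine adds durable state, timers, and retries on top of exactly this loop, which is why the explicit-state version is so much easier to debug than saga logic scattered across services.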

Pattern 3: Sidecar Pattern for Model Inference

Deploy inference containers as sidecars to your application containers, not as separate services.

Why:

  • Eliminates network latency for inference calls

  • Simplifies service discovery (always localhost)

  • Enables tight resource coupling (scale pods, not services)

Architecture:

# Kubernetes pod spec
spec:
  containers:
  - name: application
    image: myapp:latest
    ports:
    - containerPort: 8080

  - name: inference-sidecar
    image: llama-inference:latest
    ports:
    - containerPort: 8081  # Only accessible within pod
    resources:
      limits:
        nvidia.com/gpu: 1

When to use: Low-latency inference requirements, data locality needs (large context that's expensive to transfer), or regulated environments where data shouldn't leave the pod.

When not to use: Shared model serving across many applications (use a dedicated model service instead), or when pods would become too resource-heavy.
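
From the application container's side, calling the sidecar reduces to a plain loopback HTTP request, with no service discovery involved. The `/v1/generate` endpoint and response shape below are hypothetical:

```python
import json
import urllib.request

# Hypothetical sidecar endpoint; always reachable on the pod's loopback.
SIDECAR_URL = "http://127.0.0.1:8081/v1/generate"

def build_payload(prompt: str) -> bytes:
    """Serialize the request body sent to the inference sidecar."""
    return json.dumps({"prompt": prompt}).encode()

def infer(prompt: str, timeout: float = 30.0) -> str:
    """Call the co-located inference container over localhost.
    No service discovery, no cross-node network hop."""
    req = urllib.request.Request(
        SIDECAR_URL,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["text"]
```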

Pattern 4: Event-Driven AI Processing

Decouple AI processing from synchronous request/response where possible.

Event flow:

User Request → API → Message Queue → AI Workers → Results Store → Notification
                           ↑                │
                           └── Retry Queue ←┘

Message schema for AI tasks:

{
  "task_id": "uuid",
  "task_type": "image_generation",
  "priority": "standard",
  "parameters": {
    "model": "dall-e-3",
    "prompt": "...",
    "size": "1024x1024"
  },
  "callback_url": "https://...",
  "max_retries": 3,
  "timeout_seconds": 120,
  "created_at": "2024-12-15T10:30:00Z"
}

Benefits:

  • Natural backpressure handling (queue depth = load indicator)

  • Retry and dead-letter queue patterns built-in

  • Easier horizontal scaling (add workers as needed)

  • Decoupled failure domains

Pattern 5: Service Mesh for AI Observability

AI services need observability beyond standard HTTP metrics. A service mesh like Istio or Linkerd provides the foundation; custom metrics provide AI-specific insights.

Essential AI metrics to collect:

  • Token usage per model per customer

  • Model latency distribution (P50, P95, P99)

  • Cache hit rates for repeated prompts

  • Error rates by error type (rate limit, context length, content filter)

  • Cost per request (calculated from token usage)

Distributed tracing setup:
Every AI request should carry a trace ID through:

  • API Gateway (trace created)

  • Routing decision (span: model selection)

  • Provider call (span: inference, includes model and token count)

  • Post-processing (span: output formatting)

  • Response (trace completed)

Our tracing annotations:

@tracer.span("ai_inference")
async def call_model(request: InferenceRequest):
    span = tracer.current_span()
    span.set_attribute("model", request.model)
    span.set_attribute("input_tokens", count_tokens(request.prompt))

    response = await provider.complete(request)

    span.set_attribute("output_tokens", count_tokens(response.text))
    span.set_attribute("provider_latency_ms", response.latency_ms)
    return response

Anti-Patterns We Learned the Hard Way

Anti-Pattern 1: Chatty Services

What we did wrong: Our "user service" made 7 API calls to other services to construct a user profile.

The problem: 7 sequential calls = latency multiplication. One slow service = everything slow.

The fix: Denormalize read models. The user service stores everything it needs to answer queries without calling other services. Updates propagate via events.
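
A minimal sketch of that read model, with hypothetical event names: the user service applies incoming events to local state and answers profile queries without a single cross-service call.

```python
# In-memory stand-in for the user service's denormalized read model.
# In production this would be the service's own database table.
profiles: dict[str, dict] = {}

def handle_event(event: dict) -> None:
    """Apply a change event published by another service to local state."""
    profile = profiles.setdefault(event["user_id"], {"user_id": event["user_id"]})
    if event["type"] == "credits.updated":
        profile["credit_balance"] = event["balance"]
    elif event["type"] == "subscription.changed":
        profile["plan"] = event["plan"]

def get_profile(user_id: str) -> dict:
    """Answer profile queries from local state: zero downstream calls,
    so no latency multiplication and no cascading slowness."""
    return profiles.get(user_id, {})
```

The trade-off is eventual consistency: the profile may briefly lag the source service, which is almost always acceptable for read models like this.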

Anti-Pattern 2: Shared Database

What we did wrong: Three services all accessed the same PostgreSQL database "for simplicity."

The problem: Schema changes required coordinating three teams. One service's heavy query load affected all three. We had coupling without the benefits.

The fix: Database-per-service, no exceptions. Services communicate via APIs or events, never via shared database access.

Anti-Pattern 3: Synchronous Everything

What we did wrong: Every service call was synchronous HTTP. A chain of 5 services meant the slowest service determined overall latency.

The problem: Cascading failures. When the slowest service degraded, everything degraded.

The fix: Async by default. If a service doesn't need an immediate response, publish an event. Use synchronous calls only when the caller genuinely needs to wait.

Anti-Pattern 4: Under-Instrumented Services

What we did wrong: Deployed services without comprehensive metrics. "We'll add observability later."

The problem: When things broke, we couldn't diagnose why. Hours spent adding instrumentation during incidents.

The fix: Observability is not optional. Every service ships with: health endpoints, structured logging, metrics export, and trace propagation. No exceptions.

The 12-Service Architecture That Works

After consolidation, our AI platform runs on 12 services:

  • Gateway - Authentication, routing, rate limiting

  • Chat - Conversational AI workloads

  • Completion - Single-turn text generation

  • Embeddings - Vector generation

  • Images - Image generation and analysis

  • Audio - Speech-to-text, text-to-speech

  • Credits - Usage tracking and billing

  • Models - Model registry and health monitoring

  • Jobs - Async task processing

  • Storage - File and result storage

  • Events - Event bus and notifications

  • Admin - Internal tooling and dashboards

Each service has clear ownership, independent scaling characteristics, and can be deployed without affecting others. This wasn't our starting point; it's where we arrived after learning what boundaries actually matter.

The best microservices architecture is the one you can operate. Start simpler than you think you need, add complexity only when you've earned it, and always remember: distributed systems are hard, and every network call is a potential failure point.

About the Author

Ricardo Mendes

Co-founder of AIOBI. Computer Engineer with experience in data analysis, software, and financial management.