Breaking a Monolith Into 47 Services, Then Spending 18 Months Putting It Back Together
A common story in the industry, and ours too. Our initial microservices architecture became a distributed monolith: 47 services that all had to be deployed together, couldn't function independently, and turned every feature release into a coordination nightmare. Latency tripled. Debugging took hours instead of minutes. Teams spent more time on infrastructure than on features.
The lesson isn't that microservices are bad. It's that microservices without discipline are worse than a well-structured monolith. After 18 months of consolidation and redesign, we arrived at 12 services that actually work. Here are the key lessons.
When Microservices Make Sense (And When They Don't)
The Microservices Decision Framework
Before splitting anything, ask:
1. Do you have independent scaling requirements?
If your AI inference service needs 100 GPUs while your user authentication needs 2 CPUs, separate services make sense. If everything scales together, you're adding complexity without benefit.
2. Do different teams own different domains?
Microservices enable team autonomy. If a single team owns everything, you're creating communication overhead with no organizational benefit.
3. Do you need independent deployability?
If you genuinely need to deploy the payment service without touching the notification service, microservices help. If you always deploy everything together anyway, you have a distributed monolith.
4. Are your domain boundaries clear and stable?
Splitting on uncertain boundaries means constant refactoring as you learn your domain. Start with a modular monolith; extract services as boundaries stabilize.
What We Should Have Done
Phase 1 (Month 0-6): Modular Monolith
- Single deployable with clearly separated modules
- Each module with its own database schema
- Internal APIs between modules, but single transaction context
- Team ownership at the module level
Phase 2 (Month 6-12): Strategic Extraction
- Identify the module with the clearest boundary and most independent scaling need
- Extract to a service, keep everything else monolithic
- Learn operational patterns with low blast radius
Phase 3 (Ongoing): Evolutionary Architecture
- Extract additional services only when specific benefits are clear
- Consolidate services that didn't earn their complexity
Essential Microservices Patterns for AI Applications
Pattern 1: API Gateway with AI-Specific Routing
AI workloads have unique routing needs: model selection, fallback handling, and quota enforcement must happen before requests reach backend services.
Architecture:
┌─────────────────────────────────────────────────┐
│                   API Gateway                   │
│  ┌────────────┬─────────────┬────────────────┐  │
│  │   Auth &   │    Rate     │    AI Model    │  │
│  │   Quota    │  Limiting   │     Router     │  │
│  └────────────┴─────────────┴────────────────┘  │
└───────────────────────┬─────────────────────────┘
                        │
        ┌───────────────┼───────────────┐
        │               │               │
        ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│     Chat     │ │    Image     │ │  Embeddings  │
│   Service    │ │   Service    │ │   Service    │
└──────────────┘ └──────────────┘ └──────────────┘
Key gateway responsibilities:
- Validate API keys and check credit balances before forwarding
- Route to appropriate service based on request type
- Implement circuit breakers for downstream services
- Collect cross-cutting metrics (latency, error rates, token usage)
Our implementation: We use Kong Gateway with custom plugins for AI-specific logic. The model router plugin examines the request body, determines the optimal backend based on the requested model and current system health, and injects routing headers.
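In Python terms, the decision the router plugin makes might look like this sketch; the backend URLs, model names, and header names are hypothetical, and the real plugin runs as Lua inside Kong.

```python
# Hypothetical routing table: model name -> candidate backends in preference order.
BACKENDS = {
    "gpt-4o": [
        {"url": "http://chat-primary:8080", "healthy": True},
        {"url": "http://chat-fallback:8080", "healthy": True},
    ],
    "text-embedding-3-small": [
        {"url": "http://embeddings:8080", "healthy": True},
    ],
}

def route_request(body: dict) -> dict:
    """Return routing headers for the first healthy backend serving the requested model."""
    model = body.get("model")
    candidates = BACKENDS.get(model)
    if not candidates:
        raise ValueError(f"unknown model: {model}")
    for backend in candidates:
        if backend["healthy"]:
            # Headers the gateway injects so downstream services skip re-routing.
            return {"X-Backend-Url": backend["url"], "X-Model": model}
    raise RuntimeError(f"no healthy backend for model: {model}")
```

In the real plugin, "healthy" comes from the gateway's health checks rather than a static flag, but the preference-order fallback is the same idea.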
Pattern 2: Saga Pattern for Multi-Model Workflows
Complex AI workflows often involve multiple models in sequence. What happens when step 3 of 5 fails?
Saga orchestration approach:
Workflow: Document Analysis Pipeline
Step 1: Extract text (OCR Service) → On failure: return error
Step 2: Summarize (LLM Service) → On failure: return partial result
Step 3: Classify (ML Service) → On failure: use default classification
Step 4: Store results (Storage Service) → On failure: queue for retry
Step 5: Notify (Notification Service) → On failure: log, don't block
Compensation strategies:
- Forward recovery: Retry with exponential backoff
- Fallback: Use a cheaper/faster alternative model
- Partial completion: Return results from completed steps
- Full rollback: Undo all steps (rarely needed for read-heavy AI workloads)
Implementation tip: Use a workflow engine (Temporal, Cadence, or even a simple state machine) rather than hardcoding saga logic. Explicit state management makes debugging and recovery vastly easier.
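As a rough illustration of explicit state management, the pipeline above can be modeled as a list of steps, each tagged with its compensation strategy. The step handlers and strategy names here are placeholders, not production workflow code; a real system would use Temporal or Cadence as noted.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]
    on_failure: str  # "abort", "partial", "default", "retry_queue", or "log"

def run_saga(steps: list[Step], ctx: dict) -> dict:
    """Run steps in order, applying each step's compensation strategy on failure."""
    completed = []
    for step in steps:
        try:
            ctx = step.run(ctx)
            completed.append(step.name)
        except Exception as exc:
            if step.on_failure == "abort":
                # Full stop: surface the failure to the caller (Step 1 above).
                return {"status": "failed", "failed_step": step.name, "completed": completed}
            if step.on_failure == "partial":
                # Return what the completed steps produced (Step 2 above).
                return {"status": "partial", "completed": completed, "result": ctx}
            # "default", "retry_queue", "log": record the error and keep going.
            ctx.setdefault("errors", []).append({"step": step.name, "error": str(exc)})
    return {"status": "ok", "completed": completed, "result": ctx}
```

Because the saga's state is an explicit list of completed steps and recorded errors, a stuck workflow can be inspected and resumed instead of reverse-engineered from logs.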
Pattern 3: Sidecar Pattern for Model Inference
Deploy inference containers as sidecars to your application containers, not as separate services.
Why:
- Eliminates network latency for inference calls
- Simplifies service discovery (always localhost)
- Enables tight resource coupling (scale pods, not services)
Architecture:
# Kubernetes pod spec
spec:
  containers:
  - name: application
    image: myapp:latest
    ports:
    - containerPort: 8080
  - name: inference-sidecar
    image: llama-inference:latest
    ports:
    - containerPort: 8081  # Only accessible within pod
    resources:
      limits:
        nvidia.com/gpu: 1
When to use: Low-latency inference requirements, data locality needs (large context that's expensive to transfer), or regulated environments where data shouldn't leave the pod.
When not to use: Shared model serving across many applications (use a dedicated model service instead), or when pods would become too resource-heavy.
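Calling the sidecar from the application container is then an ordinary localhost HTTP call. A minimal sketch, assuming the sidecar exposes a hypothetical /v1/completions endpoint on the port from the pod spec above:

```python
import json
import urllib.request

# Port matches the sidecar's containerPort; the endpoint path and
# request shape are assumptions, not a specific inference server's API.
SIDECAR_URL = "http://localhost:8081/v1/completions"

def build_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build the HTTP request for the in-pod inference sidecar."""
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    return urllib.request.Request(
        SIDECAR_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def infer(prompt: str, timeout: float = 30.0) -> dict:
    """Latency here is a localhost hop inside the pod, not a cross-service network call."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.load(resp)
```

Note there is no service discovery, mesh hop, or TLS between containers; the pod boundary is the trust and latency boundary.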
Pattern 4: Event-Driven AI Processing
Decouple AI processing from synchronous request/response where possible.
Event flow:
User Request → API → Message Queue → AI Workers → Results Store → Notification
                          ↑               │
                          └─ Retry Queue ─┘
Message schema for AI tasks:
{
  "task_id": "uuid",
  "task_type": "image_generation",
  "priority": "standard",
  "parameters": {
    "model": "dall-e-3",
    "prompt": "...",
    "size": "1024x1024"
  },
  "callback_url": "https://...",
  "max_retries": 3,
  "timeout_seconds": 120,
  "created_at": "2024-12-15T10:30:00Z"
}
Benefits:
- Natural backpressure handling (queue depth = load indicator)
- Retry and dead-letter queue patterns built-in
- Easier horizontal scaling (add workers as needed)
- Decoupled failure domains
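A toy in-memory version of the producer and worker makes the retry and dead-letter flow concrete. A real deployment would use a broker (RabbitMQ, SQS, etc.); the field names mirror the message schema above, and the queue and handler here are stand-ins.

```python
import datetime
import queue
import uuid

tasks: "queue.Queue[dict]" = queue.Queue()  # stand-in for the message broker
dead_letter: list[dict] = []                # stand-in for the dead-letter queue

def enqueue(task_type: str, parameters: dict, max_retries: int = 3) -> str:
    """Producer side: publish a task message shaped like the schema above."""
    task_id = str(uuid.uuid4())
    tasks.put({
        "task_id": task_id,
        "task_type": task_type,
        "priority": "standard",
        "parameters": parameters,
        "attempts": 0,
        "max_retries": max_retries,
        "created_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    return task_id

def work_one(handler) -> None:
    """Worker side: process one task; requeue on failure until retries are exhausted."""
    task = tasks.get()
    try:
        handler(task)
    except Exception:
        task["attempts"] += 1
        if task["attempts"] <= task["max_retries"]:
            tasks.put(task)  # back onto the retry path
        else:
            dead_letter.append(task)  # retries exhausted: park for inspection
```

Queue depth doubles as the load signal: scale workers up when it grows, down when it drains.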
Pattern 5: Service Mesh for AI Observability
AI services need observability beyond standard HTTP metrics. A service mesh like Istio or Linkerd provides the foundation; custom metrics provide AI-specific insights.
Essential AI metrics to collect:
- Token usage per model per customer
- Model latency distribution (P50, P95, P99)
- Cache hit rates for repeated prompts
- Error rates by error type (rate limit, context length, content filter)
- Cost per request (calculated from token usage)
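The last metric is simple arithmetic over the first. A sketch, with illustrative prices rather than any provider's real rates:

```python
# Illustrative per-1K-token prices in USD; real provider rates differ and change.
PRICES = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}

def cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    """Derive per-request cost from the token-usage metrics already being collected."""
    price = PRICES[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]
```

Emitting this as a metric tagged by customer and model is what turns "token usage per model per customer" into a margin dashboard.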
Distributed tracing setup:
Every AI request should carry a trace ID through:
- API Gateway (trace created)
- Routing decision (span: model selection)
- Provider call (span: inference, includes model and token count)
- Post-processing (span: output formatting)
- Response (trace completed)
Our tracing annotations:
@tracer.span("ai_inference")
async def call_model(request: InferenceRequest):
    span = tracer.current_span()
    span.set_attribute("model", request.model)
    span.set_attribute("input_tokens", count_tokens(request.prompt))
    response = await provider.complete(request)
    span.set_attribute("output_tokens", count_tokens(response.text))
    span.set_attribute("provider_latency_ms", response.latency_ms)
    return response
Anti-Patterns We Learned the Hard Way
Anti-Pattern 1: Chatty Services
What we did wrong: Our "user service" made 7 API calls to other services to construct a user profile.
The problem: 7 sequential calls = latency multiplication. One slow service = everything slow.
The fix: Denormalize read models. The user service stores everything it needs to answer queries without calling other services. Updates propagate via events.
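A denormalized read model is just local state folded from events. A minimal sketch, with hypothetical event types and fields:

```python
# Local read model: one denormalized profile per user, owned by the user service.
profiles: dict[str, dict] = {}

def apply_event(event: dict) -> None:
    """Fold events published by other services into the local read model."""
    user_id = event["user_id"]
    profile = profiles.setdefault(user_id, {"user_id": user_id})
    if event["type"] == "credits.updated":
        profile["credit_balance"] = event["balance"]
    elif event["type"] == "subscription.changed":
        profile["plan"] = event["plan"]
    # Unknown event types are ignored; the read model only stores what it serves.

def get_profile(user_id: str) -> dict:
    """Answer the query entirely from local state: zero downstream calls."""
    return profiles.get(user_id, {})
```

The trade is eventual consistency for latency: a profile read is one local lookup instead of seven sequential RPCs, at the cost of being as fresh as the last event consumed.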
Anti-Pattern 2: Shared Database
What we did wrong: Three services all accessed the same PostgreSQL database "for simplicity."
The problem: Schema changes required coordinating three teams. One service's heavy query load affected all three. We had coupling without the benefits.
The fix: Database-per-service, no exceptions. Services communicate via APIs or events, never via shared database access.
Anti-Pattern 3: Synchronous Everything
What we did wrong: Every service call was synchronous HTTP. A chain of 5 services meant the slowest service determined overall latency.
The problem: Cascading failures. When the slowest service degraded, everything degraded.
The fix: Async by default. If a service doesn't need an immediate response, publish an event. Use synchronous calls only when the caller genuinely needs to wait.
Anti-Pattern 4: Under-Instrumented Services
What we did wrong: Deployed services without comprehensive metrics. "We'll add observability later."
The problem: When things broke, we couldn't diagnose why. Hours spent adding instrumentation during incidents.
The fix: Observability is not optional. Every service ships with: health endpoints, structured logging, metrics export, and trace propagation. No exceptions.
The 12-Service Architecture That Works
After consolidation, our AI platform runs on 12 services:
- Gateway - Authentication, routing, rate limiting
- Chat - Conversational AI workloads
- Completion - Single-turn text generation
- Embeddings - Vector generation
- Images - Image generation and analysis
- Audio - Speech-to-text, text-to-speech
- Credits - Usage tracking and billing
- Models - Model registry and health monitoring
- Jobs - Async task processing
- Storage - File and result storage
- Events - Event bus and notifications
- Admin - Internal tooling and dashboards
Each service has clear ownership, independent scaling characteristics, and can be deployed without affecting others. This wasn't our starting pointβit's where we arrived after learning what boundaries actually matter.
The best microservices architecture is the one you can operate. Start simpler than you think you need, add complexity only when you've earned it, and always remember: distributed systems are hard, and every network call is a potential failure point.