How 14 Minutes of Downtime Can Cost €47,000 in Refunds
Imagine this scenario: a SaaS platform's primary database reaches 100% CPU utilization. Within 14 minutes, the entire platform becomes unresponsive. During that window, 2,847 users lose access mid-workflow. AI model requests queue and time out. Credit transactions fail silently.
The root cause? A single customer running a bulk export of 340,000 records through an API endpoint that wasn't properly rate-limited. One tenant bringing down the entire platform.
An incident like this forces a complete architectural rethink. The patterns below show how our platform evolved to serve 3x traffic with 99.97% uptime after learning these lessons the hard way.
The Multi-Tenancy Decision Tree
Before writing a single line of code, you must decide how tenants share (or don't share) resources. This decision affects everything downstream.
Shared Everything (Database-Level Multi-Tenancy)
All tenants share the same database, distinguished by a tenant_id column on every table.
Advantages:
- Lowest infrastructure cost per tenant
- Simplest deployment model
- Easy cross-tenant analytics
Disadvantages:
- Noisy neighbor problem (one tenant affects all others)
- Complex data isolation (every query must filter by tenant)
- Regulatory challenges (some industries require physical separation)
When to use: Early-stage SaaS with budget constraints, homogeneous customer base, no strict compliance requirements.
Common mistake: many platforms start here, and when enterprise customers later ask for dedicated resources, they can't offer them without a major migration.
Shared Infrastructure, Isolated Databases
Each tenant gets their own database but shares application servers and infrastructure.
Advantages:
- Strong data isolation
- Per-tenant backup and restore
- Easier compliance story
Disadvantages:
- Connection pool management becomes complex
- Schema migrations must run across all databases
- Higher operational overhead
When to use: Regulated industries, customers with data residency requirements, B2B SaaS with contractual isolation requirements.
Fully Isolated (Single-Tenant Deployments)
Each tenant gets dedicated infrastructure: their own application instances, databases, and often their own Kubernetes namespace or cloud account.
Advantages:
- Complete isolation (performance, security, compliance)
- Per-tenant customization possible
- Clean security boundaries
Disadvantages:
- Highest cost per tenant
- Deployment complexity scales linearly with customers
- Upgrades become a coordination nightmare
When to use: Enterprise customers paying €10k+/month, government contracts, highly regulated industries.
Our Hybrid Approach
After the March incident, we implemented tiered isolation:
Standard tier: Shared database with row-level security. Tenant data filtered at the application layer and enforced at the database layer using PostgreSQL RLS policies.
Professional tier: Isolated database per tenant, shared application infrastructure. Connection pooling via PgBouncer with per-tenant pools.
Enterprise tier: Dedicated Kubernetes namespace with isolated compute, storage, and networking. Effectively single-tenant within shared infrastructure.
This hybrid model increased our infrastructure costs by 40% but eliminated cross-tenant performance impact entirely.
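As a rough sketch of how the routing layer consumes this tiering (the tier names match ours, but the settings object and fallback behavior here are illustrative, not our exact configuration):

```javascript
// Hypothetical mapping from pricing tier to isolation settings.
const ISOLATION_BY_TIER = {
  standard:     { database: 'shared',     rls: true,  dedicatedNamespace: false },
  professional: { database: 'per-tenant', rls: false, dedicatedNamespace: false },
  enterprise:   { database: 'per-tenant', rls: false, dedicatedNamespace: true },
};

// Resolve isolation settings for a tenant; an unknown tier falls back to
// the most restrictive shared profile rather than failing the request.
function isolationFor(tenant) {
  return ISOLATION_BY_TIER[tenant.tier] ?? ISOLATION_BY_TIER.standard;
}
```

Centralizing this lookup means the rest of the request path never branches on tier names directly, which keeps tier upgrades a data change rather than a code change.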
Designing for 100+ AI Models
The unique challenge of multi-model AI platforms: routing requests to 100+ AI model providers, each with different rate limits, pricing, latency characteristics, and failure modes.
The Gateway Pattern
Every AI request flows through a unified gateway that handles:
          ┌──────────────────┐
          │   API Gateway    │
          │  (Rate Limits)   │
          └────────┬─────────┘
                   │
          ┌────────┴─────────┐
          │  Router Service  │
          │ (Model Selection)│
          └────────┬─────────┘
                   │
     ┌─────────────┼─────────────┐
     │             │             │
┌────┴─────┐ ┌─────┴────┐ ┌──────┴─────┐
│  OpenAI  │ │Anthropic │ │   Google   │
│ Adapter  │ │ Adapter  │ │  Adapter   │
└──────────┘ └──────────┘ └────────────┘
Router Service logic:
- Check user's tier and credit balance
- Determine optimal model based on request type and user preference
- Check provider health scores (maintained by background monitors)
- Apply fallback rules if primary provider is degraded
- Route to appropriate adapter
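The steps above can be sketched as a small selection function. Everything here is illustrative (provider names, the health threshold, the shape of the objects); real routing also weighs request type and per-model pricing:

```javascript
// Illustrative router sketch; names, scores, and thresholds are assumptions.
const HEALTH_THRESHOLD = 0.8; // providers below this are treated as degraded

function routeRequest(user, request, providers) {
  // 1. Check credit balance before doing any routing work.
  if (user.creditBalance <= 0) {
    return { error: 'insufficient_credits' };
  }
  // 2. Start from the user's preferred provider, if any.
  const preferredName = request.preferredProvider ?? providers[0].name;
  const byName = new Map(providers.map((p) => [p.name, p]));
  // 3. Use the preferred provider when its health score is acceptable.
  const primary = byName.get(preferredName);
  if (primary && primary.healthScore >= HEALTH_THRESHOLD) {
    return { provider: primary.name };
  }
  // 4. Otherwise apply fallback rules: pick the healthiest remaining provider.
  const fallback = providers
    .filter((p) => p.healthScore >= HEALTH_THRESHOLD)
    .sort((a, b) => b.healthScore - a.healthScore)[0];
  // 5. Route to the chosen adapter, or fail if nothing is healthy.
  return fallback
    ? { provider: fallback.name, fellBack: true }
    : { error: 'no_healthy_provider' };
}
```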
Adapter responsibilities:
- Translate our unified API format to provider-specific format
- Handle authentication and credential rotation
- Implement circuit breaker patterns
- Collect latency and error metrics
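A minimal circuit breaker, of the kind each adapter wraps around its provider calls, might look like this sketch. The 3-failure threshold matches the behavior described later in this article; the cooldown value is an assumption:

```javascript
// Minimal circuit breaker sketch. Opens after N consecutive failures,
// then allows a trial request through once the cooldown has elapsed.
class CircuitBreaker {
  constructor(failureThreshold = 3, cooldownMs = 30_000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }
  // An open breaker rejects calls until the cooldown elapses.
  isOpen(now = Date.now()) {
    if (this.openedAt === null) return false;
    if (now - this.openedAt >= this.cooldownMs) {
      // Half-open: reset and let one trial request probe the provider.
      this.openedAt = null;
      this.consecutiveFailures = 0;
      return false;
    }
    return true;
  }
  recordSuccess() {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }
  recordFailure(now = Date.now()) {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAt = now;
    }
  }
}
```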
Credit System Architecture
Credits are the universal currency across all AI operations. The credit system must be:
- Accurate: Users trust their balance
- Fast: Cannot add latency to every request
- Consistent: No double-spending, no lost credits
Our implementation uses an event-sourced pattern:
-- Credit events are append-only
CREATE TABLE credit_events (
id UUID PRIMARY KEY,
tenant_id UUID NOT NULL,
event_type VARCHAR(50), -- 'purchase', 'usage', 'refund', 'adjustment'
amount DECIMAL(10, 4),
model_id VARCHAR(100),
request_id UUID,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Materialized balance for fast reads
CREATE MATERIALIZED VIEW credit_balances AS
SELECT
tenant_id,
SUM(CASE WHEN event_type IN ('purchase', 'refund') THEN amount ELSE -amount END) as balance
FROM credit_events
GROUP BY tenant_id;
For real-time balance checks, we maintain an in-memory cache (Redis) that's updated asynchronously after each transaction. The materialized view serves as the authoritative source, refreshed every 5 minutes and on-demand for disputed balances.
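The balance derivation mirrors the CASE expression in the materialized view: purchases and refunds add, everything else subtracts. A self-contained sketch of that fold over the event log (field names adapted from the table above):

```javascript
// Event types that add to the balance; all others subtract, matching the
// CASE expression in the credit_balances materialized view.
const CREDIT_EVENTS = new Set(['purchase', 'refund']);

function balanceFor(tenantId, events) {
  return events
    .filter((e) => e.tenantId === tenantId)
    .reduce(
      (sum, e) => sum + (CREDIT_EVENTS.has(e.eventType) ? e.amount : -e.amount),
      0
    );
}
```

Because events are append-only, this fold is deterministic and replayable, which is what makes the Redis cache safe to rebuild at any time.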
Handling Provider Outages
AI providers go down. OpenAI had 7 significant outages in 2024. Your architecture must handle this gracefully.
Our approach: Automatic fallback with user consent
Users configure fallback preferences:
- Allow automatic fallback to equivalent models (e.g., GPT-4 → Claude 3.5 Sonnet)
- Allow automatic fallback to lower-tier models (e.g., GPT-4 → GPT-4o-mini)
- No fallback (fail the request)
When the primary model is unavailable:
- Circuit breaker opens after 3 consecutive failures
- System checks user's fallback preferences
- Routes to fallback model if permitted
- Logs both original and fallback model for billing transparency
Credit adjustment: If a fallback model costs less, we credit the difference. If it costs more, we absorb the difference. This builds trust.
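Putting the consent check and the adjustment rule together in a sketch (preference values, model slots, and costs are illustrative, not our real price list):

```javascript
// Map the user's fallback preference to a concrete fallback model, or null
// if the user has opted out (in which case the request fails).
function resolveFallback(preference, equivalentModel, lowerTierModel) {
  switch (preference) {
    case 'equivalent': return equivalentModel;
    case 'lower-tier': return lowerTierModel;
    default:           return null; // no fallback: fail the request
  }
}

// If the fallback is cheaper, refund the difference to the user;
// if it is more expensive, the platform absorbs the difference.
function creditAdjustment(primaryCost, fallbackCost) {
  const diff = primaryCost - fallbackCost;
  return diff > 0 ? { refund: diff, absorbed: 0 } : { refund: 0, absorbed: -diff };
}
```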
Scaling Patterns That Actually Worked
Pattern 1: Queue Everything That Can Wait
Not all requests need synchronous processing. We identified three categories:
Synchronous (< 30s expected):
- Chat completions
- Simple generations
- Real-time analysis
Async-by-default (30s - 5min):
- Image generation
- Long document processing
- Batch translations
Background (> 5min):
- Fine-tuning jobs
- Bulk exports
- Report generation
Moving async workloads to queues (we use Amazon SQS with worker pods) reduced our API server load by 60% and improved P95 latency for synchronous requests from 4.2s to 890ms.
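The categorization can be reduced to a small dispatcher. The operation names below are examples standing in for our real catalog, not an exhaustive mapping:

```javascript
// Illustrative dispatcher for the three processing categories.
const SYNC_OPS = new Set(['chat.completion', 'generation.simple', 'analysis.realtime']);
const BACKGROUND_OPS = new Set(['fine_tune', 'bulk_export', 'report']);

function dispatch(op) {
  if (SYNC_OPS.has(op)) return 'sync';             // handled in-request
  if (BACKGROUND_OPS.has(op)) return 'background'; // long-running job queue
  return 'async';                                  // default: queue-backed workers
}
```

Making "async" the default (rather than "sync") means a newly added operation degrades to queued processing instead of tying up an API server.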
Pattern 2: Read Replicas for Analytics
User dashboards showing usage history, credit consumption, and model performance were hammering our primary database.
Solution: Dedicated read replica for all analytics queries. The 2-second replication lag is invisible for historical dashboards.
Implementation detail: We route queries based on the endpoint. /api/analytics/* routes to the read replica; everything else hits the primary.
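A minimal sketch of that routing rule, where the pool objects are stand-ins for real connection pools (e.g., node-postgres Pools) and the path prefix matches the convention above:

```javascript
// Route analytics reads to the replica; everything else hits the primary.
// The 2-second replication lag is acceptable for historical dashboards.
function poolFor(path, pools) {
  return path.startsWith('/api/analytics/') ? pools.replica : pools.primary;
}
```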
Pattern 3: Edge Caching for Static AI Responses
Some AI queries are surprisingly cacheable:
- Embeddings for the same text
- Classifications for identical inputs
- Translations of identical phrases
We implemented a semantic cache at the edge (Cloudflare Workers KV):
- Hash the input + model + relevant parameters
- Check cache before routing to provider
- Cache hit rate: 23% across all requests
- Saved approximately €8,000/month in API costs
Cache key strategy:
const crypto = require('node:crypto');

const cacheKey = crypto.createHash('sha256')
  .update(JSON.stringify({
    model: request.model,
    input: request.input,
    temperature: request.temperature, // Only if deterministic (0)
    // Exclude parameters that cause variation
  }))
  .digest('hex');
Monitoring and Observability
You cannot scale what you cannot see. Our observability stack:
Metrics (Prometheus + Grafana):
- Request rate by model, tenant, and endpoint
- Latency percentiles (P50, P95, P99)
- Credit consumption velocity
- Provider health scores
Tracing (Jaeger):
- End-to-end request traces
- Cross-service dependency mapping
- Slow query identification
Alerting (PagerDuty):
- Error rate exceeds 1% for 5 minutes
- P99 latency exceeds 10 seconds
- Credit system write failures
- Provider circuit breaker opens
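The first rule above ("error rate exceeds 1% for 5 minutes") is a sustained-threshold check, not a single-sample one. A sketch, assuming per-minute error-rate samples:

```javascript
// Page only when the error rate stays above the threshold for the entire
// window, so a single bad minute doesn't wake anyone up.
function shouldPage(errorRates, threshold = 0.01, windowSize = 5) {
  if (errorRates.length < windowSize) return false;
  return errorRates.slice(-windowSize).every((rate) => rate > threshold);
}
```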
The March incident would have been caught 6 minutes earlier with our current monitoring. That's €20,000 saved.
Lessons Learned
1. Design for the 10x customer from day one. The customer who brought us down wasn't malicious; they were successful. Build rate limits, resource quotas, and isolation from the start.
2. Multi-tenancy mode should be upgradeable. Customers grow. What starts as a shared-database tenant may become an enterprise customer needing isolation. Plan migration paths.
3. Async by default, sync when required. Most AI workloads don't need real-time responses. Queue them. Your architecture will thank you.
4. Provider diversity is a feature, not a burden. When OpenAI goes down, having Anthropic and Google as fallbacks isn't just reliability; it's a competitive advantage.
5. Credit systems are financial systems. Treat them with the rigor of a payment processor. Audit trails, reconciliation, and anomaly detection are non-negotiable.
Building a platform that serves 100+ AI models at scale is fundamentally a distributed systems problem with AI-specific constraints. The models are the easy part. The infrastructure that makes them accessible, reliable, and economically viable: that's where the real engineering happens.