How 14 Minutes of Downtime Can Cost €47,000 in Refunds
Imagine this scenario: a SaaS platform's primary database reaches 100% CPU utilization. Within 14 minutes, the entire platform becomes unresponsive. During that window, 2,847 users lose access mid-workflow. AI model requests queue and time out. Credit transactions fail silently.
The root cause? A single customer running a bulk export of 340,000 records through an API endpoint that wasn't properly rate-limited. One tenant bringing down the entire platform.
An incident like this forces a complete architectural rethink. The patterns below show how our platform evolved to serve 3x traffic with 99.97% uptime after learning these lessons the hard way.
The Multi-Tenancy Decision Tree
Before writing a single line of code, you must decide how tenants share (or don't share) resources. This decision affects everything downstream.
Shared Everything (Database-Level Multi-Tenancy)
All tenants share the same database, distinguished by a tenant_id column on every table.
Advantages:
- Lowest infrastructure cost per tenant
- Simplest deployment model
- Easy cross-tenant analytics
Disadvantages:
- Noisy neighbor problem (one tenant affects all others)
- Complex data isolation (every query must filter by tenant)
- Regulatory challenges (some industries require physical separation)
When to use: Early-stage SaaS with budget constraints, homogeneous customer base, no strict compliance requirements.
Common mistake: many platforms start here, and when enterprise customers later ask for dedicated resources, they can't offer them without a major migration.
Shared Infrastructure, Isolated Databases
Each tenant gets their own database but shares application servers and infrastructure.
Advantages:
- Strong data isolation
- Per-tenant backup and restore
- Easier compliance story
Disadvantages:
- Connection pool management becomes complex
- Schema migrations must run across all databases
- Higher operational overhead
When to use: Regulated industries, customers with data residency requirements, B2B SaaS with contractual isolation requirements.
Fully Isolated (Single-Tenant Deployments)
Each tenant gets dedicated infrastructure: their own application instances, databases, and often their own Kubernetes namespace or cloud account.
Advantages:
- Complete isolation (performance, security, compliance)
- Per-tenant customization possible
- Clean security boundaries
Disadvantages:
- Highest cost per tenant
- Deployment complexity scales linearly with customers
- Upgrades become a coordination nightmare
When to use: Enterprise customers paying €10k+/month, government contracts, highly regulated industries.
Our Hybrid Approach
After the March incident, we implemented tiered isolation:
Standard tier: Shared database with row-level security. Tenant data filtered at the application layer and enforced at the database layer using PostgreSQL RLS policies.
Professional tier: Isolated database per tenant, shared application infrastructure. Connection pooling via PgBouncer with per-tenant pools.
Enterprise tier: Dedicated Kubernetes namespace with isolated compute, storage, and networking. Effectively single-tenant within shared infrastructure.
This hybrid model increased our infrastructure costs by 40% but eliminated cross-tenant performance impact entirely.
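As a rough sketch of how the routing layer consumes this tiering (the tier names match ours, but the settings object and fallback behavior here are illustrative, not our exact configuration):

```javascript
// Hypothetical mapping from pricing tier to isolation settings.
const ISOLATION_BY_TIER = {
  standard:     { database: 'shared',     rls: true,  dedicatedNamespace: false },
  professional: { database: 'per-tenant', rls: false, dedicatedNamespace: false },
  enterprise:   { database: 'per-tenant', rls: false, dedicatedNamespace: true },
};

// Resolve isolation settings for a tenant; an unknown tier falls back to
// the most restrictive shared profile rather than failing the request.
function isolationFor(tenant) {
  return ISOLATION_BY_TIER[tenant.tier] ?? ISOLATION_BY_TIER.standard;
}
```

Centralizing this lookup means the rest of the request path never branches on tier names directly, which keeps tier upgrades a data change rather than a code change.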
Designing for 100+ AI Models
The unique challenge of multi-model AI platforms: routing requests to 100+ AI model providers, each with different rate limits, pricing, latency characteristics, and failure modes.
The Gateway Pattern
Every AI request flows through a unified gateway that handles:
          ┌──────────────────┐
          │   API Gateway    │
          │  (Rate Limits)   │
          └────────┬─────────┘
                   │
          ┌────────┴─────────┐
          │  Router Service  │
          │ (Model Selection)│
          └────────┬─────────┘
                   │
     ┌─────────────┼─────────────┐
     │             │             │
┌────┴─────┐ ┌─────┴────┐ ┌──────┴─────┐
│  OpenAI  │ │Anthropic │ │   Google   │
│ Adapter  │ │ Adapter  │ │  Adapter   │
└──────────┘ └──────────┘ └────────────┘
Router Service logic:
- Check user's tier and credit balance
- Determine optimal model based on request type and user preference
- Check provider health scores (maintained by background monitors)
- Apply fallback rules if primary provider is degraded
- Route to appropriate adapter
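The steps above can be sketched as a small selection function. Everything here is illustrative (provider names, the health threshold, the shape of the objects); real routing also weighs request type and per-model pricing:

```javascript
// Illustrative router sketch; names, scores, and thresholds are assumptions.
const HEALTH_THRESHOLD = 0.8; // providers below this are treated as degraded

function routeRequest(user, request, providers) {
  // 1. Check credit balance before doing any routing work.
  if (user.creditBalance <= 0) {
    return { error: 'insufficient_credits' };
  }
  // 2. Start from the user's preferred provider, if any.
  const preferredName = request.preferredProvider ?? providers[0].name;
  const byName = new Map(providers.map((p) => [p.name, p]));
  // 3. Use the preferred provider when its health score is acceptable.
  const primary = byName.get(preferredName);
  if (primary && primary.healthScore >= HEALTH_THRESHOLD) {
    return { provider: primary.name };
  }
  // 4. Otherwise apply fallback rules: pick the healthiest remaining provider.
  const fallback = providers
    .filter((p) => p.healthScore >= HEALTH_THRESHOLD)
    .sort((a, b) => b.healthScore - a.healthScore)[0];
  // 5. Route to the chosen adapter, or fail if nothing is healthy.
  return fallback
    ? { provider: fallback.name, fellBack: true }
    : { error: 'no_healthy_provider' };
}
```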
Adapter responsibilities:
- Translate our unified API format to provider-specific format
- Handle authentication and credential rotation
- Implement circuit breaker patterns
- Collect latency and error metrics
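A minimal circuit breaker, of the kind each adapter wraps around its provider calls, might look like this sketch. The 3-failure threshold matches the behavior described later in this article; the cooldown value is an assumption:

```javascript
// Minimal circuit breaker sketch. Opens after N consecutive failures,
// then allows a trial request through once the cooldown has elapsed.
class CircuitBreaker {
  constructor(failureThreshold = 3, cooldownMs = 30_000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }
  // An open breaker rejects calls until the cooldown elapses.
  isOpen(now = Date.now()) {
    if (this.openedAt === null) return false;
    if (now - this.openedAt >= this.cooldownMs) {
      // Half-open: reset and let one trial request probe the provider.
      this.openedAt = null;
      this.consecutiveFailures = 0;
      return false;
    }
    return true;
  }
  recordSuccess() {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }
  recordFailure(now = Date.now()) {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAt = now;
    }
  }
}
```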
Credit System Architecture
Credits are the universal currency across all AI operations. The credit system must be:
- Accurate: Users trust their balance
- Fast: Cannot add latency to every request
- Consistent: No double-spending, no lost credits
Our implementation uses an event-sourced pattern:
-- Credit events are append-only
CREATE TABLE credit_events (
id UUID PRIMARY KEY,
tenant_id UUID NOT NULL,
event_type VARCHAR(50), -- 'purchase', 'usage', 'refund', 'adjustment'
amount DECIMAL(10, 4),
model_id VARCHAR(100),
request_id UUID,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Materialized balance for fast reads
CREATE MATERIALIZED VIEW credit_balances AS
SELECT
tenant_id,
SUM(CASE WHEN event_type IN ('purchase', 'refund') THEN amount ELSE -amount END) as balance
FROM credit_events
GROUP BY tenant_id;
For real-time balance checks, we maintain an in-memory cache (Redis) that's updated asynchronously after each transaction. The materialized view serves as the authoritative source, refreshed every 5 minutes and on-demand for disputed balances.
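The balance derivation mirrors the CASE expression in the materialized view: purchases and refunds add, everything else subtracts. A self-contained sketch of that fold over the event log (field names adapted from the table above):

```javascript
// Event types that add to the balance; all others subtract, matching the
// CASE expression in the credit_balances materialized view.
const CREDIT_EVENTS = new Set(['purchase', 'refund']);

function balanceFor(tenantId, events) {
  return events
    .filter((e) => e.tenantId === tenantId)
    .reduce(
      (sum, e) => sum + (CREDIT_EVENTS.has(e.eventType) ? e.amount : -e.amount),
      0
    );
}
```

Because events are append-only, this fold is deterministic and replayable, which is what makes the Redis cache safe to rebuild at any time.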
Handling Provider Outages
AI providers go down. OpenAI had 7 significant outages in 2024. Your architecture must handle this gracefully.
Our approach: Automatic fallback with user consent
Users configure fallback preferences:
- Allow automatic fallback to equivalent models (e.g., GPT-4 → Claude 3.5 Sonnet)
- Allow automatic fallback to lower-tier models (e.g., GPT-4 → GPT-4o-mini)
- No fallback (fail the request)
When the primary model is unavailable:
- Circuit breaker opens after 3 consecutive failures
- System checks user's fallback preferences
- Routes to fallback model if permitted
- Logs both original and fallback model for billing transparency
Credit adjustment: If a fallback model costs less, we credit the difference. If it costs more, we absorb the difference. This builds trust.
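Putting the consent check and the adjustment rule together in a sketch (preference values, model slots, and costs are illustrative, not our real price list):

```javascript
// Map the user's fallback preference to a concrete fallback model, or null
// if the user has opted out (in which case the request fails).
function resolveFallback(preference, equivalentModel, lowerTierModel) {
  switch (preference) {
    case 'equivalent': return equivalentModel;
    case 'lower-tier': return lowerTierModel;
    default:           return null; // no fallback: fail the request
  }
}

// If the fallback is cheaper, refund the difference to the user;
// if it is more expensive, the platform absorbs the difference.
function creditAdjustment(primaryCost, fallbackCost) {
  const diff = primaryCost - fallbackCost;
  return diff > 0 ? { refund: diff, absorbed: 0 } : { refund: 0, absorbed: -diff };
}
```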
Scaling Patterns That Actually Worked
Pattern 1: Queue Everything That Can Wait
Not all requests need synchronous processing. We identified three categories:
Synchronous (< 30s expected):
- Chat completions
- Simple generations
- Real-time analysis
Async-by-default (30s - 5min):
- Image generation
- Long document processing
- Batch translations
Background (> 5min):
- Fine-tuning jobs
- Bulk exports
- Report generation
Moving async workloads to queues (we use Amazon SQS with worker pods) reduced our API server load by 60% and improved P95 latency for synchronous requests from 4.2s to 890ms.
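The categorization can be reduced to a small dispatcher. The operation names below are examples standing in for our real catalog, not an exhaustive mapping:

```javascript
// Illustrative dispatcher for the three processing categories.
const SYNC_OPS = new Set(['chat.completion', 'generation.simple', 'analysis.realtime']);
const BACKGROUND_OPS = new Set(['fine_tune', 'bulk_export', 'report']);

function dispatch(op) {
  if (SYNC_OPS.has(op)) return 'sync';             // handled in-request
  if (BACKGROUND_OPS.has(op)) return 'background'; // long-running job queue
  return 'async';                                  // default: queue-backed workers
}
```

Making "async" the default (rather than "sync") means a newly added operation degrades to queued processing instead of tying up an API server.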
Pattern 2: Read Replicas for Analytics
User dashboards showing usage history, credit consumption, and model performance were hammering our primary database.
Solution: Dedicated read replica for all analytics queries. The 2-second replication lag is invisible for historical dashboards.
Implementation detail: We route queries based on the endpoint. /api/analytics/* routes to the read replica; everything else hits the primary.
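A minimal sketch of that routing rule, where the pool objects are stand-ins for real connection pools (e.g., node-postgres Pools) and the path prefix matches the convention above:

```javascript
// Route analytics reads to the replica; everything else hits the primary.
// The 2-second replication lag is acceptable for historical dashboards.
function poolFor(path, pools) {
  return path.startsWith('/api/analytics/') ? pools.replica : pools.primary;
}
```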
Pattern 3: Edge Caching for Static AI Responses
Some AI queries are surprisingly cacheable:
- Embeddings for the same text
- Classifications for identical inputs
- Translations of identical phrases
We implemented a semantic cache at the edge (Cloudflare Workers KV):
- Hash the input + model + relevant parameters
- Check cache before routing to provider
- Cache hit rate: 23% across all requests
- Saved approximately €8,000/month in API costs
Cache key strategy:
const crypto = require('node:crypto');

const cacheKey = crypto.createHash('sha256')
  .update(JSON.stringify({
    model: request.model,
    input: request.input,
    temperature: request.temperature, // Only if deterministic (0)
    // Exclude parameters that cause variation
  }))
  .digest('hex');
Monitoring and Observability
You cannot scale what you cannot see. Our observability stack:
Metrics (Prometheus + Grafana):
- Request rate by model, tenant, and endpoint
- Latency percentiles (P50, P95, P99)
- Credit consumption velocity
- Provider health scores
Tracing (Jaeger):
- End-to-end request traces
- Cross-service dependency mapping
- Slow query identification
Alerting (PagerDuty):
- Error rate exceeds 1% for 5 minutes
- P99 latency exceeds 10 seconds
- Credit system write failures
- Provider circuit breaker opens
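The first rule above ("error rate exceeds 1% for 5 minutes") is a sustained-threshold check, not a single-sample one. A sketch, assuming per-minute error-rate samples:

```javascript
// Page only when the error rate stays above the threshold for the entire
// window, so a single bad minute doesn't wake anyone up.
function shouldPage(errorRates, threshold = 0.01, windowSize = 5) {
  if (errorRates.length < windowSize) return false;
  return errorRates.slice(-windowSize).every((rate) => rate > threshold);
}
```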
The March incident would have been caught 6 minutes earlier with our current monitoring. That's €20,000 saved.
Lessons Learned
1. Design for the 10x customer from day one. The customer who brought us down wasn't malicious; they were successful. Build rate limits, resource quotas, and isolation from the start.
2. Multi-tenancy mode should be upgradeable. Customers grow. What starts as a shared-database tenant may become an enterprise customer needing isolation. Plan migration paths.
3. Async by default, sync when required. Most AI workloads don't need real-time responses. Queue them. Your architecture will thank you.
4. Provider diversity is a feature, not a burden. When OpenAI goes down, having Anthropic and Google as fallbacks isn't just reliability; it's a competitive advantage.
5. Credit systems are financial systems. Treat them with the rigor of a payment processor. Audit trails, reconciliation, and anomaly detection are non-negotiable.
Building a platform that serves 100+ AI models at scale is fundamentally a distributed systems problem with AI-specific constraints. The models are the easy part. The infrastructure that makes them accessible, reliable, and economically viable: that's where the real engineering happens.