Engineering

Building Scalable SaaS Architecture: Lessons from Multi-Model AI Platforms


How 14 Minutes of Downtime Can Cost €47,000 in Refunds

Imagine this scenario: a SaaS platform's primary database reaches 100% CPU utilization. Within 14 minutes, the entire platform becomes unresponsive. During that window, 2,847 users lose access mid-workflow. AI model requests queue and time out. Credit transactions fail silently.

The root cause? A single customer running a bulk export of 340,000 records through an API endpoint that wasn't properly rate-limited. One tenant bringing down the entire platform.

This was our March incident. It forced a complete architectural rethink, and the patterns below are the result: the same platform now serves 3x the traffic at 99.97% uptime.

The Multi-Tenancy Decision Tree

Before writing a single line of code, you must decide how tenants share (or don't share) resources. This decision affects everything downstream.

Shared Everything (Database-Level Multi-Tenancy)

All tenants share the same database, distinguished by a tenant_id column on every table.

Advantages:

  • Lowest infrastructure cost per tenant

  • Simplest deployment model

  • Easy cross-tenant analytics

Disadvantages:

  • Noisy neighbor problem (one tenant affects all others)

  • Complex data isolation (every query must filter by tenant)

  • Regulatory challenges (some industries require physical separation)

When to use: Early-stage SaaS with budget constraints, homogeneous customer base, no strict compliance requirements.

Common mistake: Many platforms start here. When Enterprise customers ask for dedicated resources, they can't offer them without a major migration.
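The "every query must filter by tenant" burden is easiest to carry if all data access funnels through one helper that injects the tenant guard. A sketch, assuming node-postgres-style parameterized queries; the `tenantScoped` helper and its exact behavior are illustrative, not a specific library's API:

```javascript
// Sketch: centralize tenant filtering so no query can forget it.
// `baseQuery` is a plain SQL string; the helper appends the tenant guard
// and returns parameterized text plus values.
function tenantScoped(baseQuery, tenantId, params = []) {
  if (!tenantId) throw new Error('tenant_id is required for every query');
  // Append to an existing WHERE clause, or start one.
  const clause = baseQuery.toLowerCase().includes(' where ')
    ? ' AND tenant_id = $'
    : ' WHERE tenant_id = $';
  return {
    text: baseQuery + clause + (params.length + 1),
    values: [...params, tenantId],
  };
}
```

A forgotten filter then fails loudly at the helper instead of silently leaking another tenant's rows.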

Shared Infrastructure, Isolated Databases

Each tenant gets their own database but shares application servers and infrastructure.

Advantages:

  • Strong data isolation

  • Per-tenant backup and restore

  • Easier compliance story

Disadvantages:

  • Connection pool management becomes complex

  • Schema migrations must run across all databases

  • Higher operational overhead

When to use: Regulated industries, customers with data residency requirements, B2B SaaS with contractual isolation requirements.

Fully Isolated (Single-Tenant Deployments)

Each tenant gets dedicated infrastructure: their own application instances, databases, and often their own Kubernetes namespace or cloud account.

Advantages:

  • Complete isolation (performance, security, compliance)

  • Per-tenant customization possible

  • Clean security boundaries

Disadvantages:

  • Highest cost per tenant

  • Deployment complexity scales linearly with customers

  • Upgrades become a coordination nightmare

When to use: Enterprise customers paying €10k+/month, government contracts, highly regulated industries.

Our Hybrid Approach

After the March incident, we implemented tiered isolation:

Standard tier: Shared database with row-level security. Tenant data is filtered at the application layer and enforced at the database layer with PostgreSQL RLS policies.

Professional tier: Isolated database per tenant, shared application infrastructure. Connection pooling via PgBouncer with per-tenant pools.

Enterprise tier: Dedicated Kubernetes namespace with isolated compute, storage, and networking. Effectively single-tenant within shared infrastructure.

This hybrid model increased our infrastructure costs by 40% but eliminated cross-tenant performance impact entirely.
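The tiering above can be condensed into a small routing table. This is a sketch only; the connection strings, pool names, and the `resolveDatabaseTarget` helper are illustrative, not our production code:

```javascript
// Illustrative mapping from subscription tier to isolation strategy.
const TIER_CONFIG = {
  standard:     { pool: 'shared' },      // one database, PostgreSQL RLS isolation
  professional: { pool: 'per-tenant' },  // own database behind PgBouncer
  enterprise:   { pool: 'dedicated' },   // isolated Kubernetes namespace
};

function resolveDatabaseTarget(tenant) {
  const cfg = TIER_CONFIG[tenant.tier];
  if (!cfg) throw new Error(`unknown tier: ${tenant.tier}`);
  switch (cfg.pool) {
    case 'shared':
      // Every session sets the tenant id so RLS policies can filter rows.
      return { dsn: 'postgres://shared-primary/app', sessionSetting: tenant.id };
    case 'per-tenant':
      // PgBouncer fronts one database per tenant.
      return { dsn: `postgres://pgbouncer/tenant_${tenant.id}` };
    case 'dedicated':
      // Dedicated namespace: the tenant id doubles as the service hostname.
      return { dsn: `postgres://${tenant.id}.svc.cluster.local/app` };
  }
}
```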

Designing for 100+ AI Models

The unique challenge of multi-model AI platforms: routing requests across 100+ models from multiple providers, each with different rate limits, pricing, latency characteristics, and failure modes.

The Gateway Pattern

Every AI request flows through a unified gateway that handles:

        ┌─────────────────┐
        │   API Gateway   │
        │  (Rate Limits)  │
        └────────┬────────┘
                 │
        ┌────────▼─────────┐
        │  Router Service  │
        │ (Model Selection)│
        └────────┬─────────┘
                 │
   ┌─────────────┼─────────────┐
   │             │             │
┌──▼──────┐  ┌───▼─────┐  ┌────▼────┐
│ OpenAI  │  │Anthropic│  │ Google  │
│ Adapter │  │ Adapter │  │ Adapter │
└─────────┘  └─────────┘  └─────────┘

Router Service logic:

  • Check user's tier and credit balance

  • Determine optimal model based on request type and user preference

  • Check provider health scores (maintained by background monitors)

  • Apply fallback rules if primary provider is degraded

  • Route to appropriate adapter
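The steps above can be sketched as one function. The `routeRequest` helper, the 0.5 health-score threshold, and the provider object shape are simplified illustrations, not the real router:

```javascript
// Illustrative default-model choice per request type.
function defaultModelFor(type) {
  return type === 'chat' ? 'gpt-4o' : 'gpt-4o-mini';
}

function routeRequest(user, request, providers) {
  // 1. Tier and credit checks come before any provider work.
  if (user.creditBalance <= 0) {
    return { error: 'insufficient_credits' };
  }
  // 2. Start from the user's preferred model, else a default for the type.
  const primary = request.model || defaultModelFor(request.type);
  // 3. Try the primary first, then configured fallbacks, in order.
  const candidates = [primary, ...(user.fallbackModels || [])];
  for (const model of candidates) {
    const provider = providers[model];
    // 4. Skip providers whose background health score marks them degraded
    //    or whose circuit breaker is open.
    if (provider && provider.healthScore >= 0.5 && !provider.circuitOpen) {
      return { model, adapter: provider.adapter };
    }
  }
  return { error: 'no_available_provider' };
}
```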

Adapter responsibilities:

  • Translate our unified API format to provider-specific format

  • Handle authentication and credential rotation

  • Implement circuit breaker patterns

  • Collect latency and error metrics

Credit System Architecture

Credits are the universal currency across all AI operations. The credit system must be:

  • Accurate: Users trust their balance

  • Fast: Cannot add latency to every request

  • Consistent: No double-spending, no lost credits

Our implementation uses an event-sourced pattern:

-- Credit events are append-only
CREATE TABLE credit_events (
    id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL,
    event_type VARCHAR(50), -- 'purchase', 'usage', 'refund', 'adjustment'
    amount DECIMAL(10, 4),
    model_id VARCHAR(100),
    request_id UUID,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Materialized balance for fast reads
CREATE MATERIALIZED VIEW credit_balances AS
SELECT
    tenant_id,
    SUM(CASE WHEN event_type IN ('purchase', 'refund') THEN amount ELSE -amount END) AS balance
FROM credit_events
GROUP BY tenant_id;

For real-time balance checks, we maintain an in-memory cache (Redis) that's updated asynchronously after each transaction. The materialized view serves as the authoritative source, refreshed every 5 minutes and on-demand for disputed balances.
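A minimal in-memory sketch of this flow, with a plain array standing in for the `credit_events` table and a `Map` standing in for the Redis cache:

```javascript
// Append-only event log and best-effort balance cache (in-memory stand-ins).
const events = [];
const balanceCache = new Map(); // tenant_id -> cached balance

function recordCreditEvent(tenantId, eventType, amount) {
  // Events are never updated; corrections arrive as new 'adjustment' events.
  events.push({ tenantId, eventType, amount, createdAt: Date.now() });
  const sign = (eventType === 'purchase' || eventType === 'refund') ? 1 : -1;
  // Cache update is asynchronous/best-effort in production; the
  // materialized view remains the authoritative source.
  balanceCache.set(tenantId, (balanceCache.get(tenantId) || 0) + sign * amount);
}

// Authoritative recomputation, mirroring the view's SUM(CASE ...) logic.
function recomputeBalance(tenantId) {
  return events
    .filter(e => e.tenantId === tenantId)
    .reduce((sum, e) =>
      sum + ((e.eventType === 'purchase' || e.eventType === 'refund')
        ? e.amount
        : -e.amount), 0);
}
```

Because the log is append-only, any cache drift can be detected and repaired by replaying events, which is exactly what the on-demand refresh does for disputed balances.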

Handling Provider Outages

AI providers go down. OpenAI had 7 significant outages in 2024. Your architecture must handle this gracefully.

Our approach: Automatic fallback with user consent

Users configure fallback preferences:

  • Allow automatic fallback to equivalent models (e.g., GPT-4 → Claude 3.5 Sonnet)

  • Allow automatic fallback to lower-tier models (e.g., GPT-4 → GPT-4o-mini)

  • No fallback (fail the request)

When the primary model is unavailable:

  • Circuit breaker opens after 3 consecutive failures

  • System checks user's fallback preferences

  • Routes to fallback model if permitted

  • Logs both original and fallback model for billing transparency

Credit adjustment: If a fallback model costs less, we credit the difference. If it costs more, we absorb the difference. This builds trust.
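The failure-handling steps above can be sketched as a minimal circuit breaker plus the credit-adjustment rule. The class shape, the 30-second cooldown, and the half-open behavior are illustrative assumptions:

```javascript
// Per-provider circuit breaker: opens after 3 consecutive failures,
// then half-opens after a cooldown so a single probe can close it again.
class CircuitBreaker {
  constructor(failureThreshold = 3, cooldownMs = 30000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }
  get isOpen() {
    if (this.openedAt === null) return false;
    // After the cooldown, report closed so one probe request gets through;
    // a failure on the probe will re-open the breaker immediately.
    return Date.now() - this.openedAt < this.cooldownMs;
  }
  recordSuccess() {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }
  recordFailure() {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAt = Date.now();
    }
  }
}

// Credit adjustment after a fallback: refund any savings, never bill extra.
function fallbackCreditAdjustment(primaryCost, fallbackCost) {
  return Math.max(0, primaryCost - fallbackCost); // credits returned to the user
}
```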

Scaling Patterns That Actually Worked

Pattern 1: Queue Everything That Can Wait

Not all requests need synchronous processing. We identified three categories:

Synchronous (< 30s expected):

  • Chat completions

  • Simple generations

  • Real-time analysis

Async-by-default (30s - 5min):

  • Image generation

  • Long document processing

  • Batch translations

Background (> 5min):

  • Fine-tuning jobs

  • Bulk exports

  • Report generation

Moving async workloads to queues (we use Amazon SQS with worker pods) reduced our API server load by 60% and improved P95 latency for synchronous requests from 4.2s to 890ms.
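One way to sketch this classification is a static lookup from request type to processing lane; the workload names and the exact mapping here are illustrative:

```javascript
// Illustrative mapping from request type to processing lane.
const WORKLOAD_LANES = {
  chat_completion: 'sync',
  simple_generation: 'sync',
  realtime_analysis: 'sync',
  image_generation: 'async',
  document_processing: 'async',
  batch_translation: 'async',
  fine_tuning: 'background',
  bulk_export: 'background',
  report_generation: 'background',
};

function dispatch(request) {
  // Unknown types default to async: a queued response beats a timeout.
  const lane = WORKLOAD_LANES[request.type] || 'async';
  if (lane === 'sync') return { handler: 'api-server', queued: false };
  // Async and background work is enqueued (SQS in our setup) for worker pods.
  return { handler: lane === 'async' ? 'worker-pool' : 'batch-pool', queued: true };
}
```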

Pattern 2: Read Replicas for Analytics

User dashboards showing usage history, credit consumption, and model performance were hammering our primary database.

Solution: Dedicated read replica for all analytics queries. The 2-second replication lag is invisible for historical dashboards.

Implementation detail: We route queries based on the endpoint. /api/analytics/* routes to the read replica; everything else hits the primary.
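That routing rule fits in a few lines; the return values are illustrative labels for the two connection targets:

```javascript
// Endpoint-based read routing: only analytics traffic may read stale data.
function selectDatabase(path) {
  // The ~2s replication lag is acceptable for historical dashboards only,
  // so only /api/analytics/* is allowed on the replica.
  if (path.startsWith('/api/analytics/')) {
    return 'replica';
  }
  return 'primary';
}
```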

Pattern 3: Edge Caching for Static AI Responses

Some AI queries are surprisingly cacheable:

  • Embeddings for the same text

  • Classifications for identical inputs

  • Translations of identical phrases

We implemented a semantic cache at the edge (Cloudflare Workers KV):

  • Hash the input + model + relevant parameters

  • Check cache before routing to provider

  • Cache hit rate: 23% across all requests

  • Saved approximately €8,000/month in API costs

Cache key strategy:

const crypto = require('crypto');

const cacheKey = crypto.createHash('sha256')
  .update(JSON.stringify({
    model: request.model,
    input: request.input,
    temperature: request.temperature, // Only if deterministic (0)
    // Exclude parameters that cause variation
  }))
  .digest('hex');

Monitoring and Observability

You cannot scale what you cannot see. Our observability stack:

Metrics (Prometheus + Grafana):

  • Request rate by model, tenant, and endpoint

  • Latency percentiles (P50, P95, P99)

  • Credit consumption velocity

  • Provider health scores

Tracing (Jaeger):

  • End-to-end request traces

  • Cross-service dependency mapping

  • Slow query identification

Alerting (PagerDuty):

  • Error rate exceeds 1% for 5 minutes

  • P99 latency exceeds 10 seconds

  • Credit system write failures

  • Provider circuit breaker opens

The March incident would have been caught 6 minutes earlier with our current monitoring. That's €20,000 saved.

Lessons Learned

1. Design for the 10x customer from day one. The customer who brought us down wasn't malicious; they were successful. Build rate limits, resource quotas, and isolation from the start.

2. Multi-tenancy mode should be upgradeable. Customers grow. What starts as a shared-database tenant may become an enterprise customer needing isolation. Plan migration paths.

3. Async by default, sync when required. Most AI workloads don't need real-time responses. Queue them. Your architecture will thank you.

4. Provider diversity is a feature, not a burden. When OpenAI goes down, having Anthropic and Google as fallbacks isn't just reliability; it's competitive advantage.

5. Credit systems are financial systems. Treat them with the rigor of a payment processor. Audit trails, reconciliation, and anomaly detection are non-negotiable.

Building a platform that serves 100+ AI models at scale is fundamentally a distributed systems problem with AI-specific constraints. The models are the easy part. The infrastructure that makes them accessible, reliable, and economically viable: that's where the real engineering happens.


About the Author

Ricardo Mendes

Co-founder of AIOBI. Computer Engineer with experience in data analysis, software, and financial management.