Scaling AI Products: Performance at 10,000 Users

There’s a specific moment in every product’s life when the architecture you built for “good enough” starts cracking. For us, that moment arrived at 847 concurrent users.

MĀRGA’s routing latency jumped from 8ms p50 to 340ms p50. DevOps RAG’s retrieval started timing out. RAKṢĀ queued 200+ scans with no backpressure. Our $45/day inference bill became $380/day. The Datadog dashboard looked like a Jackson Pollock painting — all red, no pattern.

This is the story of how we went from there to handling 10,000 concurrent users across three AI products with two people and an infrastructure bill of about $165 a day.

No fairy dust. Just engineering.

The 847-User Wall: What Broke and Why

At 847 concurrent users, three things failed simultaneously. Understanding why they failed simultaneously is the actual lesson.

The Cascade

1. MĀRGA’s in-memory routing table was protected by a single sync.RWMutex. Read contention at 800+ concurrent requests created a lock convoy: goroutines piled up waiting for the read lock whenever a background goroutine took the write lock to update provider health scores, which happened every 5 seconds.

2. DevOps RAG’s embedding calls were synchronous and unbatched. Each query generated a fresh embedding via OpenAI’s API. At 200 queries/minute, we hit the 3,000 RPM rate limit. Requests queued. Timeouts cascaded.

3. RAKṢĀ had no admission control. Every scan request was accepted immediately. With 200+ pending scans, the Go runtime’s goroutine scheduler thrashed, and memory climbed past the Cloud Run container’s 2GB limit.

The root cause wasn’t any single failure. It was the absence of backpressure. None of our services could say “I’m full, come back later.” They just accepted work until they fell over.

Time: 14:23 SGT          Load: 847 concurrent users

MĀRGA        ████████████████████████████████░░ 94% CPU
             Lock convoy: 340ms p50 (budget: 20ms)
             
DevOps RAG   ████████████████████████████░░░░░░ 82% CPU
             Embedding queue: 847 pending (budget: 0)
             OpenAI RPM: 2,998/3,000 (throttled)
             
RAKṢĀ        ████████████████████████████████░░ 96% CPU
             Scan queue: 213 pending (no limit)
             Memory: 1.94GB / 2.00GB (OOM imminent)
             
Result:      3 services degraded simultaneously
             User-facing latency: 12-45 seconds
             Error rate: 23%

Phase 1: Stop the Bleeding (Week 1)

Before optimizing anything, we needed the system to degrade gracefully instead of falling over.

Admission Control: Teaching Services to Say “No”

Every service got a semaphore-based admission controller. Simple concept: define the maximum number of concurrent requests you can handle, and reject the rest with a 503 and a Retry-After header.

// MĀRGA admission controller
type AdmissionController struct {
    sem     chan struct{}
    metrics *MetricsCollector
}

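// NewAdmission takes the metrics collector explicitly so that TryAcquire
// and Release never dereference a nil pointer.
func NewAdmission(maxConcurrent int, metrics *MetricsCollector) *AdmissionController {
    return &AdmissionController{
        sem:     make(chan struct{}, maxConcurrent),
        metrics: metrics,
    }
}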

func (ac *AdmissionController) TryAcquire(ctx context.Context) bool {
    select {
    case ac.sem <- struct{}{}:
        ac.metrics.IncrConcurrent()
        return true
    default:
        ac.metrics.IncrRejected()
        return false
    }
}

func (ac *AdmissionController) Release() {
    <-ac.sem
    ac.metrics.DecrConcurrent()
}

// In the HTTP handler:
func (s *Server) handleRoute(w http.ResponseWriter, r *http.Request) {
    if !s.admission.TryAcquire(r.Context()) {
        w.Header().Set("Retry-After", "2")
        http.Error(w, "service at capacity", http.StatusServiceUnavailable)
        return
    }
    defer s.admission.Release()
    
    // ... actual routing logic
}

MĀRGA: 500 concurrent slots. DevOps RAG: 200. RAKṢĀ: 50 (scans are expensive).

This single change turned cascading failures into graceful degradation. At 847 users, instead of all three services crashing, excess requests got a clean 503 with a retry hint. Client-side retry with exponential backoff smoothed the load within seconds.
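The client half of that contract can be sketched in a few lines. This is a minimal illustration, not our client library: `backoff_delay`, its `base` and `cap` defaults, and the injectable `rng` are assumptions made for the example. It shows full-jitter exponential backoff that never retries sooner than the server's Retry-After hint.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, retry_after: float = 0.0,
                  base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Full jitter: pick a random delay in [0, min(cap, base * 2**attempt)],
    then honor the server's Retry-After hint as a lower bound.
    """
    rng = rng if rng is not None else random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    jittered = rng.uniform(0.0, ceiling)
    return max(retry_after, jittered)


if __name__ == "__main__":
    # Example: the server answered 503 with "Retry-After: 2".
    for attempt in range(5):
        print(attempt, round(backoff_delay(attempt, retry_after=2.0), 2))
```

The jitter matters as much as the exponent: without it, every rejected client comes back at the same instant and recreates the spike.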

Lock-Free Routing Table

The sync.RWMutex on MĀRGA’s routing table was the worst bottleneck. We replaced it with atomic.Value, which gives readers a lock-free path.

// Before: Lock contention at scale
type Router struct {
    mu     sync.RWMutex
    table  map[string]*ProviderState
}

func (r *Router) Route(req *Request) *Provider {
    r.mu.RLock()         // ← 800 goroutines compete for this
    defer r.mu.RUnlock()
    // ... read routing table
}

// After: Lock-free reads via atomic swap
type Router struct {
    table  atomic.Value  // stores *routingSnapshot
}

type routingSnapshot struct {
    providers map[string]*ProviderState
    version   uint64
    updated   time.Time
}

func (r *Router) Route(req *Request) *Provider {
    snap := r.table.Load().(*routingSnapshot)  // ← zero contention
    // ... read from immutable snapshot
}

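func (r *Router) updateHealth(provider string, state *ProviderState) {
    // Copy-on-write: build a fresh snapshot, then publish it atomically.
    // If another writer published first, reload and retry.
    for {
        old := r.table.Load().(*routingSnapshot)
        next := old.clone() // clone must copy the providers map, not share it
        next.providers[provider] = state
        next.version = old.version + 1
        next.updated = time.Now()
        if r.table.CompareAndSwap(old, next) {
            break
        }
    }
}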

Result: MĀRGA p50 dropped from 340ms back to 9ms. p99 dropped from 2.1s to 45ms. The routing hot path became effectively zero-cost.

Phase 2: Batch Everything (Week 2-3)

Individual optimization got us back to stable. Batching got us to scale.

Embedding Batches: 50x Throughput

DevOps RAG was making one embedding API call per query. OpenAI’s embedding endpoint accepts up to 2,048 inputs per call. We were leaving 2,047 slots on the table.

import asyncio
import time
from dataclasses import dataclass
from typing import List


@dataclass
class EmbeddingRequest:
    text: str
    future: asyncio.Future


class EmbeddingBatcher:
    """Collects embedding requests, flushes in batches."""

    def __init__(self, client, max_batch=256, max_wait_ms=50):
        self.openai = client  # async OpenAI client, e.g. openai.AsyncOpenAI()
        self.queue: asyncio.Queue[EmbeddingRequest] = asyncio.Queue()
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms
        self._task = None

    def start(self):
        """Start the background flush loop. Call once from a running event loop."""
        self._task = asyncio.create_task(self._flush_loop())

    async def embed(self, text: str) -> List[float]:
        """Submit a single text, get back its embedding."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put(EmbeddingRequest(text=text, future=future))
        return await future

    async def _flush_loop(self):
        """Background task: collect requests, flush as batch."""
        while True:
            batch: List[EmbeddingRequest] = []

            # Wait for first request
            first = await self.queue.get()
            batch.append(first)

            # Collect more for up to max_wait_ms
            deadline = time.monotonic() + self.max_wait_ms / 1000
            while len(batch) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    req = await asyncio.wait_for(
                        self.queue.get(), timeout=remaining
                    )
                    batch.append(req)
                except asyncio.TimeoutError:
                    break

            # One API call for the entire batch
            texts = [r.text for r in batch]
            try:
                response = await self.openai.embeddings.create(
                    model="text-embedding-3-small",
                    input=texts
                )
                for req, emb in zip(batch, response.data):
                    if not req.future.done():  # the caller may have been cancelled
                        req.future.set_result(emb.embedding)
            except Exception as e:
                for req in batch:
                    if not req.future.done():
                        req.future.set_exception(e)

The batcher waits up to 50ms to collect requests, then fires one API call. At 200 queries/minute, batches average 10-15 items. One API call instead of 15. Rate limit usage dropped from 2,998 RPM to ~200 RPM.

More importantly: embedding latency decreased. A single batched call with 15 texts takes ~80ms. Fifteen individual calls take ~80ms each but compete for rate limit slots, creating head-of-line blocking. Batching is both cheaper and faster.
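To pick `max_batch` and `max_wait_ms` for a given traffic shape, it helps to replay the batching rule offline against recorded arrival times. The sketch below is a simplified model, not the production batcher: `replay_batches` is a hypothetical helper, and it assumes the API call itself takes no time, so a new window opens as soon as the next request arrives.

```python
from typing import List


def replay_batches(arrivals_ms: List[float], max_batch: int = 256,
                   max_wait_ms: float = 50.0) -> List[List[float]]:
    """Group arrival timestamps (milliseconds) into batches.

    A batch opens when its first request arrives and closes when either
    max_wait_ms has elapsed since that first arrival or max_batch
    requests have been collected.
    """
    batches: List[List[float]] = []
    current: List[float] = []
    for t in sorted(arrivals_ms):
        if current:
            window_closed = (t - current[0]) >= max_wait_ms
            batch_full = len(current) >= max_batch
            if window_closed or batch_full:
                batches.append(current)
                current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    arrivals = [0, 10, 20, 49, 50, 60, 200]
    for batch in replay_batches(arrivals):
        print(len(batch), batch)
```

Feeding it a day of real request timestamps gives the average batch size, and therefore the API call count, before any window setting is changed in production.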

LLM Response Caching

40% of MĀRGA’s requests were near-duplicates — the same prompt template with minor variations, or identical classification requests. We added a semantic cache with configurable TTL.

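type SemanticCache struct {
    store      *ristretto.Cache  // High-performance concurrent cache
    embedder   Embedder
    index      VectorIndex       // Nearest-neighbor index for the slow path (type name assumed)
    metrics    *MetricsCollector // Hit/miss counters used in Get
    threshold  float64           // Cosine similarity threshold (0.97)
    maxEntries int64
}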

func (sc *SemanticCache) Get(req *LLMRequest) (*LLMResponse, bool) {
    // Hash the request for exact match first (fast path)
    exactKey := hashRequest(req)
    if cached, found := sc.store.Get(exactKey); found {
        sc.metrics.IncrHit("exact")
        return cached.(*LLMResponse), true
    }
    
    // Semantic similarity check for near-duplicates (slow path)
    if req.AllowSemanticCache {
        embedding := sc.embedder.Embed(req.Prompt)
        neighbors := sc.index.Search(embedding, 3)
        for _, n := range neighbors {
            if n.Similarity > sc.threshold {
                sc.metrics.IncrHit("semantic")
                return n.Response, true
            }
        }
    }
    
    sc.metrics.IncrMiss()
    return nil, false
}

Cache hit rate after one week: 38% exact matches, 7% semantic matches. That’s 45% of LLM calls eliminated entirely. At our volume, that’s ~1,100 fewer API calls per day.

Cost impact: Daily inference dropped from $380 (the 847-user crisis day) to $95. The cache alone saved $120/day.
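The 0.97 threshold is easier to reason about with the similarity check written out. The following is a sketch under simplifying assumptions: `semantic_lookup` is a hypothetical helper that scans every cached entry linearly, whereas the Go code above queries a nearest-neighbor index, and the two-dimensional vectors are toy data rather than real embeddings.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(query: Sequence[float],
                    entries: List[Tuple[Sequence[float], str]],
                    threshold: float = 0.97) -> Optional[str]:
    """Return the cached response whose embedding is most similar to the
    query, but only if that similarity exceeds the threshold."""
    best_response: Optional[str] = None
    best_similarity = threshold
    for embedding, response in entries:
        similarity = cosine_similarity(query, embedding)
        if similarity > best_similarity:
            best_similarity = similarity
            best_response = response
    return best_response


if __name__ == "__main__":
    cache = [([1.0, 0.0], "answer A"), ([0.0, 1.0], "answer B")]
    print(semantic_lookup([1.0, 0.1], cache))  # close to A, above the threshold
    print(semantic_lookup([1.0, 1.0], cache))  # halfway between, so a miss
```

A high threshold trades hit rate for safety: a semantic hit returns an answer to a different prompt, so a false positive is a wrong answer, not just a slow one.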

Phase 3: Infrastructure That Scales to Zero (Week 3-4)

Cloud Run’s scale-to-zero is its best feature and its worst trap. When a container scales from 0 to 1, there’s a cold start penalty. For AI services with large model files or warm caches, that penalty is brutal.

The Cold Start Problem

Our cold start times before optimization:

Service	Cold Start	Warm Request	Penalty
MĀRGA	4.2s	9ms	467x
DevOps RAG	8.7s	120ms	72x
RAKṢĀ	6.1s	450ms	14x

DevOps RAG’s 8.7-second cold start was loading the BM25 index and cross-encoder model. MĀRGA was loading provider configurations and warming the health check circuit. RAKṢĀ was initializing Semgrep rulesets.

Minimum Instances: The $12/Month Insurance Policy

Cloud Run’s min-instances setting keeps containers warm. We set MĀRGA to min-instances: 1 and left the others at 0.

Why only MĀRGA? Because MĀRGA is the front door. Every LLM request hits MĀRGA first. A 4.2-second cold start on the router is unacceptable — users see it directly. DevOps RAG and RAKṢĀ are backend services; their cold starts are hidden behind MĀRGA’s request queue.

Cost: One always-on Cloud Run instance costs ~$12/month for our container size. That’s the insurance premium for eliminating MĀRGA cold starts.

Lazy Initialization for Everything Else

For DevOps RAG and RAKṢĀ, we moved all heavy initialization behind the first request:

import asyncio

class LazyRAGService:
    def __init__(self):
        self._bm25_index = None
        self._cross_encoder = None
        self._ready = asyncio.Event()
        self._init_lock = asyncio.Lock()
    
    async def _ensure_initialized(self):
        if self._ready.is_set():
            return
        async with self._init_lock:
            if self._ready.is_set():
                return
            
            # Load in parallel
            self._bm25_index, self._cross_encoder = await asyncio.gather(
                self._load_bm25(),
                self._load_cross_encoder()
            )
            self._ready.set()
    
    async def query(self, text: str) -> RAGResponse:
        await self._ensure_initialized()
        # ... use initialized components

The container starts in 800ms (just the HTTP server). First request takes 8.7s (includes initialization). All subsequent requests take 120ms. The first user pays the cold start; everyone else gets warm performance.

We combined this with a Cloud Run startup probe that pre-warms during deployment:

# cloud-run-service.yaml
spec:
  template:
    spec:
      containers:
        - name: devops-rag
          startupProbe:
            httpGet:
              path: /healthz?warmup=true
              port: 8080
            initialDelaySeconds: 2
            periodSeconds: 5
            failureThreshold: 10

The /healthz?warmup=true endpoint triggers initialization. By the time Cloud Run marks the container as healthy, it’s fully warm.

Phase 4: Cost Management at Scale (Ongoing)

At 10K users, cost management isn’t an optimization — it’s a survival skill.

The Cost Pyramid

                    ┌────────┐
                    │ Opus   │  $15/M tokens
                    │ 4%     │  Complex reasoning, code gen
                   ┌┴────────┴┐
                   │ Sonnet    │  $3/M tokens
                   │ 18%      │  Summarization, analysis
                  ┌┴──────────┴┐
                  │ Haiku       │  $0.25/M tokens
                  │ 31%        │  Simple generation, chat
                 ┌┴────────────┴┐
                 │ Local (Qwen)  │  $0/M tokens
                 │ 47%          │  Classification, extraction
                └┴──────────────┴┘
                
Daily cost at 10K users: ~$165
Without tiered routing: ~$890
Savings: 81%

The cost pyramid evolved as we scaled. At alpha (5 users), 62% went to local models. At 10K, only 47% goes local — because the user mix shifted. Alpha users were internal developers making classification-heavy requests. Production users ask more complex questions that require Sonnet or Opus.

This is the insight most scaling guides miss: your cost distribution shifts with your user profile. The optimization that works at 100 users may not work at 10,000.
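The pyramid can be read as a lookup table plus a weighted average. The sketch below is illustrative only: the operation names, the `pick_tier` helper, and the fallback to Sonnet for unknown operations are assumptions, and the blended price treats every request as using the same number of tokens, which real traffic does not.

```python
from typing import Dict

# Price per million tokens, taken from the pyramid above.
TIER_PRICE_PER_M: Dict[str, float] = {
    "opus": 15.00,
    "sonnet": 3.00,
    "haiku": 0.25,
    "local": 0.00,
}

# Operation-to-tier mapping, following the pyramid's labels.
OPERATION_TIER: Dict[str, str] = {
    "complex_reasoning": "opus",
    "code_generation": "opus",
    "summarization": "sonnet",
    "analysis": "sonnet",
    "simple_generation": "haiku",
    "chat": "haiku",
    "classification": "local",
    "extraction": "local",
}


def pick_tier(operation: str) -> str:
    """Cheapest tier considered capable of the operation.
    Unknown operations fall back to "sonnet" (an assumption)."""
    return OPERATION_TIER.get(operation, "sonnet")


def blended_price_per_m(mix: Dict[str, float]) -> float:
    """Weighted average price per million tokens for a request mix,
    assuming equal token counts per request across tiers."""
    return sum(share * TIER_PRICE_PER_M[tier] for tier, share in mix.items())


if __name__ == "__main__":
    mix_at_10k = {"opus": 0.04, "sonnet": 0.18, "haiku": 0.31, "local": 0.47}
    print(pick_tier("classification"))
    print(round(blended_price_per_m(mix_at_10k), 4))
```

Rerunning `blended_price_per_m` with a different mix shows how sensitive the bill is to even a few points of traffic moving up a tier.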

Per-Tenant Cost Attribution

At scale, “our daily cost is $165” is useless information. You need to know who is driving cost and why.

// Every LLM request carries tenant context
type CostAttribution struct {
    TenantID      string
    ServiceName   string
    OperationType string  // "classification", "generation", "analysis"
    ModelUsed     string
    InputTokens   int
    OutputTokens  int
    CostUSD       float64
    CachedHit     bool
}

// Emitted as a Datadog metric on every request
func (ca *CostAttribution) EmitMetrics() {
    tags := []string{
        fmt.Sprintf("tenant:%s", ca.TenantID),
        fmt.Sprintf("service:%s", ca.ServiceName),
        fmt.Sprintf("operation:%s", ca.OperationType),
        fmt.Sprintf("model:%s", ca.ModelUsed),
        fmt.Sprintf("cached:%v", ca.CachedHit),
    }
    statsd.Distribution("marga.request.cost", ca.CostUSD, tags, 1.0)
    statsd.Count("marga.request.tokens.input", int64(ca.InputTokens), tags, 1.0)
    statsd.Count("marga.request.tokens.output", int64(ca.OutputTokens), tags, 1.0)
}

This lets us build Datadog dashboards that answer: Which tenant consumes the most Opus tokens? Which operation type has the worst cost/value ratio? What’s our blended cost per request by service?
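The same questions can be answered offline from exported records. This is a minimal sketch, assuming the attribution data has been exported as plain records: `CostRecord` mirrors only three fields of the Go struct, and the tenants and dollar amounts in the usage example are made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class CostRecord:
    tenant_id: str
    model_used: str
    cost_usd: float


def cost_by_tenant_and_model(
        records: Iterable[CostRecord]) -> Dict[str, Dict[str, float]]:
    """Total spend, keyed first by tenant and then by model."""
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for record in records:
        totals[record.tenant_id][record.model_used] += record.cost_usd
    return {tenant: dict(models) for tenant, models in totals.items()}


def top_tenant_for_model(records: Iterable[CostRecord],
                         model: str) -> Optional[str]:
    """Tenant with the highest spend on one model, or None if unused."""
    totals = cost_by_tenant_and_model(records)
    spend = {t: m[model] for t, m in totals.items() if model in m}
    return max(spend, key=spend.get) if spend else None


if __name__ == "__main__":
    sample = [
        CostRecord("acme", "opus", 0.50),
        CostRecord("acme", "haiku", 0.25),
        CostRecord("globex", "opus", 0.75),
        CostRecord("acme", "opus", 0.50),
    ]
    print(cost_by_tenant_and_model(sample))
    print(top_tenant_for_model(sample, "opus"))
```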

The $165/Day Budget at 10K Users

Category	Daily Cost	% of Total	Notes
LLM inference (Opus)	$42	25%	Complex reasoning only
LLM inference (Sonnet)	$38	23%	Analysis, summarization
LLM inference (Haiku)	$18	11%	Simple generation
LLM inference (Local)	$0	0%	47% of requests
Embedding API	$12	7%	After batching optimization
Cloud Run compute	$28	17%	3 services, auto-scaling
Datadog observability	$15	9%	APM, logs, metrics
Vector store (Pinecone)	$12	7%	Serverless tier
Total	$165	100%	$0.0165/user/day

$0.0165 per user per day. With unit economics like that, 10K users cost less than most SaaS companies spend on Slack.

Phase 5: Monitoring That Scales With You

At 5 users, you can read every log line. At 10K, you need monitoring that surfaces problems before users notice them.

The Three Dashboards That Matter

We run three Datadog dashboards, each targeting a different audience:

Dashboard 1: Business Health (checked daily). Active users, revenue per request, cost per request by tier, error rate by service, user satisfaction proxy (retry rate; high retries suggest frustrated users).

Dashboard 2: Service Health (checked continuously by alerts). p50/p95/p99 latency per service, admission controller rejection rate, cache hit rate, provider health and failover events, container scaling events.

Dashboard 3: Cost Intelligence (checked weekly). Cost by tenant, service, model, operation. Token waste analysis. Cache miss patterns. Embedding batch efficiency.

SLO-Based Alerting

Early on, we set threshold alerts: “Alert if p95 > 500ms.” This generated noise. A momentary spike at 3 AM when one user sent 50 rapid requests isn’t an incident.

We switched to SLO-based alerting: “Alert if we burn more than 20% of our monthly error budget in one hour.”

99.5% of requests must complete under our latency budget over a 30-day window. That’s 2,160 “free” failures per month at our volume. A 3 AM spike uses a few of those — no alert. A sustained degradation during peak hours burns through the budget fast — alert.
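The arithmetic behind that rule is short enough to write down. In this sketch the function names are hypothetical, and the figure of 432,000 requests per 30-day window is not stated anywhere above; it is simply what a 2,160-failure budget at 99.5% implies.

```python
def error_budget(total_requests: int, slo: float) -> float:
    """Number of requests allowed to miss the SLO over the window."""
    return total_requests * (1.0 - slo)


def burn_fraction(failures_in_window: int, budget: float) -> float:
    """Share of the whole budget consumed by the failures observed."""
    return failures_in_window / budget


def should_alert(failures_last_hour: int, budget: float,
                 max_burn: float = 0.20) -> bool:
    """Alert when one hour burns more than max_burn of the monthly budget."""
    return burn_fraction(failures_last_hour, budget) > max_burn


if __name__ == "__main__":
    budget = error_budget(432_000, 0.995)  # roughly 2,160 failures
    print(round(budget))
    print(should_alert(50, budget))   # a brief 3 AM spike
    print(should_alert(600, budget))  # sustained degradation at peak
```

With these numbers the alert fires once a single hour produces more than about 432 failed requests, which a burst from one user will not reach.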

Predictive Metrics

The most valuable metrics aren’t the ones that tell you something broke. They’re the ones that tell you something will break.

// Predictive metrics we track
func (s *Server) emitPredictiveMetrics() {
    // 1. Queue depth trend (growing queue = future timeouts)
    statsd.Gauge("marga.admission.queue_depth", 
        float64(s.admission.QueueDepth()), nil, 1.0)
    
    // 2. Cache eviction rate (high evictions = cache too small)
    statsd.Count("marga.cache.evictions", 
        s.cache.EvictionCount(), nil, 1.0)
    
    // 3. Provider headroom (how close to rate limit?)
    for _, p := range s.providers {
        headroom := p.RateLimit - p.CurrentRPM
        statsd.Gauge("marga.provider.headroom",
            float64(headroom),
            []string{fmt.Sprintf("provider:%s", p.Name)},
            1.0)
    }
    
    // 4. Goroutine count (leak detection)
    statsd.Gauge("marga.runtime.goroutines",
        float64(runtime.NumGoroutine()), nil, 1.0)
    
    // 5. Memory growth rate (OOM prediction)
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    statsd.Gauge("marga.runtime.heap_mb",
        float64(m.HeapAlloc) / 1024 / 1024, nil, 1.0)
}

The provider headroom metric is the most actionable. When headroom drops below 20% of a provider’s rate limit, we alert. Nothing is broken at that point, but we’re one traffic spike away from rate limiting.
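Because the gauge above emits headroom as an absolute number of requests per minute, the 20% rule needs a normalization step somewhere. A minimal sketch of that step follows; the function names are hypothetical, and whether the division lives in the service or in the monitor definition is a deployment choice.

```python
def headroom_fraction(rate_limit: int, current_rpm: int) -> float:
    """Unused share of a provider's rate limit, clamped to [0, 1]."""
    if rate_limit <= 0:
        return 0.0
    unused = (rate_limit - current_rpm) / rate_limit
    return min(1.0, max(0.0, unused))


def headroom_alert(rate_limit: int, current_rpm: int,
                   min_fraction: float = 0.20) -> bool:
    """True when less than min_fraction of the rate limit is left."""
    return headroom_fraction(rate_limit, current_rpm) < min_fraction


if __name__ == "__main__":
    print(headroom_alert(3000, 2998))  # the 847-user incident
    print(headroom_alert(3000, 200))   # after batching
```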

User Growth Patterns: What the Data Shows

Users (concurrent)
10,000 ┤                                          ╭──────
 8,000 ┤                                    ╭─────╯
 6,000 ┤                              ╭─────╯
 4,000 ┤                        ╭─────╯
 2,000 ┤                  ╭─────╯
 1,000 ┤            ╭─────╯
   847 ┤      ╭──╳──╯  ← The wall (Week 4)
   500 ┤   ╭──╯
   100 ┤╭──╯
     5 ┤╯ ← Alpha
       └─┬────┬────┬────┬────┬────┬────┬────┬────┬
       W1   W2   W3   W4   W5   W6   W7   W8   W9

Usage Patterns We Didn’t Expect

1. Power law distribution. 8% of users generate 62% of LLM calls. These are developers running automation, the exact users MĀRGA was built for. We optimized their paths first.

2. Time-of-day bimodality. Traffic peaks at 10 AM and 2 PM SGT (morning and post-lunch coding sessions). But our most expensive requests come at 11 PM, when developers run batch jobs before going to sleep. We auto-downgrade batch requests to cheaper models during off-peak when latency tolerance is higher.

3. Retry storms after degradation. When we returned 503s during the wall incident, well-behaved clients retried with backoff. Poorly-behaved clients retried immediately, amplifying the load 3x. We added client identification to admission control — known retry-storm clients get longer Retry-After values.

// Adaptive retry-after based on client behavior
func (ac *AdmissionController) RetryAfter(clientID string) int {
    history := ac.clientHistory.Get(clientID)
    if history == nil {
        return 2  // Default: 2 seconds
    }
    
    recentRetries := history.RetriesInWindow(30 * time.Second)
    switch {
    case recentRetries > 10:
        return 30  // Aggressive retrier: back off hard
    case recentRetries > 5:
        return 10  
    default:
        return 2
    }
}
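The handler above relies on a per-client history that can count retries in a recent window. One way to build that is a timestamp deque, sketched here. `RetryHistory` is a hypothetical stand-in for whatever backs `clientHistory`, it takes the current time as an argument so it can be tested, and it assumes callers always query with the same window length because old timestamps are discarded as it goes.

```python
from collections import deque
from typing import Deque


class RetryHistory:
    """Sliding-window counter of one client's retry timestamps (seconds)."""

    def __init__(self) -> None:
        self._stamps: Deque[float] = deque()

    def record(self, now: float) -> None:
        self._stamps.append(now)

    def retries_in_window(self, window_s: float, now: float) -> int:
        """Count retries in the last window_s seconds, dropping older ones."""
        cutoff = now - window_s
        while self._stamps and self._stamps[0] < cutoff:
            self._stamps.popleft()
        return len(self._stamps)


def retry_after_seconds(recent_retries: int) -> int:
    """Same tiers as the Go handler above: 30s, 10s, or the 2s default."""
    if recent_retries > 10:
        return 30
    if recent_retries > 5:
        return 10
    return 2


if __name__ == "__main__":
    history = RetryHistory()
    for second in range(12):  # twelve retries, one per second
        history.record(float(second))
    print(retry_after_seconds(history.retries_in_window(30.0, now=12.0)))
```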

What Most People Miss

The scaling story everyone tells is about adding servers, optimizing queries, and tuning infrastructure. That’s the easy part. The hard part is this:

Scaling AI products is fundamentally a cost problem, not a compute problem.

Traditional web apps scale by adding compute. AI products scale by reducing compute through caching, tiering, batching, and routing. Every request you can serve from cache, route to a cheaper model, or batch with other requests is money saved and capacity freed.

Users	Cost/User/Day	Total Daily	Notes
5	$9.00	$45	Alpha, no optimization
100	$1.50	$150	Added MĀRGA tiering
847	$0.45	$380	Pre-optimization crisis
1,000	$0.095	$95	Post-optimization
5,000	$0.025	$125	Caching + batching
10,000	$0.0165	$165	Fully optimized

Cost per user dropped 545x from alpha to 10K. That’s not linear scaling — that’s the compounding effect of every optimization layer stacking.
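The per-user column and the 545x figure follow directly from the table. A quick check, using the table's user counts and daily totals as inputs (the helper names are made up for the example):

```python
from typing import List, Tuple

# (concurrent users, total daily cost in USD), copied from the table above.
SCALE_POINTS: List[Tuple[int, float]] = [
    (5, 45.0),
    (100, 150.0),
    (847, 380.0),
    (1_000, 95.0),
    (5_000, 125.0),
    (10_000, 165.0),
]


def cost_per_user(users: int, total_daily_usd: float) -> float:
    return total_daily_usd / users


def improvement_factor(points: List[Tuple[int, float]]) -> float:
    """Ratio of the first row's cost per user to the last row's."""
    first_users, first_cost = points[0]
    last_users, last_cost = points[-1]
    return cost_per_user(first_users, first_cost) / cost_per_user(last_users, last_cost)


if __name__ == "__main__":
    for users, total in SCALE_POINTS:
        print(users, round(cost_per_user(users, total), 4))
    print(round(improvement_factor(SCALE_POINTS)))
```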

Common Mistakes We Made (So You Don’t Have To)

1. Optimizing before measuring. We spent three days building a custom embedding cache before discovering that only 12% of embedding requests were duplicates. The cache saved almost nothing. Rule: Instrument for one week. Then optimize the biggest line item. Repeat.

2. Scaling horizontally when the problem was vertical. When MĀRGA hit the lock convoy, our first instinct was “add more instances.” We scaled to 8 Cloud Run instances. The lock contention got worse because more instances meant more concurrent requests hitting the same bottleneck. Rule: Profile before scaling. If the bottleneck is inside a single process, horizontal scaling makes it worse.

3. Using the same SLA for every request. We initially promised <200ms for all MĀRGA requests. That meant routing classification requests through the same priority queue as code generation. Rule: Classify requests into priority lanes.

4. Ignoring embedding costs. Embeddings look cheap ($0.02/M tokens). At 200 queries/minute × 500 tokens average, that’s 6M tokens/hour = $2.88/day. But we were also re-embedding 50K document chunks on every index update. Rule: Cache embeddings aggressively. Embed once, store forever, invalidate on content hash change.

5. Not testing the scale-down path. We tested scaling up extensively but never tested scaling down. When traffic dropped at midnight, Cloud Run evicted containers with warm caches. The next morning’s traffic spike hit cold containers. Rule: Configure max-scale-down-rate to limit how fast containers are evicted.
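The rule in mistake 4 ("embed once, invalidate on content hash change") is small enough to sketch. The class below is an illustration under stated assumptions: `ContentHashEmbeddingCache` is a hypothetical name, it stores vectors in an in-memory dict where a real deployment would use a persistent store, and `embed_fn` stands in for the actual embedding call. `daily_embedding_cost` reproduces the $2.88/day query-side arithmetic.

```python
import hashlib
from typing import Callable, Dict, List


def daily_embedding_cost(queries_per_min: float, tokens_per_query: float,
                         usd_per_m_tokens: float) -> float:
    """Cost of embedding every query for 24 hours."""
    tokens_per_day = queries_per_min * tokens_per_query * 60 * 24
    return tokens_per_day / 1_000_000 * usd_per_m_tokens


class ContentHashEmbeddingCache:
    """Keys embeddings by the SHA-256 of the text, so unchanged chunks are
    never re-embedded and edited chunks miss automatically."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


if __name__ == "__main__":
    print(round(daily_embedding_cost(200, 500, 0.02), 2))
    cache = ContentHashEmbeddingCache(lambda text: [float(len(text))])
    for chunk in ["alpha", "beta", "alpha"]:  # "alpha" is embedded only once
        cache.get(chunk)
    print(cache.misses)
```

On an index update, only chunks whose text changed produce a new hash, so the re-embedding bill tracks the size of the edit and not the size of the corpus.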

The Architecture at 10K Users

                            ┌──────────────────┐
                            │   Cloudflare CDN  │
                            │   (static + WAF)  │
                            └────────┬─────────┘
                                     │
                            ┌────────▼─────────┐
                            │  Cloud Run LB     │
                            │  (global)         │
                            └────────┬─────────┘
                                     │
                     ┌───────────────┼───────────────┐
                     │               │               │
              ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
              │   MĀRGA     │ │ DevOps RAG  │ │   RAKṢĀ     │
              │  (1-12)     │ │   (0-8)     │ │   (0-6)     │
              │  min: 1     │ │   min: 0    │ │   min: 0    │
              └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
                     │               │               │
    ┌────────────────┼───────────────┼───────────────┘
    │                │               │
    ▼                ▼               ▼
┌────────┐    ┌──────────┐    ┌──────────┐
│Semantic│    │ Pinecone │    │ Semgrep  │
│ Cache  │    │ + BM25   │    │ Rules    │
└────┬───┘    └──────────┘    └──────────┘
     │
     ▼
┌──────────────────────────────────┐
│  Provider Pool                    │
│  ┌────────┐ ┌────────┐ ┌──────┐ │
│  │OpenAI  │ │Anthropic│ │Local │ │
│  │(backup)│ │(primary)│ │(Qwen)│ │
│  └────────┘ └────────┘ └──────┘ │
└──────────────────────────────────┘
         │
         ▼
  ┌──────────────┐
  │  Datadog APM  │
  │  + Metrics    │
  │  + Logs       │
  └──────────────┘

Max instances at peak: 26 containers total. Cost at peak: $0.89/hour. Cost at idle (2 AM): $0.02/hour (just MĀRGA’s min-instance).

What’s Next: The Road to 100K

10K is a milestone, not a destination. Here’s what we’re building for the next order of magnitude:

Edge routing. Move MĀRGA’s classification layer to Cloudflare Workers. Routing decisions at the edge eliminate a network hop and reduce p50 to <5ms globally.
Speculative execution. For high-priority requests, send to two providers simultaneously and return the first response. Cost doubles but latency drops to the minimum of both.
Fine-tuned small models. Instead of routing classification to generic Qwen 4B, fine-tune a 1B model on our specific tasks. 4x faster inference at equivalent accuracy.
Regional data residency. Multi-region Pinecone + regional Cloud Run deployments for enterprise data requirements.

None of these are complex. They’re just the next increment on the same principle: reduce the cost and latency of every request, at every layer, continuously.

The Punchline

Scaling AI products to 10,000 users cost us $165/day, zero hires, and about six weeks of focused engineering. The system that handles 10K today could handle 50K with the same architecture — we’d just need more Cloud Run instances and a larger cache.

The counterintuitive insight: scaling AI is cheaper per user than scaling traditional software, because every optimization compounds. Caching eliminates requests. Batching amortizes costs. Tiering routes work to the cheapest capable model. At each scale milestone, you discover new optimization opportunities that weren’t visible at the previous scale.

The wall at 847 users felt like a crisis. In hindsight, it was the best thing that happened to us. It forced us to build the performance infrastructure that made 10K possible — and that will make 100K routine.

Build for the crisis you haven’t hit yet. It’s coming.

Gaurav Sharma is the founder of Avyay (अव्यय). He builds distributed AI systems on consumer hardware and ships code at 3 AM — not because he’s awake, but because the build engine doesn’t sleep. Follow the scaling journey at avyay.ai/blog.