Engineering · May 2026 · 14 min read

Building AI Products at Scale: 7 Lessons That Cost Us Months

We built three production AI products — an LLM router, a security scanner, and a RAG pipeline — across 5 microservices with two people. Here’s what we learned about architecture, cost, quality, and the tradeoffs nobody warns you about.

Every AI startup has the same origin story: a prototype that works in a notebook, a demo that impresses investors, and a production deployment that humbles everyone involved.

At Avyay, we’ve shipped three AI products — MĀRGA (an LLM router), RAKṢĀ (a code security scanner), and DevOps RAG (an operational knowledge pipeline). They run across five microservices, three cloud regions, and a Tailscale mesh connecting consumer hardware to Google Cloud Run. Two people operate everything.

This article isn’t about the architecture (we covered that here). It’s about the decisions that shaped the architecture — the wrong turns, the expensive experiments, the moments where conventional wisdom was dead wrong, and the handful of insights that saved us.

These are field notes, not theory. Every lesson cost us either time, money, or both.


Lesson 1: Your First Architecture Decision Is Your Provider Strategy — Not Your Framework

Most teams start an AI product by choosing a framework: LangChain, LlamaIndex, Haystack, CrewAI. Then they choose a provider: OpenAI, usually. Then they build the product around both assumptions.

This is backwards.

Your first architecture decision should be: how will I swap providers when my primary one fails, raises prices, or gets outperformed by a competitor?

We learned this at 2 AM when OpenAI returned 500 errors for 47 minutes during a customer demo cycle. Our application called api.openai.com directly. No failover. No fallback. Just a loading spinner and a very awkward email the next morning.

That incident birthed MĀRGA. The core insight: provider diversity isn’t a nice-to-have, it’s a survival requirement.

Here’s what a naive provider integration looks like versus a resilient one:

# ❌ What most teams build (the "it works on my laptop" version)
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

# ✅ What you need in production (provider-agnostic routing)
from openai import OpenAI
client = OpenAI(
    base_url="http://marga.internal:8080/v1",  # MĀRGA intercepts
    api_key="your-marga-key"
)
response = client.chat.completions.create(
    model="auto",  # Let the router decide
    messages=[{"role": "user", "content": prompt}],
    extra_headers={"X-Cost-Tier": "medium"}  # Hint, not mandate
)

The second version looks almost identical. That’s the point. Your application code shouldn’t know or care which provider handles the request. The routing layer handles failover, cost optimization, and compliance — all invisibly.
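To make the failover idea concrete, here is a minimal sketch of what a routing layer might do behind that single endpoint. It is not MĀRGA's implementation: `complete_with_failover`, `AllProvidersFailed`, and the two stub providers are hypothetical names used only for illustration, and a real router would add timeouts and narrower error handling.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every configured provider raised an error."""


def complete_with_failover(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order and return (provider_name, text)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real router would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers standing in for real SDK calls (hypothetical).
def flaky_primary(prompt: str) -> str:
    raise RuntimeError("HTTP 500")


def healthy_backup(prompt: str) -> str:
    return f"echo: {prompt}"


used, text = complete_with_failover(
    "ping", [("primary", flaky_primary), ("backup", healthy_backup)]
)
```

The caller only sees a provider name and a response, which is the property that keeps application code unaware of outages.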

What most people miss: The provider landscape changes faster than your application code. In the past 12 months, we’ve seen GPT-4 pricing drop 80%, Claude Opus launch and immediately become our default for complex reasoning, and local models (Qwen, DeepSeek) reach production quality for classification and extraction tasks. If your architecture hard-codes a provider, you keep paying last year’s prices for last year’s quality.


Lesson 2: 80% of Your LLM Calls Don’t Need an Expensive Model

This is the single most impactful cost optimization we’ve found, and almost nobody implements it.

When we first built MĀRGA, we routed everything through GPT-4. Our daily API spend was $120-180. We were building three products, running an autonomous build engine that dispatches coding agents, and processing hundreds of LLM calls per day. At that rate, we’d burn through $4,000/month on inference alone — before a single customer signed up.

Then we profiled our actual LLM calls. Here’s the distribution we found:

┌────────────────────────────────────────────────────────┐
│            LLM Call Complexity Distribution              │
│                                                          │
│  Simple (classification, extraction, formatting):  62%   │
│  Medium (summarization, analysis, generation):     26%   │
│  Complex (multi-step reasoning, code generation):  12%   │
│                                                          │
│  ████████████████████████████████████████████▓▓▓▓▓▓▓▓░░  │
│  ◄─── Tier 1: $0 (local) ──►◄── T2 ──►◄ T3 ►           │
└────────────────────────────────────────────────────────┘

62% of our calls were simple enough for a 4B parameter local model running on a MacBook. Things like:

  • “Is this a security finding or a false positive?” (binary classification)
  • “Extract the service name and error code from this log line” (structured extraction)
  • “Format this JSON as a markdown table” (template transformation)

We were paying $15 per million tokens for Claude Opus to do work that a free local model handles with 95%+ accuracy. That’s like hiring a neurosurgeon to put on a Band-Aid.

MĀRGA’s tiered routing reduced our daily spend from ~$150 to ~$45 — a 70% reduction — with zero measurable quality degradation on the tasks that moved to cheaper models.

// MĀRGA's cost tier selection (simplified)
func (c *CostOptimizer) SelectTier(req *LLMRequest) Tier {
    complexity := c.estimateComplexity(req)
    
    switch {
    case complexity < 0.3:
        return TierLocal    // Qwen 4B, $0
    case complexity < 0.7:
        return TierMedium   // Sonnet, ~$3/M tokens
    default:
        return TierPremium  // Opus, ~$15/M tokens
    }
}

func (c *CostOptimizer) estimateComplexity(req *LLMRequest) float64 {
    score := 0.0
    
    // Long system prompts suggest complex tasks
    if len(req.SystemPrompt) > 2000 { score += 0.3 }
    
    // Multi-turn conversations need more reasoning
    score += float64(len(req.Messages)) * 0.05
    
    // High temperature suggests creative/open-ended tasks
    if req.Temperature > 0.7 { score += 0.2 }
    
    // Structured output (JSON mode) usually simpler
    if req.ResponseFormat == "json" { score -= 0.15 }
    
    return math.Min(score, 1.0)
}

The hard part isn’t the routing logic; it’s measuring quality across tiers. We run a shadow evaluation pipeline: 5% of Tier 1 requests also go to Tier 3, and we compare outputs. If a task category shows >5% quality degradation on the cheap tier, MĀRGA automatically promotes it to the next tier. This creates a self-tuning system that optimizes cost without human intervention.
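The shadow-evaluation loop can be sketched in a few lines. This is an illustration under assumptions, not MĀRGA's code: the `ShadowEvaluator` class, the per-output quality scores in [0, 1], and the `min_samples` guard are all hypothetical. Only the 5% sample rate and the 5% degradation threshold come from the description above.

```python
import random
from collections import defaultdict


class ShadowEvaluator:
    """Tracks paired quality scores per task category (hypothetical helper)."""

    def __init__(self, sample_rate=0.05, max_degradation=0.05,
                 min_samples=20, rng=None):
        self.sample_rate = sample_rate          # share of cheap-tier calls to shadow
        self.max_degradation = max_degradation  # relative quality drop we tolerate
        self.min_samples = min_samples          # assumed guard against noisy decisions
        self.rng = rng or random.Random()
        self.scores = defaultdict(list)         # category -> [(cheap, premium), ...]

    def should_shadow(self) -> bool:
        """Decide whether this cheap-tier request is also sent to the premium tier."""
        return self.rng.random() < self.sample_rate

    def record(self, category: str, cheap_score: float, premium_score: float) -> None:
        self.scores[category].append((cheap_score, premium_score))

    def needs_promotion(self, category: str) -> bool:
        """True when the cheap tier trails the premium tier by more than the threshold."""
        pairs = self.scores.get(category, [])
        if len(pairs) < self.min_samples:
            return False
        cheap = sum(c for c, _ in pairs) / len(pairs)
        premium = sum(p for _, p in pairs) / len(pairs)
        if premium == 0:
            return False
        return (premium - cheap) / premium > self.max_degradation


evaluator = ShadowEvaluator()
for _ in range(20):
    evaluator.record("summarize", cheap_score=0.80, premium_score=1.0)
    evaluator.record("classify", cheap_score=0.98, premium_score=1.0)
```

How the quality score itself is produced (an LLM judge, exact match, task success) is the hard part and is deliberately left out of the sketch.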

Common mistake: Teams implement cost tiers manually, hardcoding “use GPT-4 for summarization, use Haiku for classification.” This breaks whenever model capabilities shift. Implement it as a scoring function with automated quality monitoring, not a lookup table.


Lesson 3: Security Scanning for AI-Generated Code Is a Different Problem

When AI agents write your code, your security model changes in ways that traditional SAST tools don’t anticipate.

We discovered this the hard way. Our autonomous build engine was shipping 8-12 tasks per day across five codebases. Code quality was high — the agents wrote clean, well-tested, well-documented code. We were feeling good.

Then we audited three weeks of agent-generated commits. What we found:

Finding                               | Count | Traditional SAST | RAKṢĀ
--------------------------------------+-------+------------------+------------
Hardcoded API keys (hallucinated)     | 7     | ✅ Caught        | ✅ Caught
SQL injection via string concat       | 3     | ✅ Caught        | ✅ Caught
Fabricated email addresses in code    | 12    | ❌ Missed        | ✅ Caught
Session tokens in URL parameters      | 4     | ❌ Missed        | ✅ Caught
os.system() with unsanitized input    | 2     | ✅ Caught        | ✅ Caught
GET requests for state mutations      | 6     | ❌ Missed        | ✅ Caught
Overly permissive CORS (*)            | 8     | ⚠️ Warning only  | ✅ Blocked

The key column is the third one. Traditional SAST tools caught the textbook vulnerabilities — the ones in every OWASP Top 10 training. But they missed the AI-specific failure modes:

Hallucinated credentials. AI agents fabricate realistic-looking API keys, email addresses, and URLs. They’ll write dev@avyay.ai (an address that doesn’t exist) in a notification handler, or sk-proj-abc123... as a “placeholder” key that looks real enough to pass a regex check but isn’t in any secrets manager.

Architectural anti-patterns. An agent will write code that passes all pattern-based checks but violates architectural principles. Storing session tokens in query parameters is technically “valid” code, and pattern-based scanners don’t flag ?token=xxx in a URL builder. But it’s a security vulnerability that exposes tokens in server logs, browser history, and referrer headers.

Overly permissive defaults. Agents default to the most permissive configuration because it makes tests pass. CORS: *, allow_origins: ["*"], chmod 777. Each individually trivial. Collectively, a disaster.

RAKṢĀ addresses this with three layers:

  1. Pattern scanning (Semgrep + Bandit): catches the textbook stuff
  2. Hallucination detection: flags fabricated credentials, emails, and URLs that don’t resolve
  3. Architectural rules: custom Semgrep rules for AI-specific anti-patterns

# RAKṢĀ custom rule: GET requests should not mutate state
rules:
  - id: ai-pattern-get-mutation
    languages: [python]
    # pattern-either matches when ANY sub-pattern matches (update OR delete).
    # A plain `patterns:` list would require both to match the same code.
    pattern-either:
      - pattern: |
          @app.get(...)
          def $FUNC(...):
              ...
              $DB.update(...)
      - pattern: |
          @app.get(...)
          def $FUNC(...):
              ...
              $DB.delete(...)
    message: "GET endpoint performs state mutation. Use POST/PUT/DELETE."
    severity: HIGH
    metadata:
      category: ai-specific
      rationale: >
        AI agents frequently use GET for all endpoints because
        it's simpler. This violates HTTP semantics and creates
        CSRF vulnerabilities.

The lesson: If you’re using AI to write production code, your security scanning needs to evolve beyond pattern matching. You need to scan for the failure modes that AI introduces: hallucinated data, permissive defaults, and architectural violations that look “correct” at the function level but are broken at the system level.
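As a rough illustration of the hallucination-detection layer, the sketch below flags email literals that are not on an allowlist and string literals shaped like API keys. It is an assumption-heavy toy, not RAKṢĀ's detector: `find_suspect_literals`, both regexes, and the made-up key in the sample are hypothetical, and a real check would also consult a secrets manager and resolve URLs.

```python
import re

# Deliberately simple patterns for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
KEY_RE = re.compile(r"\bsk-[A-Za-z0-9_-]{8,}")


def find_suspect_literals(source: str, known_emails: set[str]) -> list[tuple[int, str, str]]:
    """Return (line_number, finding_type, literal) for each suspicious literal."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for email in EMAIL_RE.findall(line):
            if email.lower() not in known_emails:
                findings.append((lineno, "unknown-email", email))
        for key in KEY_RE.findall(line):
            findings.append((lineno, "key-like-literal", key))
    return findings


sample = "\n".join([
    'API_KEY = "sk-proj-abc123def456"',  # made-up key for the example
    'notify("dev@avyay.ai")',
    'notify("ops@example.com")',
])
findings = find_suspect_literals(sample, known_emails={"ops@example.com"})
```

The allowlist is what turns a generic "found an email" warning into a "this address is not one of ours" finding.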


Lesson 4: RAG Quality Is a Chunking Problem, Not a Model Problem

When enterprises complain that their RAG system gives wrong answers, they almost always blame the LLM. “We need a better model.” “Let’s try GPT-4 instead of Sonnet.” “Maybe fine-tuning will fix it.”

It won’t. The problem is upstream — in how you chunk, embed, and retrieve documents.

We built DevOps RAG as an operational knowledge system — it ingests runbooks, incident postmortems, and deployment guides, then serves them to coding agents via MCP. The first version used vanilla RAG: chunk at 512 tokens, embed with text-embedding-ada-002, store in Pinecone, retrieve top-5 by cosine similarity.

The results were mediocre. Agents would ask “how do we roll back a failed deployment?” and get chunks about deployment configuration, rollback procedures, and pricing — three unrelated sections from three different documents stitched together into a plausible-sounding but wrong answer.

The fix wasn’t a better model. It was better chunking.

Problem 1: Chunk boundaries sever context. A runbook defines terms in section 1 and uses them in section 5. Splitting at 512 tokens cuts the connection. Fix: hierarchical chunking — we chunk at section level (H2 headers), keep the parent context (H1 + document title) as metadata, and include a 50-token overlap between adjacent chunks.
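A minimal sketch of section-level chunking with parent context and overlap, assuming Markdown input. The `Chunk` dataclass and `chunk_by_h2` function are hypothetical, and whitespace-separated words stand in for tokens to keep the example dependency-free.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_title: str  # parent context (H1) carried as metadata
    section: str    # H2 heading this chunk belongs to
    text: str       # section body, prefixed with the previous section's tail


def chunk_by_h2(markdown: str, overlap_words: int = 50) -> list[Chunk]:
    """Split a Markdown document at H2 headers, keeping a word-level overlap."""
    title = ""
    sections: list[tuple[str, list[str]]] = []
    for line in markdown.splitlines():
        if line.startswith("# ") and not title:
            title = line[2:].strip()
        elif line.startswith("## "):
            sections.append((line[3:].strip(), []))
        elif sections:
            sections[-1][1].append(line)

    chunks: list[Chunk] = []
    prev_tail: list[str] = []
    for heading, lines in sections:
        words = " ".join(lines).split()
        chunks.append(Chunk(title, heading, " ".join(prev_tail + words)))
        # Carry the last N words forward so the next chunk keeps local context.
        prev_tail = words[-overlap_words:] if overlap_words > 0 else []
    return chunks


doc = (
    "# Deploy Runbook\n"
    "## Terms\n"
    "canary means partial rollout\n"
    "## Rollback\n"
    "revert the canary first\n"
)
chunks = chunk_by_h2(doc, overlap_words=2)
```

Text that appears before the first H2 is dropped in this sketch; a production chunker would need a policy for it.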

Problem 2: Vector similarity ≠ relevance. “Rollback a deployment” is semantically similar to “Configure a deployment” because they share the word “deployment.” But they answer completely different questions. Fix: hybrid retrieval — vector similarity for recall, BM25 keyword matching for precision, then rerank with a cross-encoder.

# DevOps RAG hybrid retrieval (simplified)
from typing import List  # Chunk is the pipeline's own chunk type (not shown)

class HybridRetriever:
    def retrieve(self, query: str, top_k: int = 5) -> List[Chunk]:
        # Stage 1: Cast a wide net
        vector_results = self.pinecone.query(
            vector=self.embed(query),
            top_k=top_k * 3  # Over-retrieve for reranking
        )
        bm25_results = self.bm25_index.search(query, top_k=top_k * 3)
        
        # Stage 2: Merge and deduplicate
        candidates = self.reciprocal_rank_fusion(
            vector_results, bm25_results
        )
        
        # Stage 3: Rerank with cross-encoder
        scored = self.cross_encoder.rerank(query, candidates)
        
        return scored[:top_k]
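The merge step in Stage 2 relies on reciprocal rank fusion, which scores each document by summing 1 / (k + rank) across the result lists. A minimal standalone sketch follows. The function signature, the use of plain ID strings, and the tie-break on ID are assumptions for illustration; k = 60 is the constant commonly used for this method.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings: list[str], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion(vector_hits, bm25_hits)
```

Because only ranks are used, the method needs no score normalization between cosine similarity and BM25, which is why it is a common default for hybrid retrieval.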

Problem 3: No freshness signal. When a runbook is updated (post-incident revision), the old chunks linger in the vector store alongside the new ones. An agent might retrieve outdated procedures. Fix: Git-native ingestion — we index from Git, track commit SHAs per chunk, and tombstone chunks from deleted or modified files. Every PR merge triggers a re-index.
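One way to picture the tombstoning step is as a diff between two manifests: what the index believes it holds and what Git HEAD contains. The sketch below assumes both are available as path-to-SHA dictionaries; `plan_reindex` and the sample data are hypothetical.

```python
def plan_reindex(indexed: dict[str, str], head: dict[str, str]) -> tuple[set[str], set[str]]:
    """Compare path -> SHA manifests and return (paths_to_tombstone, paths_to_index)."""
    # Chunks are stale if their file was deleted or its SHA changed.
    to_tombstone = {path for path, sha in indexed.items() if head.get(path) != sha}
    # Files are (re)indexed if they are new or their SHA changed.
    to_index = {path for path, sha in head.items() if indexed.get(path) != sha}
    return to_tombstone, to_index


index_manifest = {"deploy.md": "1a", "rollback.md": "2b", "legacy.md": "3c"}
git_head = {"deploy.md": "1a", "rollback.md": "9f", "oncall.md": "4d"}
to_tombstone, to_index = plan_reindex(index_manifest, git_head)
```

A modified file shows up in both sets: its old chunks are tombstoned and its new content is indexed.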

After these three changes, DevOps RAG’s answer quality (measured by agent task completion rate when using RAG-provided context) went from 64% to 89%. The LLM didn’t change. The embedding model didn’t change. We just fixed the plumbing.

What most people miss: RAG quality degrades silently. Unlike a 500 error or a timeout, a wrong-but-confident answer doesn’t trigger an alert. You need to build evaluation into the pipeline: track answer quality with user feedback, automated checks, or shadow comparisons. If you don’t measure retrieval quality, you can’t improve it.


Lesson 5: The Quality vs. Speed Tradeoff Is a Dial, Not a Switch

“Should we prioritize quality or speed?” is the wrong question. The right question is: “For this specific request, at this cost point, what’s the optimal position on the quality-speed spectrum?”

Different requests have different tolerance profiles:

High Quality, Slow                                    Low Quality, Fast
◄─────────────────────────────────────────────────────────────────────►

Security scan of a PR         Classifying log severity     Auto-generating
before merge to main     →    during incident triage   →   task descriptions
                                                           for build queue

Accuracy: 99%+ required       Accuracy: 90% acceptable     Accuracy: 80% fine
Latency: 30s acceptable       Latency: <2s required         Latency: <500ms
Cost: $0.10/scan fine         Cost: <$0.001/call            Cost: $0/call
Model: Opus                   Model: Sonnet                 Model: Qwen 4B local

MĀRGA implements this as a request-level configuration, not a system-wide setting:

# Each service declares its quality requirements
# RAKṢĀ: High quality, latency tolerant
response = client.chat.completions.create(
    model="auto",
    messages=messages,
    extra_headers={
        "X-Cost-Tier": "premium",
        "X-Quality-Min": "0.95",
        "X-Latency-Budget": "30000"  # 30s
    }
)

# Build engine: Fast and cheap, quality flexible
response = client.chat.completions.create(
    model="auto", 
    messages=messages,
    extra_headers={
        "X-Cost-Tier": "local",
        "X-Latency-Budget": "500"  # 500ms
    }
)

The key insight: quality and speed requirements are properties of the request, not the system. A single product (RAKṢĀ) might need Opus-level quality for analyzing a critical finding and Qwen-level speed for triaging whether a file is even worth scanning.
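On the router side, treating these requirements as properties of the request means parsing the headers into a per-request budget. The sketch below is a guess at what that could look like, not MĀRGA's parser: `RequestBudget`, `parse_budget`, and the "medium" default tier are assumptions. Only the three header names come from the examples above.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RequestBudget:
    cost_tier: str = "medium"                # assumed default when no hint is sent
    quality_min: Optional[float] = None      # minimum acceptable quality score
    latency_budget_ms: Optional[int] = None  # end-to-end latency budget


def parse_budget(headers: Mapping[str, str]) -> RequestBudget:
    """Build a per-request budget from HTTP headers (names are case-insensitive)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    quality = lowered.get("x-quality-min")
    latency = lowered.get("x-latency-budget")
    return RequestBudget(
        cost_tier=lowered.get("x-cost-tier", "medium"),
        quality_min=float(quality) if quality is not None else None,
        latency_budget_ms=int(latency) if latency is not None else None,
    )


scan_budget = parse_budget({
    "X-Cost-Tier": "premium",
    "X-Quality-Min": "0.95",
    "X-Latency-Budget": "30000",
})
build_budget = parse_budget({"X-Cost-Tier": "local", "X-Latency-Budget": "500"})
```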

When we first built MĀRGA, we had one quality setting per service. RAKṢĀ always used Opus. The build engine always used Sonnet. This was simple but wasteful — RAKṢĀ was spending $0.10 per call to classify files that a 4B model could handle, and the build engine was over-paying for simple task descriptions.

Switching to per-request quality budgets reduced our total inference cost by another 25% on top of the tiered routing savings.

Common mistake: Teams build A/B testing for model quality at the wrong level. They A/B test the system: “all requests go to GPT-4 this week, Sonnet next week.” The variance is too high to measure anything. A/B test at the request class level: “classification tasks go to Haiku vs. local model.” That’s a comparison you can actually measure.


Lesson 6: Distributed AI Systems Fail at the Boundaries, Not the Centers

Our services rarely fail because of a bug in the LLM call or the business logic. They fail at the boundaries — the seams between services, the edges where your system meets the outside world.

Here are the top 5 production incidents from our first 90 days, all boundary failures:

1. Tailscale node goes offline mid-task. A MacBook lid closed during a build task. The build engine kept polling the node, the task appeared “in progress” for 6 hours, and downstream tasks starved. Fix: Heartbeat monitoring with 60-second timeout. If a node misses 3 heartbeats, all its tasks get reassigned.

2. OpenAI rate limit cascading to Anthropic. When OpenAI rate-limited us, MĀRGA correctly failed over to Anthropic. But the sudden traffic spike hit Anthropic’s rate limit too, causing a cascade that left zero available providers for 4 minutes. Fix: Exponential backoff with jitter per provider. When failing over, MĀRGA now ramps traffic to the backup provider gradually (10% → 25% → 50% → 100% over 30 seconds) instead of sending 100% immediately.

3. Vector store index drift. A runbook was updated but the re-indexing webhook failed silently (Pinecone returned 200 but didn’t actually index the new chunks). DevOps RAG served stale content for 3 days. Fix: Post-index verification — after indexing, query for the new chunks and verify they exist. Plus a daily full reconciliation job that compares Git HEAD with the index manifest.

4. Circuit breaker stuck open. A transient network issue caused MĀRGA’s circuit breaker for OpenAI to trip. The half-open probe used a request that happened to trigger content moderation rejection (returned 400, not a real provider failure). The circuit interpreted this as “provider still failing” and stayed open for the full cooldown. Fix: Probes use a known-good minimal request ("Say hi") instead of replaying the request that triggered the trip.

5. Token count estimation mismatch. MĀRGA estimates token counts to select the right model tier. Our tokenizer used cl100k_base (OpenAI’s encoding). But when routing to Anthropic, the actual token count was different (Claude uses a different tokenizer), causing requests to exceed context limits. Fix: Per-provider tokenizer selection. When routing to Claude, use Claude’s tokenizer for the limit check.

Every one of these failures was at a boundary: service-to-service, service-to-provider, service-to-infrastructure. The business logic inside each service worked perfectly. The connections between them didn’t.
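The gradual ramp from incident 2 (10% → 25% → 50% → 100% over 30 seconds) can be expressed as a step function of time since failover began. This is a sketch under one stated assumption: the steps are spaced evenly at 0, 10, 20, and 30 seconds, which the description above does not specify. `RAMP_STEPS` and `backup_traffic_fraction` are hypothetical names.

```python
# (seconds since failover started, share of traffic sent to the backup provider)
RAMP_STEPS = [(0.0, 0.10), (10.0, 0.25), (20.0, 0.50), (30.0, 1.00)]


def backup_traffic_fraction(elapsed_s: float) -> float:
    """Return the share of traffic the backup provider should receive."""
    fraction = 0.0  # before failover starts, the backup gets nothing
    for start_s, value in RAMP_STEPS:
        if elapsed_s >= start_s:
            fraction = value
    return fraction
```

A router would compare a uniform random draw against this fraction for each request and queue or shed whatever does not go to the backup.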

The lesson: Spend 70% of your reliability engineering effort on boundaries. Circuit breakers, retries, timeouts, heartbeats, reconciliation jobs. The center is boring and reliable. The edges are where production breaks.
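To illustrate the circuit breaker fix from incident 4, here is a minimal breaker whose half-open state runs a fixed probe rather than replaying user traffic. The class, thresholds, and injectable clock are assumptions made for the sketch, not MĀRGA's implementation.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after repeated failures; recovers via a fixed, known-good probe."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def allow_request(self, probe: Callable[[], bool]) -> bool:
        """Return True if traffic may flow; `probe` sends a minimal test request."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at < self.cooldown_s:
            return False  # still cooling down
        # Half-open: test the provider with the fixed probe, not a user request.
        if probe():
            self.record_success()
            return True
        self.opened_at = self.clock()  # probe failed, restart the cooldown
        return False
```

Keeping the probe separate from user requests means a content-moderation rejection on one prompt cannot be mistaken for a provider outage.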


Lesson 7: Observability Is the Product (Not a Feature)

When your platform is built by AI agents and run by AI agents, observability stops being a “DevOps concern” and becomes the core feedback loop that determines whether your system improves or stagnates.

We instrument everything through Datadog. Every LLM request through MĀRGA generates a trace with:

  • Provider, model, and routing decision
  • Input/output token counts and estimated cost
  • Latency breakdown (routing time, provider time, total time)
  • Quality tier and A/B test assignment
  • Circuit breaker state at request time

This data doesn’t just power dashboards. It powers the agents that improve the system.

┌───────────┐     ┌─────────┐     ┌──────────────┐
│ Agent     │────▶│ MĀRGA   │────▶│ Datadog APM  │
│ writes    │     │ routes  │     │ traces +     │
│ code      │     │ request │     │ metrics      │
└───────────┘     └─────────┘     └──────┬───────┘
                                         │
                                         ▼
                                  ┌──────────────┐
                                  │ Datadog MCP  │
                                  │ server       │
                                  └──────┬───────┘
                                         │
                                         ▼
                                  ┌──────────────┐
                                  │ Agent reads  │
                                  │ traces,      │
                                  │ identifies   │
                                  │ issues,      │
                                  │ writes fix   │
                                  └──────────────┘

The Datadog MCP integration is what closes the loop. When a coding agent is working on MĀRGA, it can query production metrics directly:

Agent: "What's the p95 latency for MĀRGA routing decisions
        over the last 24 hours?"
→ Datadog MCP → traces tool → aggregate by operation → return stats

Agent: "Which provider has the highest error rate this week?"
→ Datadog MCP → metrics tool → trace.marga.request.errors
  by provider → return ranking

The agent uses this production data to make better code decisions. If the p95 latency for the cost optimizer is 45ms (our budget is 20ms), the agent can identify the bottleneck, profile the code, and ship a fix — all without a human looking at a dashboard.
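The budget check in that example boils down to a percentile over span durations. A minimal sketch, assuming the latencies have already been pulled from the tracing backend as a list of milliseconds; `percentile` (nearest-rank method) and `over_budget` are hypothetical helpers, and the 20ms default mirrors the budget mentioned above.

```python
def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def over_budget(latencies_ms: list[float], budget_ms: float = 20.0, pct: int = 95) -> bool:
    """True when the chosen percentile exceeds the latency budget."""
    return percentile(latencies_ms, pct) > budget_ms


# Illustrative sample: most calls are fast, the tail is slow.
sample_latencies = [10.0] * 18 + [45.0, 50.0]
```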

What most people miss: Most teams treat observability as a cost center — something you grudgingly pay for to avoid being blind during incidents. When AI agents consume your telemetry, observability becomes a revenue driver. Better telemetry → better agent decisions → better product → more value. It’s the difference between Datadog as a monitoring tool and Datadog as a development accelerator.

Our actual Datadog setup produces:

  • ~15,000 spans/day across all services
  • Cost attribution per model, per service, per customer
  • Automated anomaly detection on latency and error rates
  • Real-time routing decision auditing (why did MĀRGA pick this provider?)

Every dollar we spend on observability pays for itself in faster debugging, better routing decisions, and agents that can self-correct based on production data.


The Numbers: What Scale Actually Looks Like for a Two-Person Company

Let’s get concrete. Here are our actual production numbers after 90 days:

Metric                                     | Value
-------------------------------------------+---------------------
Daily LLM calls (all services)             | ~2,400
Daily inference cost (after optimization)  | ~$45
Daily inference cost (before MĀRGA)        | ~$150
Cost reduction                             | 70%
Requests routed to free local models       | 62%
MĀRGA p50 routing latency                  | 8ms
MĀRGA p95 routing latency                  | 23ms
Provider failovers per week                | 3-5
Mean failover time                         | 1.2s
RAKṢĀ scans per day                        | ~45
Security findings caught pre-deploy        | 12-15/week
DevOps RAG queries per day                 | ~80
RAG answer quality (task completion)       | 89%
Build engine tasks shipped per day         | 8-12
Services in production                     | 5
Cloud Run instances (peak)                 | 8
Cloud Run instances (idle)                 | 0 (scale to zero)
Monthly infrastructure cost                | ~$180
Team size                                  | 2

The most revealing number is the ratio: $45/day in inference, $6/day in infrastructure ($180/30), shipping 8-12 tasks/day across 5 codebases. That’s the leverage of AI-native architecture done right.
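For readers who want to check the arithmetic behind those ratios, the figures from the table work out as follows. The variable names are illustrative, and the 30-day month is the same simplification used above.

```python
before_per_day = 150.0   # daily inference cost before MĀRGA (USD)
after_per_day = 45.0     # daily inference cost after optimization (USD)
infra_per_month = 180.0  # monthly infrastructure cost (USD)

cost_reduction = (before_per_day - after_per_day) / before_per_day
infra_per_day = infra_per_month / 30
inference_per_month = after_per_day * 30
total_per_month = inference_per_month + infra_per_month

print(f"reduction: {cost_reduction:.0%}")
print(f"infrastructure per day: ${infra_per_day:.2f}")
print(f"inference + infrastructure per month: ${total_per_month:,.0f}")
```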

For comparison, a similar setup with a traditional team would require:

  • 3-5 engineers ($50K-80K/month loaded)
  • Static infrastructure always running (~$500-1000/month)
  • 2-week sprint cycles instead of continuous delivery

We’re not saying AI replaces engineers. We’re saying it changes the economics enough that two people can operate what used to require a team.


The Meta-Lesson: Build for the Architecture You’ll Need, Not the One You Have

Every lesson above has a common thread: the decisions that seem premature today become critical infrastructure tomorrow.

Building a routing layer when you only use one provider? Premature — until that provider has an outage. Adding per-request quality budgets when you have 100 requests/day? Over-engineering — until you have 2,400 requests/day and your bill triples. Implementing hybrid retrieval for a RAG system with 50 documents? Overkill — until you have 500 documents and accuracy drops to 64%.

The art is distinguishing “premature optimization” from “architectural foresight.” Our heuristic: if a decision will cost 10x more to retrofit than to build in, build it in. Provider abstraction, per-request quality routing, and hybrid retrieval all meet that bar. A custom Kubernetes operator does not.

Start with the simplest thing that works. But make the simple thing extensible in the directions you know you’ll need to go. MĀRGA started as a 200-line Go proxy that just round-robined between OpenAI and Anthropic. DevOps RAG started as a Python script that embedded files and called Pinecone. RAKṢĀ started as a Semgrep wrapper.

Each one grew into a production service because the initial design had the right extension points — even when those extension points weren’t used yet.

Build the foundation right. The building rises on its own.


Try It Yourself

Every product mentioned in this article is available:

  • MĀRGA — LLM routing with cost optimization, failover, and observability. GitHub | docker pull ghcr.io/avyay-ai/marga
  • RAKṢĀ — Security scanning for AI-generated code. pip install raksha-cli | GitHub Action
  • DevOps RAG — Operational knowledge via MCP. Works with Claude Code, Cursor, and OpenClaw.
  • Full Platform: avyay.ai | Join the alpha

Gaurav Sharma is the founder of Avyay (अव्यय). He builds distributed AI systems on consumer hardware and ships code at 3 AM — not because he’s awake, but because the build engine doesn’t sleep. Follow the build at avyay.ai/blog.
