How We Cut LLM Costs 73% While Improving Reliability

Cost reduction: 73%
Uptime: 99.97%
Latency improvement: 40%
Providers: 4

The Challenge: Direct API Calls and $2,400/month

Six months ago, our AI infrastructure looked like every other startup’s: direct OpenAI API calls scattered across 15 different files, a single provider with zero failover, and a monthly bill that kept climbing.

The symptoms were predictable:

Cost explosion. Every request — from simple text classification to complex multi-step reasoning — went to GPT-4. Simple queries that a $0 local model could handle were burning $15/million tokens.
2 AM 500 errors. OpenAI goes down roughly once every two weeks. With a single provider, every outage meant complete service failure. No fallback. No recovery. Just 500s until the provider came back.
Zero visibility. We had no idea which endpoints were expensive, which queries were simple enough for cheaper models, or what our actual cost-per-request looked like. The monthly invoice was a black box.
Vendor lock-in. Switching from OpenAI to Anthropic meant touching every file that made an API call. Model evaluation was impossible without weeks of refactoring.

“We were paying premium prices for every request, getting zero redundancy, and had no way to even measure the problem. The classic trifecta of engineering debt.”

The Solution: MĀRGA Intelligent LLM Router

MĀRGA (मार्ग — Sanskrit for “path”) sits between your application and every AI provider. Instead of calling OpenAI directly, you call MĀRGA — and it makes intelligent routing decisions based on cost, latency, capability, and provider health.

Architecture: One Endpoint, Four Providers

Application Request
       │
       ▼
┌──────────────────────────────────────────────┐
│              MĀRGA Router                     │
│                                               │
│  ┌─────────────┐   ┌──────────────────────┐  │
│  │   Classify   │──▶│  Provider Selection  │  │
│  │   Request    │   │                      │  │
│  │   Complexity │   │  Cost × Latency ×    │  │
│  └─────────────┘   │  Capability × Health  │  │
│                     └──────────┬───────────┘  │
│                                │              │
│  ┌─────────────────────────────┘              │
│  │                                            │
│  ├── Tier 1 (Simple) ──▶ Ollama Local ($0)   │
│  ├── Tier 2 (Medium) ──▶ Claude Sonnet ($3)  │
│  ├── Tier 3 (Complex) ──▶ Claude Opus ($15)  │
│  └── Failover ──────────▶ Next healthy provider │
│                                               │
│  Circuit Breaker: 5 failures → OPEN → 30s    │
│  Health Check: every 30 seconds               │
└──────────────────────────────────────────────┘

The key insight: 80% of LLM calls in a production application are simple enough for a $0 local model. Text classification, entity extraction, simple summarization, format conversion — none of these need GPT-4 or Claude Opus.

Drop-In Replacement: A One-Line Change

MĀRGA exposes an OpenAI-compatible API. Migrating from direct API calls means changing only the client constructor:

from openai import OpenAI

# Before: Direct OpenAI call
# client = OpenAI(api_key="sk-...")

# After: MĀRGA-routed (one line change)
client = OpenAI(
    base_url="https://marga-cn4wkmlbva-as.a.run.app/v1",
    api_key="your-marga-key"
)

# Let MĀRGA choose the optimal model
response = client.chat.completions.create(
    model="auto",  # Intelligent routing
    messages=[{"role": "user", "content": "Classify this support ticket..."}]
)

# Or pin a specific model with automatic failover
response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",  # Failover within model class
    messages=[{"role": "user", "content": "Analyze this contract..."}]
)

The model="auto" parameter activates MĀRGA’s routing engine. It estimates query complexity from input length, system prompt presence, and request metadata, then routes to the cheapest model capable of handling it.
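To make the routing idea concrete, here is a minimal sketch of a complexity classifier built on the three signals named above. It is not MĀRGA's actual implementation: the character thresholds, the `tier_hint` metadata field, and the names `classify` and `RoutingDecision` are all assumptions chosen for illustration, and the target labels come from the architecture diagram rather than real model IDs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical thresholds, chosen only for illustration; the post does not
# publish MĀRGA's actual cutoffs.
SIMPLE_MAX_CHARS = 500
MEDIUM_MAX_CHARS = 4000

# Labels taken from the architecture diagram, not API model IDs.
TIER_TARGETS = {
    "simple": "ollama-local",
    "medium": "claude-sonnet",
    "complex": "claude-opus",
}


@dataclass
class RoutingDecision:
    tier: str    # "simple", "medium", or "complex"
    target: str  # where requests of that tier are sent


def classify(messages: List[Dict[str, str]],
             metadata: Optional[Dict[str, str]] = None) -> RoutingDecision:
    """Estimate complexity from input length, system prompt, and metadata."""
    metadata = metadata or {}
    user_chars = sum(len(m["content"]) for m in messages if m["role"] != "system")
    has_system = any(m["role"] == "system" for m in messages)

    # An explicit hint in request metadata overrides the heuristic
    # ("tier_hint" is an assumed field name).
    tier = metadata.get("tier_hint")
    if tier not in TIER_TARGETS:
        if user_chars <= SIMPLE_MAX_CHARS and not has_system:
            tier = "simple"
        elif user_chars <= MEDIUM_MAX_CHARS:
            tier = "medium"
        else:
            tier = "complex"
    return RoutingDecision(tier=tier, target=TIER_TARGETS[tier])


decision = classify([{"role": "user", "content": "Classify this support ticket..."}])
print(decision.tier, decision.target)
```

A heuristic this crude will misroute some requests, which is why the thresholds should be tuned against real traffic rather than picked up front.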

Four Providers, One Routing Table

We configured MĀRGA with four providers spanning cloud and local inference:

Provider	Models	Cost / 1M Tokens	Use Case
Ollama (Mac M1)	Qwen3 27B, Devstral 24B	$0	Simple classification, extraction
Ollama (Linux)	DeepSeek R1 7B	$0	Reasoning tasks, code review
Anthropic	Claude Sonnet, Haiku	$3 – $8	Medium complexity, content generation
OpenAI	GPT-4o, GPT-4o-mini	$2.50 – $10	Fallback, specialized tasks

Circuit Breakers: How We Achieved 99.97% Uptime

MĀRGA implements a three-state circuit breaker per provider. This is the mechanism that turned our single-point-of-failure architecture into one with near-perfect uptime:

CLOSED (normal)
    │
    │ 5 consecutive failures OR
    │ >50% error rate in 60s window
    │
    ▼
OPEN (provider marked unhealthy)
    │
    │ 30-second cooldown
    │
    ▼
HALF-OPEN (testing recovery)
    │
    ├── Probe succeeds → CLOSED (resume traffic)
    └── Probe fails → OPEN (restart cooldown)

When OpenAI returns 5 consecutive errors, MĀRGA stops sending traffic within milliseconds — not after a timeout. Requests automatically route to Anthropic or local Ollama instances. When OpenAI recovers, MĀRGA tests with a single probe request before resuming full traffic.
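The state machine above can be sketched in a few dozen lines of Python. This is an illustrative sketch, not MĀRGA's code: the class and method names are ours, the `min_samples` guard on the error-rate check is an added assumption (without it, a single failed request would count as a 100% error rate), and the clock is injectable so the behavior can be tested without waiting 30 seconds.

```python
import time
from collections import deque


class CircuitBreaker:
    """Three-state breaker: CLOSED -> OPEN -> HALF-OPEN -> CLOSED or OPEN."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 error_rate_threshold=0.5, window_size=60.0,
                 min_samples=10, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.error_rate_threshold = error_rate_threshold
        self.window_size = window_size
        self.min_samples = min_samples  # assumption: not in the post's config
        self.clock = clock
        self.state = "closed"
        self.consecutive_failures = 0
        self.opened_at = 0.0
        self.window = deque()  # (timestamp, succeeded) pairs

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
        self.consecutive_failures = 0
        self.window.clear()

    def _close(self):
        self.state = "closed"
        self.consecutive_failures = 0
        self.window.clear()

    def allow_request(self):
        """Return True if a request may be sent to this provider right now."""
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                return False
            self.state = "half_open"  # cooldown elapsed: admit one probe
            return True
        if self.state == "half_open":
            return False  # a probe is already in flight
        return True

    def record(self, succeeded):
        """Report the outcome of a request to this provider."""
        now = self.clock()
        if self.state == "open":
            return  # late result from before the trip; ignore it
        if self.state == "half_open":
            if succeeded:
                self._close()
            else:
                self._open(now)  # probe failed: restart the cooldown
            return
        self.window.append((now, succeeded))
        while self.window and now - self.window[0][0] > self.window_size:
            self.window.popleft()
        self.consecutive_failures = 0 if succeeded else self.consecutive_failures + 1
        failures = sum(1 for _, ok in self.window if not ok)
        rate_tripped = (len(self.window) >= self.min_samples
                        and failures / len(self.window) > self.error_rate_threshold)
        if self.consecutive_failures >= self.failure_threshold or rate_tripped:
            self._open(now)
```

Because `allow_request()` is a local state check, a tripped breaker rejects traffic immediately instead of waiting for each request to time out. The sketch is not thread-safe; a production version would guard state changes with a lock.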

The configuration that makes this work:

providers:
  - name: openai
    endpoint: https://api.openai.com/v1
    circuit_breaker:
      failure_threshold: 5       # Trip after 5 failures
      recovery_timeout: 30s      # Test recovery after 30s
      error_rate_threshold: 0.5  # Or 50% errors in window
      window_size: 60s           # Error rate window

  - name: anthropic
    endpoint: https://api.anthropic.com/v1
    priority: 2                  # Primary failover target
    circuit_breaker:
      failure_threshold: 5
      recovery_timeout: 30s

  - name: ollama-mac
    endpoint: http://100.x.x.x:11434
    priority: 3                  # Local fallback
    circuit_breaker:
      failure_threshold: 3       # Lower threshold for local
      recovery_timeout: 15s

health_check:
  interval: 30s                  # Probe all providers every 30s
  timeout: 5s

In practice, this means: when OpenAI has an outage at 2 AM, our users don’t notice. Requests route to Anthropic in under 100ms. The circuit breaker trips, the health check monitors recovery, and traffic resumes when the provider is healthy — all without human intervention.
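The failover step itself is simple once each provider carries a health flag. The sketch below is one way it could work, under stated assumptions: `Provider`, `pick_provider`, and `call_with_failover` are hypothetical names, a lower `priority` number is tried first, and OpenAI is given priority 1 even though the config above does not set it explicitly.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    priority: int   # lower number is tried first (assumed convention)
    healthy: bool   # False while the provider's circuit breaker is OPEN


def pick_provider(providers, exclude=()):
    """Return the highest-priority healthy provider, or None if all are down."""
    candidates = [p for p in providers if p.healthy and p.name not in exclude]
    return min(candidates, key=lambda p: p.priority, default=None)


def call_with_failover(providers, send):
    """Try healthy providers in priority order until one call succeeds."""
    tried = []
    while (provider := pick_provider(providers, exclude=tried)) is not None:
        try:
            return provider.name, send(provider)
        except Exception:
            tried.append(provider.name)  # fall through to the next provider
    raise RuntimeError("no healthy provider available")


providers = [
    Provider("openai", priority=1, healthy=False),  # breaker is OPEN
    Provider("anthropic", priority=2, healthy=True),
    Provider("ollama-mac", priority=3, healthy=True),
]
print(pick_provider(providers).name)
```

Here `send` stands in for whatever function actually performs the HTTP call; in a real router its failures would also be reported to that provider's circuit breaker.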

The Results: Real Numbers

Cost Reduction: $2,400 → $648/month

The 73% cost reduction came from one routing rule: stop sending simple queries to expensive models.

Metric	Before MĀRGA	After MĀRGA	Change
Monthly LLM spend	$2,400	$648	-73%
Avg cost per request	$0.024	$0.0065	-73%
Requests to local models	0%	62%	—
Uptime	~98.5%	99.97%	+1.47 pts
P50 latency	1,200ms	720ms	-40%
Provider outage incidents	3/month (visible)	0 (auto-routed)	-100%
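The figures in the table can be checked with a few lines of arithmetic. The snippet below uses only the table's numbers. The implied request volume is a derived figure that the table does not state, and the uptime change is computed as a difference in percentage points rather than a relative percentage.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


spend_change = pct_change(2400, 648)             # monthly LLM spend, USD
per_request_change = pct_change(0.024, 0.0065)   # avg cost per request, USD
latency_change = pct_change(1200, 720)           # P50 latency, ms
uptime_gain_points = 99.97 - 98.5                # percentage points

# Request volume implied by spend / cost per request (derived, not stated).
implied_requests_before = 2400 / 0.024
implied_requests_after = 648 / 0.0065

print(f"{spend_change:.1f}% spend, {per_request_change:.1f}% per request")
print(f"{latency_change:.1f}% P50 latency, +{uptime_gain_points:.2f} points uptime")
print(f"~{implied_requests_before:,.0f} requests/month before, "
      f"~{implied_requests_after:,.0f} after")
```

The spend and per-request figures agree with each other, which is what you would expect if request volume stayed roughly constant across the two periods.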

Latency: Faster Because Local Models Are Fast

The 40% latency improvement was a surprise. We expected cost savings but didn’t expect routing to make things faster. The reason: local Ollama models on the Mac M1 (64GB RAM) respond in 200-400ms for simple queries, compared to 800-1200ms for cloud API round trips.

With 62% of requests routed locally, P50 latency fell from 1,200ms to 720ms, even though complex queries still go to cloud providers at their normal speed.

Full Observability with Datadog APM

MĀRGA ships with native Datadog APM integration. Every LLM request produces a trace with:

Provider selected and why (cost tier, health status, routing rule)
Token counts — input, output, and total per request
Cost attribution — actual dollar cost per request, per endpoint, per service
Latency breakdown — routing decision time vs provider response time
Circuit breaker events — every trip, recovery, and failover logged

// Datadog trace example
{
  "service": "marga-router",
  "operation": "llm.completion",
  "duration_ms": 340,
  "meta": {
    "marga.provider": "ollama-mac",
    "marga.model": "qwen3:27b",
    "marga.routing_tier": "simple",
    "marga.cost_usd": "0.000",
    "marga.tokens_in": 156,
    "marga.tokens_out": 89,
    "marga.routing_time_ms": 2,
    "marga.circuit_state": "closed"
  }
}

This visibility is what makes the cost optimization sustainable. Without it, you’re guessing which queries are “simple enough” for cheap models. With Datadog traces, you can see exactly which routing decisions are being made and adjust thresholds with data.
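As a rough illustration of adjusting thresholds with data, the sketch below aggregates exported traces by routing tier. It assumes the traces have been exported as dictionaries in the shape of the example above; the function name `summarize_by_tier` and the sample values are made up for illustration and are not real measurements.

```python
from collections import defaultdict


def summarize_by_tier(traces):
    """Aggregate request share, cost, and mean latency per routing tier."""
    buckets = defaultdict(lambda: {"requests": 0, "cost_usd": 0.0, "duration_ms": 0})
    for trace in traces:
        meta = trace["meta"]
        bucket = buckets[meta["marga.routing_tier"]]
        bucket["requests"] += 1
        bucket["cost_usd"] += float(meta["marga.cost_usd"])  # stored as a string
        bucket["duration_ms"] += trace["duration_ms"]

    total = len(traces)
    return {
        tier: {
            "share": b["requests"] / total,
            "cost_usd": round(b["cost_usd"], 6),
            "avg_duration_ms": b["duration_ms"] / b["requests"],
        }
        for tier, b in buckets.items()
    }


# Made-up sample traces, for illustration only.
sample = [
    {"duration_ms": 340, "meta": {"marga.routing_tier": "simple", "marga.cost_usd": "0.000"}},
    {"duration_ms": 260, "meta": {"marga.routing_tier": "simple", "marga.cost_usd": "0.000"}},
    {"duration_ms": 900, "meta": {"marga.routing_tier": "medium", "marga.cost_usd": "0.004"}},
    {"duration_ms": 1000, "meta": {"marga.routing_tier": "complex", "marga.cost_usd": "0.020"}},
]
for tier, stats in summarize_by_tier(sample).items():
    print(tier, stats)
```

If the share of traffic in the simple tier looks low, or its failure rate looks high, that is the signal to move the classification thresholds.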

Get Started with MĀRGA

MĀRGA is available as a Docker image and a managed service on Google Cloud Run:

# Pull and run locally
docker pull ghcr.io/gaurav21/marga:latest
docker run -p 8080:8080 \
  -e OPENAI_API_KEY=sk-... \
  -e ANTHROPIC_API_KEY=sk-ant-... \
  -v "$(pwd)/config.yaml:/app/config.yaml" \
  ghcr.io/gaurav21/marga:latest

# Or use the managed endpoint
curl https://marga-cn4wkmlbva-as.a.run.app/v1/chat/completions \
  -H "Authorization: Bearer your-key" \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"Hello"}]}'

Live endpoint: marga-cn4wkmlbva-as.a.run.app
Documentation: docs.avyay.ai/marga
GitHub: github.com/gaurav21/avyay-marga

Gaurav Sharma is the founder of Avyay (अव्यय). MĀRGA is one of five microservices in the Avyay platform. Read the full architecture deep-dive at avyay.ai/blog/avyay-architecture.