The Challenge: Direct API Calls and $2,400/month
Six months ago, our AI infrastructure looked like every other startup’s: direct OpenAI API calls scattered across 15 different files, a single provider with zero failover, and a monthly bill that kept climbing.
The symptoms were predictable:
- Cost explosion. Every request — from simple text classification to complex multi-step reasoning — went to GPT-4. Simple queries that a $0 local model could handle were burning $15/million tokens.
- 2 AM 500 errors. OpenAI goes down roughly once every two weeks. With a single provider, every outage meant complete service failure. No fallback. No recovery. Just 500s until the provider came back.
- Zero visibility. We had no idea which endpoints were expensive, which queries were simple enough for cheaper models, or what our actual cost-per-request looked like. The monthly invoice was a black box.
- Vendor lock-in. Switching from OpenAI to Anthropic meant touching every file that made an API call. Model evaluation was impossible without weeks of refactoring.
“We were paying premium prices for every request, getting zero redundancy, and had no way to even measure the problem. The classic trifecta of engineering debt.”
The Solution: MĀRGA Intelligent LLM Router
MĀRGA (मार्ग — Sanskrit for “path”) sits between your application and every AI provider. Instead of calling OpenAI directly, you call MĀRGA — and it makes intelligent routing decisions based on cost, latency, capability, and provider health.
Architecture: One Endpoint, Four Providers
Application Request
        │
        ▼
┌──────────────────────────────────────────────────┐
│                   MĀRGA Router                   │
│                                                  │
│  ┌─────────────┐   ┌──────────────────────┐      │
│  │  Classify   │──▶│  Provider Selection  │      │
│  │  Request    │   │                      │      │
│  │  Complexity │   │  Cost × Latency ×    │      │
│  └─────────────┘   │  Capability × Health │      │
│                    └──────────┬───────────┘      │
│                               │                  │
│  ┌────────────────────────────┘                  │
│  │                                               │
│  ├── Tier 1 (Simple)  ──▶ Ollama Local ($0)      │
│  ├── Tier 2 (Medium)  ──▶ Claude Sonnet ($3)     │
│  ├── Tier 3 (Complex) ──▶ Claude Opus ($15)      │
│  └── Failover ──────────▶ Next healthy provider  │
│                                                  │
│  Circuit Breaker: 5 failures → OPEN → 30s        │
│  Health Check: every 30 seconds                  │
└──────────────────────────────────────────────────┘

The key insight: 80% of LLM calls in a production application are simple enough for a $0 local model. Text classification, entity extraction, simple summarization, format conversion: none of these need GPT-4 or Claude Opus.
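The tier-to-provider step in the diagram can be sketched roughly as follows. This is a minimal illustration, not MĀRGA's actual internals: `Provider`, `select_provider`, and the tier names are hypothetical, the prices are the per-1M-token figures from the diagram, and the latency factor is left out for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass

# Capability tiers in ascending order; the names are assumptions for illustration.
TIERS = ["simple", "medium", "complex"]


@dataclass
class Provider:
    name: str
    max_tier: str          # most complex tier this provider can serve
    cost_per_mtok: float   # USD per 1M tokens, taken from the diagram
    healthy: bool = True   # would be driven by the circuit breaker and health checks


def select_provider(tier: str, providers: list[Provider]) -> Provider:
    """Return the cheapest healthy provider capable of serving `tier`."""
    needed = TIERS.index(tier)
    candidates = [
        p for p in providers
        if p.healthy and TIERS.index(p.max_tier) >= needed
    ]
    if not candidates:
        raise RuntimeError(f"no healthy provider can serve tier {tier!r}")
    return min(candidates, key=lambda p: p.cost_per_mtok)


providers = [
    Provider("ollama-local", "simple", 0.0),
    Provider("claude-sonnet", "medium", 3.0),
    Provider("claude-opus", "complex", 15.0),
]
```

With this table, a simple request goes to `ollama-local`. If that provider is marked unhealthy, the same call falls through to `claude-sonnet`, which is the "next healthy provider" behavior shown in the failover row.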
Drop-In Replacement: A One-Line Change
MĀRGA exposes an OpenAI-compatible API. Migrating from direct API calls means changing only the client constructor:
from openai import OpenAI

# Before: Direct OpenAI call
# client = OpenAI(api_key="sk-...")

# After: MĀRGA-routed (one line change)
client = OpenAI(
    base_url="https://marga-cn4wkmlbva-as.a.run.app/v1",
    api_key="your-marga-key"
)

# Let MĀRGA choose the optimal model
response = client.chat.completions.create(
    model="auto",  # Intelligent routing
    messages=[{"role": "user", "content": "Classify this support ticket..."}]
)

# Or pin a specific model with automatic failover
response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",  # Failover within model class
    messages=[{"role": "user", "content": "Analyze this contract..."}]
)

The model="auto" parameter activates MĀRGA’s routing engine. It estimates query complexity from input length, system prompt presence, and request metadata, then routes to the cheapest model capable of handling it.
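To make the complexity estimate concrete, here is a rough sketch of a heuristic that uses the three signals named above. The function name `estimate_tier`, the character thresholds, and the `tier` metadata key are all assumptions for illustration; the actual routing engine may weigh these signals quite differently.

```python
from typing import Optional

# Hypothetical thresholds; real values would be tuned from trace data.
SIMPLE_MAX_CHARS = 500
MEDIUM_MAX_CHARS = 4000


def estimate_tier(messages: list, metadata: Optional[dict] = None) -> str:
    """Guess a routing tier from input length, system prompt presence, and metadata."""
    metadata = metadata or {}
    # An explicit caller hint (assumed metadata key) overrides the heuristic.
    if metadata.get("tier") in ("simple", "medium", "complex"):
        return metadata["tier"]

    total_chars = sum(len(m.get("content") or "") for m in messages)
    has_system = any(m.get("role") == "system" for m in messages)

    # Short prompts without a system prompt are treated as simple.
    if total_chars <= SIMPLE_MAX_CHARS and not has_system:
        return "simple"
    if total_chars <= MEDIUM_MAX_CHARS:
        return "medium"
    return "complex"
```

Under these assumed thresholds, a short ticket-classification prompt lands in the simple tier, while the same prompt with a system message attached is bumped to medium.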
Four Providers, One Routing Table
We configured MĀRGA with four providers spanning cloud and local inference:
| Provider | Models | Cost / 1M Tokens | Use Case |
|---|---|---|---|
| Ollama (Mac M1) | Qwen3 27B, Devstral 24B | $0 | Simple classification, extraction |
| Ollama (Linux) | DeepSeek R1 7B | $0 | Reasoning tasks, code review |
| Anthropic | Claude Sonnet, Haiku | $3 – $8 | Medium complexity, content gen |
| OpenAI | GPT-4o, GPT-4o-mini | $2.50 – $10 | Fallback, specialized tasks |
Circuit Breakers: How We Achieved 99.97% Uptime
MĀRGA implements a three-state circuit breaker per provider. This is the mechanism that turned our single-point-of-failure architecture into one with near-perfect uptime:
CLOSED (normal)
    │
    │  5 consecutive failures OR
    │  >50% error rate in 60s window
    │
    ▼
OPEN (provider marked unhealthy)
    │
    │  30-second cooldown
    │
    ▼
HALF-OPEN (testing recovery)
    │
    ├── Probe succeeds → CLOSED (resume traffic)
    └── Probe fails → OPEN (restart cooldown)

When OpenAI returns 5 consecutive errors, MĀRGA stops sending traffic within milliseconds, not after a timeout. Requests automatically route to Anthropic or local Ollama instances. When OpenAI recovers, MĀRGA tests with a single probe request before resuming full traffic.
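The state machine above is small enough to sketch in a few lines. The class below is an illustrative implementation of the general circuit breaker pattern, not MĀRGA's source: the class and method names are hypothetical, it covers only the consecutive-failure trigger (the 60-second error-rate window is omitted), and the clock is injectable so the cooldown can be tested without sleeping.

```python
import time


class CircuitBreaker:
    """Three-state breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED or OPEN."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # cooldown in seconds
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Return True if a request may be sent to this provider right now."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"  # cooldown elapsed: let one probe through
                return True
            return False
        # In HALF_OPEN a probe is already in flight, so other traffic is held back.
        return self.state == "closed"

    def record_success(self) -> None:
        """A successful call (or probe) closes the circuit and resets the count."""
        self.failures = 0
        self.state = "closed"

    def record_failure(self) -> None:
        """Trip on a failed probe or once the consecutive-failure threshold is hit."""
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Because `allow_request` is a local state check, a tripped provider is skipped without issuing a network call, which is why traffic stops almost immediately instead of waiting on request timeouts.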
The configuration that makes this work:
providers:
  - name: openai
    endpoint: https://api.openai.com/v1
    circuit_breaker:
      failure_threshold: 5        # Trip after 5 failures
      recovery_timeout: 30s       # Test recovery after 30s
      error_rate_threshold: 0.5   # Or 50% errors in window
      window_size: 60s            # Error rate window
  - name: anthropic
    endpoint: https://api.anthropic.com/v1
    priority: 2                   # Primary failover target
    circuit_breaker:
      failure_threshold: 5
      recovery_timeout: 30s
  - name: ollama-mac
    endpoint: http://100.x.x.x:11434
    priority: 3                   # Local fallback
    circuit_breaker:
      failure_threshold: 3        # Lower threshold for local
      recovery_timeout: 15s
health_check:
  interval: 30s                   # Probe all providers every 30s
  timeout: 5s

In practice, this means: when OpenAI has an outage at 2 AM, our users don’t notice. Requests route to Anthropic in under 100ms. The circuit breaker trips, the health check monitors recovery, and traffic resumes when the provider is healthy, all without human intervention.
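The failover behavior described here amounts to walking the provider list in priority order and skipping anything whose breaker is open. A minimal sketch, assuming a hypothetical `Route` record whose `is_available` and `send` callables stand in for the breaker check and the real provider call:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    name: str
    priority: int                     # lower number = tried first
    is_available: Callable[[], bool]  # e.g. the circuit breaker's state check
    send: Callable[[str], str]        # performs the actual provider call


def complete_with_failover(routes, prompt):
    """Try providers in priority order, skipping any that are unavailable."""
    errors = []
    for route in sorted(routes, key=lambda r: r.priority):
        if not route.is_available():
            errors.append(f"{route.name}: circuit open")
            continue
        try:
            return route.name, route.send(prompt)
        except Exception as exc:  # a real router would narrow this and notify the breaker
            errors.append(f"{route.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The function returns the name of the provider that answered along with the response, so the caller can record which route was taken. Only when every route is unavailable or fails does the request surface as an error.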
The Results: Real Numbers
Cost Reduction: $2,400 → $648/month
The 73% cost reduction came from one routing rule: stop sending simple queries to expensive models.
| Metric | Before MĀRGA | After MĀRGA | Change |
|---|---|---|---|
| Monthly LLM spend | $2,400 | $648 | -73% |
| Avg cost per request | $0.024 | $0.0065 | -73% |
| Requests to local models | 0% | 62% | — |
| Uptime | ~98.5% | 99.97% | +1.47 pp |
| P50 latency | 1,200ms | 720ms | -40% |
| Provider outage incidents | 3/month (visible) | 0 (auto-routed) | -100% |
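For readers who want to check the table, the percentages can be re-derived from the raw figures. The snippet below is only arithmetic on the numbers above; the implied request volume and the downtime conversion (assuming a 30-day month) are derived values, not separately measured ones.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


spend_change = pct_change(2400.0, 648.0)     # about -73
cpr_change = pct_change(0.024, 0.0065)       # about -72.9
latency_change = pct_change(1200, 720)       # about -40

# Implied monthly request volume: spend divided by cost per request.
requests_before = 2400.0 / 0.024             # about 100,000 requests


def downtime_minutes(uptime_pct, days=30):
    """Convert an uptime percentage into minutes of downtime per period."""
    return (100 - uptime_pct) / 100 * days * 24 * 60


downtime_before = downtime_minutes(98.5)     # about 648 minutes (10.8 hours)
downtime_after = downtime_minutes(99.97)     # about 13 minutes
```

Expressed as downtime, the uptime row is the difference between roughly eleven hours and roughly thirteen minutes of unavailability per month.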
Latency: Faster Because Local Models Are Fast
The 40% latency improvement was a surprise. We expected cost savings but didn’t expect routing to make things faster. The reason: local Ollama models on the Mac M1 (64GB RAM) respond in 200-400ms for simple queries, compared to 800-1200ms for cloud API round trips.
When 62% of requests route locally, the average latency drops dramatically — even though complex queries still go to cloud providers at their normal speed.
Full Observability with Datadog APM
MĀRGA ships native Datadog APM integration. Every LLM request produces a trace with:
- Provider selected and why (cost tier, health status, routing rule)
- Token counts — input, output, and total per request
- Cost attribution — actual dollar cost per request, per endpoint, per service
- Latency breakdown — routing decision time vs provider response time
- Circuit breaker events — every trip, recovery, and failover logged
// Datadog trace example
{
  "service": "marga-router",
  "operation": "llm.completion",
  "duration_ms": 340,
  "meta": {
    "marga.provider": "ollama-mac",
    "marga.model": "qwen3:27b",
    "marga.routing_tier": "simple",
    "marga.cost_usd": "0.000",
    "marga.tokens_in": 156,
    "marga.tokens_out": 89,
    "marga.routing_time_ms": 2,
    "marga.circuit_state": "closed"
  }
}

This visibility is what makes the cost optimization sustainable. Without it, you’re guessing which queries are “simple enough” for cheap models. With Datadog traces, you can see exactly which routing decisions are being made and adjust thresholds with data.
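Per-request cost attribution of the kind shown in the trace can be computed from token counts and a price table. The sketch below is an assumption about how such a field might be derived: `PRICES`, `request_cost_usd`, `trace_meta`, and the `example-cloud-model` entry with its rates are hypothetical, and the cost is formatted to six decimals here so that small cloud requests do not round to zero.

```python
# Hypothetical price table in USD per 1M tokens as (input, output); not MĀRGA's actual rates.
PRICES = {
    "qwen3:27b": (0.0, 0.0),             # local Ollama model, no per-token charge
    "example-cloud-model": (3.0, 15.0),  # placeholder cloud model for illustration
}


def request_cost_usd(model, tokens_in, tokens_out):
    """Dollar cost of one request given its token counts."""
    price_in, price_out = PRICES[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


def trace_meta(provider, model, tier, tokens_in, tokens_out):
    """Build the tag dictionary attached to a trace, mirroring the field names above."""
    cost = request_cost_usd(model, tokens_in, tokens_out)
    return {
        "marga.provider": provider,
        "marga.model": model,
        "marga.routing_tier": tier,
        "marga.cost_usd": f"{cost:.6f}",
        "marga.tokens_in": tokens_in,
        "marga.tokens_out": tokens_out,
    }
```

Summing `marga.cost_usd` grouped by endpoint or service is then enough to answer the question the monthly invoice could not: which call sites are actually expensive.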
Get Started with MĀRGA
MĀRGA is available as a Docker image and a managed service on Google Cloud Run:
# Pull and run locally
docker pull ghcr.io/gaurav21/marga:latest
docker run -p 8080:8080 \
  -e OPENAI_API_KEY=sk-... \
  -e ANTHROPIC_API_KEY=sk-ant-... \
  -v "$(pwd)/config.yaml:/app/config.yaml" \
  ghcr.io/gaurav21/marga:latest

# Or use the managed endpoint
curl https://marga-cn4wkmlbva-as.a.run.app/v1/chat/completions \
  -H "Authorization: Bearer your-key" \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"Hello"}]}'

- Live endpoint: marga-cn4wkmlbva-as.a.run.app
- Documentation: docs.avyay.ai/marga
- GitHub: github.com/gaurav21/avyay-marga
Gaurav Sharma is the founder of Avyay (अव्यय). MĀRGA is one of five microservices in the Avyay platform. Read the full architecture deep-dive at avyay.ai/blog/avyay-architecture.