06 VĀNI · MĀRGA

Intelligent LLM Routing. Up to 80% Less Cost.

मार्गा · "The Path" (Sanskrit)

A small, fast classifier in front of your LLM stack. Every request hits the advisor first — trivial tasks go to tiny models, complex reasoning goes to frontier models. Same quality, fraction of the cost.


MĀRGA intelligent LLM routing visualization — requests flowing through the advisor to optimal model tiers

The Advisor Pattern

How MĀRGA Routes Requests

Like a hospital triage nurse — you don't send every patient to the head surgeon. MĀRGA classifies in under 10ms, then routes to the optimal model.

MĀRGA routing architecture — advisor classifies requests across 3 axes then routes to the optimal model tier

Request Flow
Incoming Request → MĀRGA Advisor (8ms) → Routing Table → Optimal Model

Tier	Routes to	Latency	Cost
Tier 0: Trivial	Local 4B model	~500ms	$0.00
Tier 1: Simple	7B–24B model	~2s	$0.001
Tier 2: Moderate	70B or mid-tier API	~5s	$0.01
Tier 3: Complex	Frontier model	~15s	$0.05
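
The tier list above can be read as a lookup table. The sketch below is only an illustration of that idea: the `Tier` dataclass, the `ROUTING_TABLE` name, the `route` function, and the fallback to the frontier tier are assumptions made here, not MĀRGA's actual internals. The model descriptions, latencies, and costs are copied from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    target: str               # model class named in the tier list
    typical_latency_s: float  # approximate latency quoted for the tier
    cost_usd: float           # cost figure quoted for the tier


# Values come from the tier list; the dict layout is a hypothetical choice.
ROUTING_TABLE = {
    0: Tier("Trivial", "Local 4B model", 0.5, 0.00),
    1: Tier("Simple", "7B–24B model", 2.0, 0.001),
    2: Tier("Moderate", "70B or mid-tier API", 5.0, 0.01),
    3: Tier("Complex", "Frontier model", 15.0, 0.05),
}


def route(tier: int) -> Tier:
    """Return the target for a tier number.

    Unknown tier numbers fall back to the frontier tier. That fallback is
    an assumption of this sketch, chosen so a bad classification errs on
    the side of quality rather than cost.
    """
    return ROUTING_TABLE.get(tier, ROUTING_TABLE[3])
```

For example, `route(0).target` is `"Local 4B model"`, and an out-of-range value such as `route(9)` resolves to the Tier 3 entry.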

⚡

8ms Classification

The advisor runs on a tiny 4B model. Classification takes 5–8ms on modest hardware, which is negligible next to the ~500ms to ~15s response times of the models it routes to.

🎯

3-Axis Routing

Every request is classified on task type, complexity, and latency requirements. Three dimensions are enough — more adds noise.
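
As a rough illustration of three-axis classification, the sketch below maps a (task type, complexity, latency) triple to a tier number from 0 to 3. Everything in it is hypothetical: the `Classification` fields, the task-type labels, the 0 to 3 complexity scale, and both rules in `pick_tier` are assumptions for the sake of the example, not the advisor's real logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    task_type: str           # hypothetical labels, e.g. "lookup", "summarize", "analysis"
    complexity: int          # hypothetical scale: 0 (trivial) to 3 (complex)
    latency_sensitive: bool  # True when the caller needs a fast answer


def pick_tier(c: Classification) -> int:
    """Map a three-axis classification to a tier number in 0..3."""
    tier = c.complexity
    # Assumed rule: open-ended analysis never drops below the moderate tier.
    if c.task_type == "analysis":
        tier = max(tier, 2)
    # Assumed rule: latency-sensitive requests step down one tier,
    # except complex requests, which stay on the frontier tier.
    if c.latency_sensitive and 0 < tier < 3:
        tier -= 1
    # Clamp in case the complexity value is out of range.
    return max(0, min(tier, 3))
```

With these rules, a latency-sensitive summary of moderate complexity lands on tier 1, while a complex analysis request stays on tier 3 regardless of latency needs.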

💰

60–80% Cost Reduction

Most agent workloads don't need frontier models. MĀRGA routes 60–80% of requests to cheaper tiers without sacrificing output quality.

🔄

Self-Improving

Routing decisions feed back into the advisor. Over time, MĀRGA learns your workload patterns and gets more accurate at classification.

🏗️

Model Agnostic

Works with any LLM provider — OpenAI, Anthropic, Google, local Ollama models. Swap models without changing application code.

📊

Distributed Caching

A Redis-backed distributed cache layer with prompt deduplication, sub-millisecond lookups, cross-node consistency, and automatic TTL management. At 1M+ requests/day, the difference between a 20% and a 60% cache hit rate is the difference between profit and loss.
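
One common way to deduplicate prompts is to hash a normalized form of the prompt and use the hash as the cache key. The sketch below shows that idea with a plain in-memory dictionary standing in for Redis. The `cache_key` and `TTLCache` names, the whitespace-only normalization, and the 300-second default TTL are assumptions for illustration; the page does not describe how MĀRGA builds its keys.

```python
import hashlib
import time
from typing import Optional


def cache_key(model: str, prompt: str) -> str:
    """Collapse whitespace so trivially different prompts share one key."""
    normalized = " ".join(prompt.split())
    return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a Redis cache; entries expire after ttl_s seconds."""

    def __init__(self, ttl_s: float = 300.0):
        self.ttl_s = ttl_s
        self._store = {}  # key -> (expires_at, value)

    def set(self, key: str, value: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._store[key] = (now + self.ttl_s, value)

    def get(self, key: str, now: Optional[float] = None) -> Optional[str]:
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if now >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._store[key]
            return None
        return value
```

The optional `now` argument exists only to make expiry easy to exercise without sleeping; a Redis-backed version would rely on server-side expiry instead.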

8ms

Classification Latency

80%

Cost Reduction

64K

Requests / Second

99.8%

Distributed Cache Hit Rate

Code Changes Required

Performance Validated

From $12K/mo to $2.4K/mo — Without Changing a Line of Code

A stress test on commodity hardware (Apple M2 Max, 32GB RAM) shows MĀRGA sustaining more than 64,000 requests per second at 100 concurrent connections, with a p50 latency of 1.3ms and zero errors at 10,000 concurrent connections.

The Problem

Every request → GPT-4

An AI-powered DevOps platform was sending all requests — from simple “what's the status?” queries to complex incident analysis — to a single frontier model. At 50K requests/day, costs were $12,000/month and climbing 40% month-over-month.

The Solution

MĀRGA as a drop-in proxy

Changed one line: base_url = "https://marga.avyay.ai/v1". The advisor classified each request in 8ms, routing 72% to cheaper tiers. No other application code changes. Zero quality degradation on the tasks that matter.

The Result

80% cost reduction

Monthly LLM costs dropped from $12K to $2.4K. P95 response latency improved by 35%, because cheaper models respond faster. After 30 days of learning, the self-improving advisor routes with 94% accuracy.
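
The headline figure follows directly from the numbers in this case study. The short calculation below reproduces it and also derives an average cost per request. The 30-day month is an assumption made here for the arithmetic; the dollar amounts and request volume come from the text.

```python
before_usd = 12_000        # monthly cost before MĀRGA, from the case study
after_usd = 2_400          # monthly cost after MĀRGA, from the case study
requests_per_day = 50_000  # request volume, from the case study
days_per_month = 30        # assumption: a 30-day month

# Relative cost reduction.
reduction = (before_usd - after_usd) / before_usd

# Average cost per request before and after.
monthly_requests = requests_per_day * days_per_month
cost_per_request_before = before_usd / monthly_requests
cost_per_request_after = after_usd / monthly_requests

print(f"reduction: {reduction:.0%}")
print(f"per request before: ${cost_per_request_before:.4f}")
print(f"per request after:  ${cost_per_request_after:.4f}")
```

Under the 30-day assumption this works out to an 80% reduction, with the average request cost falling from $0.0080 to $0.0016.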

Stress Test Results — Apple M2 Max, 32GB RAM

Concurrency	Throughput	p50	p95	p99
100	64,532 req/s	1.3ms	3.2ms	4.2ms
1,000	59,345 req/s	15.2ms	22.0ms	26.1ms
5,000	47,329 req/s	87.2ms	98.2ms	102.9ms
10,000	49,419 req/s	66.1ms	—	125.0ms

Zero errors at 10,000 concurrent connections. Rate limiter accuracy: 100% (120/120 exact). Cache hit rate: 99.8%.

MĀRGA cost optimization — before and after comparison showing 80% reduction

One Line to Integrate

OpenAI-Compatible API. Change the URL, Keep Your Code.

MĀRGA speaks the OpenAI Chat Completions protocol. Any SDK or tool that works with OpenAI works with MĀRGA — just swap the base URL.

from openai import OpenAI

# Drop-in replacement — just change the base URL
client = OpenAI(
    base_url="https://marga.avyay.ai/v1",
    api_key="your-marga-key"
)

response = client.chat.completions.create(
    model="auto",   # MĀRGA picks the optimal model
    messages=[
        {"role": "user", "content": "Summarize this incident report..."}
    ]
)
print(response.choices[0].message.content)

Get API Key

Change Base URL

Point your OpenAI SDK to https://marga.avyay.ai/v1. It is a one-line change.

Set model: "auto"

MĀRGA picks the best model. Or pin a tier with model: "tier-1".
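
The choice between automatic routing and a pinned tier can be wrapped in a small helper. The sketch below builds the keyword arguments for `client.chat.completions.create()` without making a network call. The `chat_kwargs` helper is hypothetical, and only `"auto"` and `"tier-1"` appear in the text above; treating `"tier-0"`, `"tier-2"`, and `"tier-3"` as valid model names is an assumption based on the four tiers described earlier.

```python
from typing import Optional


def chat_kwargs(prompt: str, tier: Optional[int] = None) -> dict:
    """Build arguments for an OpenAI-style chat completion request.

    tier=None lets MĀRGA choose the model; an integer pins a tier.
    """
    if tier is None:
        model = "auto"
    elif tier in (0, 1, 2, 3):
        # Only "tier-1" is documented; the other names are assumed.
        model = f"tier-{tier}"
    else:
        raise ValueError(f"tier must be 0-3 or None, got {tier!r}")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
```

With a client configured as in the example above, a pinned request would look like `client.chat.completions.create(**chat_kwargs("What's the status?", tier=1))`.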

Simple Pricing