The Advisor Pattern: Why Your AI Agent Needs a Cheaper Brain

The average AI agent team spends 80% of their LLM budget on tasks that don't need frontier models.

Not 80% of their hardest tasks. 80% of all tasks. The “classify this as positive or negative” tasks. The “extract the email from this text” tasks. The “summarize this paragraph” tasks. These all get shipped to a frontier model like Claude Opus (about $15 per million input tokens) or GPT-4o because the routing logic is a single line: model = "claude-opus-4".

The fix isn't switching to a cheaper model. If you do that uniformly, your hard tasks collapse. The fix is knowing when each model should handle each request — and making that decision in under 10 milliseconds.

That's the advisor pattern.

What Is the Advisor Pattern?

The advisor pattern puts a small, fast, cheap model in front of your actual LLM stack. Every request hits the advisor first. The advisor classifies the request across a few dimensions, then routes it to the appropriate model tier.

The classification is crude by design. Three dimensions are enough:

Task type: Is this coding, reasoning, extraction, classification, conversation, or summarization?

Complexity: Is this trivial (a 4B model handles it), simple (7B), moderate (70B or a mid-tier API), or complex (frontier model required)?

Latency requirement: Does this need a response in 500ms (interactive), 10s (batch processing), or 30s+ (deep analysis)?
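
As a rough sketch, the three dimensions fit in a small Python type. The label sets follow the classification prompt in the routing config later in this article; Classification, parse_classification, and the fail-upward SAFE_DEFAULT are illustrative names of our own, not part of any library.

```python
import json
from dataclasses import dataclass

TASKS = {"coding", "reasoning", "extraction",
         "classification", "conversation", "summarization"}
COMPLEXITIES = {"trivial", "simple", "moderate", "complex"}
LATENCIES = {"fast", "normal", "relaxed"}


@dataclass(frozen=True)
class Classification:
    task: str
    complexity: str
    latency: str


# Hypothetical default: if the advisor's reply is unusable, assume the
# request is hard so it is never sent to a model that is too small.
SAFE_DEFAULT = Classification("reasoning", "complex", "normal")


def parse_classification(raw: str) -> Classification:
    """Turn the advisor's JSON reply into a validated Classification."""
    try:
        data = json.loads(raw)
        result = Classification(data["task"], data["complexity"], data["latency"])
        valid = (result.task in TASKS
                 and result.complexity in COMPLEXITIES
                 and result.latency in LATENCIES)
    except (json.JSONDecodeError, KeyError, TypeError):
        # Not JSON, missing a field, or a field of the wrong shape.
        return SAFE_DEFAULT
    return result if valid else SAFE_DEFAULT
```

Failing upward means a garbled advisor reply costs a little extra money rather than producing a wrong answer.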

The advisor itself runs on a tiny model — something like Qwen 3.5 at 4 billion parameters. It processes the classification in 5-8ms on modest hardware. That's the overhead. For that 8ms, you get to route 60-80% of your requests away from your most expensive model.

This is not “just use a cheaper model.” That's a blunt instrument. The advisor pattern is a scalpel. It looks at each request individually and makes a routing decision based on what that specific request actually needs. A simple extraction task goes to a 4B local model. A complex architectural reasoning task goes to Claude Opus. Everything in between gets matched to the cheapest model that can handle it well.

Think of it like a hospital triage nurse. You don't send every patient to the head surgeon. The triage nurse spends 30 seconds assessing severity, then routes: minor injury → nurse practitioner, broken bone → ER doctor, cardiac event → surgeon. The nurse doesn't treat anyone. They just route — fast, cheap, and surprisingly accurate.

Why This Matters Now

Three things changed in the last 12 months that made the advisor pattern go from “nice optimization” to “table stakes”:

Small models got dramatically better. Qwen 3.5 4B outperforms GPT-3.5 on most benchmarks. Phi-3 Mini matches GPT-4-turbo on classification tasks. A year ago, routing to a small model meant accepting garbage output for anything beyond basic extraction. Today, small models handle 60-70% of typical agent workloads at near-frontier quality.

Open-source models are competitive for specific tasks. DeepSeek R1 at 7B handles reasoning chains that would have required a 70B model in 2024. Devstral 24B writes code that passes benchmarks Claude Haiku struggles with. The model landscape isn't a simple ladder anymore; it's a matrix of specializations.

Agent workloads are growing. A single user request to an AI agent might spawn 5-15 sub-tasks: classify intent, retrieve context, summarize documents, generate a response, check for safety, format output. Each sub-task has different requirements. Routing them all to the same model is like using a sledgehammer for every nail, screw, and thumbtack.

At Anthropic's “Code with Claude” event, Michael Shi made an observation that stuck: “The advisor pattern is underrated — by selectively routing to frontier models, teams can achieve near-SOTA behavior at a fraction of the cost.”

At scale — 1M+ requests per day — even a 5% cost reduction translates to thousands of dollars monthly. The advisor pattern doesn't save 5%. It saves 60-80%.

Architecture: How to Build It

Here's the routing architecture we're building at Avyay:

┌─────────────────────────────────────────────────────────────┐
│                    Incoming Request                          │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│              Advisor / Classifier (qwen3.5:4b)              │
│                                                             │
│   Input:  "Debug this race condition in my distributed      │
│            system's message queue"                           │
│                                                             │
│   Output: { task: "coding", complexity: "complex",          │
│             latency: "relaxed" }                             │
│                                                             │
│   Time: ~8ms  |  Cost: ~$0.00001                            │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                    Routing Table                             │
│                                                             │
│   coding + complex + relaxed  →  anthropic/claude-opus-4    │
│   coding + simple  + fast     →  ollama/devstral:24b        │
│   extraction + trivial + fast →  ollama/qwen3.5:4b          │
│   reasoning + moderate + any  →  ollama/deepseek-r1:7b      │
│   summarization + simple      →  ollama/qwen3.5:4b          │
│   conversation + any          →  anthropic/claude-sonnet-4  │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                   Model Tiers                                │
│                                                             │
│   Tier 0 (Trivial)  │ qwen3.5:4b      │ Local  │ Free     │
│   Tier 1 (Simple)   │ deepseek-r1:7b  │ Local  │ Free     │
│   Tier 2 (Moderate) │ devstral:24b    │ Local  │ Free     │
│   Tier 3 (Complex)  │ Claude Sonnet   │ API    │ $3/Mt    │
│   Tier 4 (Frontier) │ Claude Opus     │ API    │ $15/Mt   │
└─────────────────────────────────────────────────────────────┘

The advisor sees every request. It doesn't answer anything. It classifies and routes. The actual work happens downstream.

Two concrete examples of how this plays out:

Example 1: Trivial Task

Request: “Classify this customer message as question, complaint, or feedback”
Advisor: { task: "classification", complexity: "trivial", latency: "fast" }
Route: ollama/qwen3.5:4b (local, free, 200ms response)
Frontier cost if misrouted: $0.003 per request × 10,000 daily = $30/day wasted

Example 2: Complex Task

Request: “Debug this race condition in my distributed system's message queue. Here's the Go code...”
Advisor: { task: "coding", complexity: "complex", latency: "relaxed" }
Route: anthropic/claude-opus-4 (API, $15/Mt, 15-30s response)
Cheap-model cost if misrouted: Hours of developer time debugging hallucinated fixes

The routing decision takes 8ms. The cost difference per request ranges from negligible to 1000x.

Working Example: A Real Routing Configuration

Here's the routing configuration format we use. It's YAML because your infrastructure team will thank you later.

# marga-routing.yaml — MĀRGA intelligent routing config

advisor:
  model: ollama/qwen3.5:4b
  timeout_ms: 50
  prompt_template: |
    Classify this request. Respond with JSON only.
    Categories — task: coding|reasoning|extraction|
                       classification|summarization|conversation
    Categories — complexity: trivial|simple|moderate|complex
    Categories — latency: fast|normal|relaxed
    
    Request: {input}

routing_rules:
  # Tier 0: Trivial — handle locally on smallest model
  - match:
      complexity: trivial
    model: ollama/qwen3.5:4b
    latency_target_ms: 500
    cost_per_1k_tokens: 0.0
    
  # Tier 1: Simple reasoning and summarization
  - match:
      task: [reasoning, summarization]
      complexity: simple
    model: ollama/deepseek-r1:7b
    latency_target_ms: 5000
    cost_per_1k_tokens: 0.0
    
  # Tier 1b: Simple coding tasks
  - match:
      task: coding
      complexity: simple
    model: ollama/devstral:24b
    latency_target_ms: 8000
    cost_per_1k_tokens: 0.0
    
  # Tier 2: Moderate complexity — local heavy model
  - match:
      task: [coding, reasoning]
      complexity: moderate
    model: ollama/qwen3:27b
    latency_target_ms: 15000
    cost_per_1k_tokens: 0.0
    
  # Tier 3: Complex non-coding — Sonnet
  - match:
      complexity: complex
      task: [reasoning, summarization, conversation]
    model: anthropic/claude-sonnet-4
    latency_target_ms: 20000
    cost_per_1k_tokens: 3.0
    
  # Tier 4: Complex coding and frontier reasoning
  - match:
      task: coding
      complexity: complex
    model: anthropic/claude-opus-4
    latency_target_ms: 30000
    cost_per_1k_tokens: 15.0

  # Default fallback
  - match: {}
    model: anthropic/claude-sonnet-4
    latency_target_ms: 15000
    cost_per_1k_tokens: 3.0

fallback_chain:
  - condition: model_unavailable
    strategy: next_tier_up
  - condition: timeout_exceeded
    strategy: next_tier_up
  - condition: all_local_down
    strategy: anthropic/claude-sonnet-4

semantic_cache:
  enabled: true
  similarity_threshold: 0.92
  ttl_seconds: 3600
  embedding_model: ollama/qwen3.5:4b
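
One plausible reading of the routing_rules section is first-match-wins: walk the list top to bottom, treat a list value as "any of", and let the empty match act as the default. That semantics is an assumption on our part. The Python sketch below hardcodes a copy of the rules as plain data instead of parsing the YAML, and rule_matches and route are illustrative names.

```python
from typing import Any

# The article's rules, reduced to the fields the matcher needs.
RULES: list[dict[str, Any]] = [
    {"match": {"complexity": "trivial"}, "model": "ollama/qwen3.5:4b"},
    {"match": {"task": ["reasoning", "summarization"], "complexity": "simple"},
     "model": "ollama/deepseek-r1:7b"},
    {"match": {"task": "coding", "complexity": "simple"},
     "model": "ollama/devstral:24b"},
    {"match": {"task": ["coding", "reasoning"], "complexity": "moderate"},
     "model": "ollama/qwen3:27b"},
    {"match": {"complexity": "complex",
               "task": ["reasoning", "summarization", "conversation"]},
     "model": "anthropic/claude-sonnet-4"},
    {"match": {"task": "coding", "complexity": "complex"},
     "model": "anthropic/claude-opus-4"},
    {"match": {}, "model": "anthropic/claude-sonnet-4"},  # default fallback
]


def rule_matches(match: dict[str, Any], classification: dict[str, str]) -> bool:
    """A rule matches when every key it names agrees with the classification."""
    for key, expected in match.items():
        allowed = expected if isinstance(expected, list) else [expected]
        if classification.get(key) not in allowed:
            return False
    return True  # an empty match has nothing to disagree with


def route(classification: dict[str, str],
          rules: list[dict[str, Any]] = RULES) -> str:
    """Return the model of the first rule that matches."""
    for rule in rules:
        if rule_matches(rule["match"], classification):
            return rule["model"]
    raise LookupError("no rule matched and no default rule is configured")
```

Under this reading, rule order matters: a trivial coding request stops at the first rule and never reaches the coding tiers.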

Now let's trace two requests through this system:

Request: "What's the capital of France?"

Step 1 — Advisor (qwen3.5:4b)
  Input tokens: 42
  Classification: { task: "extraction", complexity: "trivial",
                    latency: "fast" }
  Time: 6ms
  Cost: $0.000000

Step 2 — Routing
  Match: complexity=trivial → ollama/qwen3.5:4b
  Time: <1ms

Step 3 — Execution (qwen3.5:4b)  
  Input tokens: 42
  Output tokens: 8
  Response: "Paris"
  Time: 180ms
  Cost: $0.000000

Total: 187ms, $0.00
Frontier alternative: 2400ms, $0.0007

Request: "Refactor this 200-line Go function to use the 
         strategy pattern. Handle error propagation through
         the chain and add context cancellation support."

Step 1 — Advisor (qwen3.5:4b)
  Input tokens: 1,850 (includes the Go code)
  Classification: { task: "coding", complexity: "complex",
                    latency: "relaxed" }
  Time: 8ms
  Cost: $0.000000

Step 2 — Routing
  Match: task=coding, complexity=complex → anthropic/claude-opus-4
  Time: <1ms

Step 3 — Execution (claude-opus-4)
  Input tokens: 1,850
  Output tokens: 2,400
  Response: [refactored code with strategy pattern]
  Time: 18,200ms
  Cost: $0.064

Total: 18,209ms, $0.064
Cheap-model alternative: 3,200ms, $0.00 — but the output would be wrong

The advisor adds a rounding error of latency and zero meaningful cost. What it saves depends on your traffic distribution — but for most agent workloads, 65-75% of requests are Tier 0-1.
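
The dollar figures in the two traces can be checked with a one-line formula. The sketch assumes a single flat rate applied to input and output tokens alike, which is what the trace numbers imply; real API price sheets often charge more for output tokens, so treat request_cost as an illustration only.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 usd_per_million: float) -> float:
    """Cost of one request at a flat per-token rate (an assumption)."""
    return (input_tokens + output_tokens) * usd_per_million / 1_000_000


# Second trace: 1,850 input + 2,400 output tokens at $15 per million.
opus_cost = request_cost(1_850, 2_400, 15.0)
print(f"${opus_cost:.3f}")  # $0.064

# First trace, if it had gone to the frontier model: 42 + 8 tokens.
frontier_cost = request_cost(42, 8, 15.0)
```

The first trace comes out at $0.00075, in line with the roughly $0.0007 quoted for the frontier alternative.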

What Most People Miss

The advisor pattern has three second-order effects that matter more than the direct cost savings.

1. Routing to the Right Model Improves Quality, Not Just Cost

This is counterintuitive. You'd think sending a simple task to a frontier model would give you the best result. Often, it gives you a worse result.

Ask Claude Opus to classify a message as positive/negative, and you might get a 200-word analysis of sentiment nuances, hedged with caveats, when all you needed was “positive.” Ask a 4B model the same question and you get “positive” in 180ms. The small model's constraint — its inability to overthink — becomes an advantage.

Frontier models hallucinate complexity where there is none. They add unnecessary qualifications to simple answers. They produce verbose responses to trivial questions. For 60% of agent sub-tasks, the “dumber” model is the better model — not despite its limitations, but because of them.

2. Cache Hit Rates Compound with Intelligent Routing

When you route all requests to one model, your prompt cache is a jumbled mess of coding tasks, extractions, conversations, and reasoning chains. Cache hit rates hover around 15-20%.

When you route by task type, something interesting happens. Your coding model sees mostly coding requests. Its prompt cache fills with coding-related prefixes. Cache hit rate climbs to 40-50% because similar requests cluster together.

This is close to free money. Anthropic bills cache reads at about a tenth of the base input price. OpenAI discounts cached input tokens by 50% or more, depending on the model. If your routing pushes cache hit rates from 20% to 50%, that's a 15-25% additional cost reduction on top of the routing savings.

3. The Classifier Itself Can Be Cached

Request patterns repeat. If you've classified “Summarize this meeting transcript” as { task: "summarization", complexity: "simple" } once, you can cache that classification and skip the advisor entirely for similar requests.

A simple embedding-based similarity check (threshold: 0.92) catches 30-40% of incoming requests as “we've classified something like this before.” That's 30-40% of requests that skip even the 8ms advisor overhead.
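
A minimal Python sketch of that similarity check, assuming you already have an embedding for each request. The vectors in the usage example are toy values, ClassificationCache is an illustrative name, and the linear scan stands in for whatever vector index you would use at real volume.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ClassificationCache:
    """Remembers past classifications and reuses them for near-duplicates."""

    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self._entries = []  # list of (embedding, classification) pairs

    def store(self, embedding, classification):
        self._entries.append((list(embedding), classification))

    def lookup(self, embedding):
        """Return the closest cached classification, or None on a miss."""
        best_score, best_label = 0.0, None
        for cached_embedding, label in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_label = score, label
        return best_label if best_score >= self.threshold else None


cache = ClassificationCache(threshold=0.92)
cache.store([1.0, 0.0, 0.0], {"task": "summarization", "complexity": "simple"})
hit = cache.lookup([0.9, 0.1, 0.0])    # close to the stored vector
miss = cache.lookup([0.0, 1.0, 0.0])   # orthogonal to it
```

On a miss, the request goes to the advisor as usual and the result is stored for next time.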

The three effects stack: routing saves 60-80% on model costs, cache clustering saves another 15-25%, and classifier caching saves the already-tiny advisor overhead. The compound effect is significantly larger than any single optimization.

Common Mistakes

We've seen (and made) every one of these. Save yourself the debugging.

1. Making the Classifier Too Smart

The advisor should be fast and crude, not accurate and slow. If your classifier takes 500ms, you've built a second inference step, not a router. The whole point is that classification is a much simpler task than execution.

A 4B model with a well-structured prompt classifies correctly ~90% of the time. That's good enough. The 10% it misclassifies mostly goes one tier off: a “simple” task routes to “moderate,” which still works, just costs a bit more. Catastrophic misrouting (complex task → trivial model) happens <1% of the time. The fallback chain only fires on outages and timeouts, so catching those cases takes a quality signal, such as flagged responses, that triggers escalation.

Don't use a 70B model as your advisor. Don't fine-tune the classifier until you've proven the pattern works with a stock small model. Premature optimization of the optimizer is peak irony.

2. No Fallback Chain

Your local models will go down. The Mac running your 27B model will go to sleep. The GPU will run out of memory because someone left a Jupyter notebook open.

Without a fallback chain, a downed local model means failed requests. With one, it means slightly more expensive requests — which is always the right trade.

fallback_chain:
  - condition: model_unavailable
    strategy: next_tier_up
  - condition: timeout_exceeded  
    strategy: next_tier_up
  - condition: all_local_down
    strategy: anthropic/claude-sonnet-4

Always have a cloud API as your last-resort fallback. Local-only is a cost optimization, not a reliability strategy.
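
In code, next_tier_up can be as simple as walking an ordered list of model callables. This Python sketch assumes each model client raises either the built-in TimeoutError or a ModelUnavailable exception; ModelUnavailable and call_with_fallback are hypothetical names for illustration, not part of any SDK.

```python
from typing import Callable, Sequence


class ModelUnavailable(Exception):
    """Raised by a (hypothetical) model client when its backend is down."""


def call_with_fallback(
    prompt: str,
    tiers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order, cheapest first.

    Returns the name of the tier that answered and its response.
    """
    errors = []
    for name, call in tiers:
        try:
            return name, call(prompt)
        except (ModelUnavailable, TimeoutError) as exc:
            errors.append(f"{name}: {exc!r}")  # remember why we escalated
    raise RuntimeError("all tiers failed: " + "; ".join(errors))
```

Put the cloud API last in the list so it is the tier of last resort. Any other exception type propagates immediately, so real bugs are not masked as outages.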

3. Routing by Model Name Instead of Capability

Don't build routing rules like if complex: use claude-opus-4. Build them like if complex coding: use best_coding_model. Then map best_coding_model to a specific model in a separate config.

Models change every few months. Claude Opus 4 is the best coding model today. In six months, it might be Gemini 2.5 Ultra or something that doesn't exist yet. If your routing logic references model names, every model upgrade requires changing routing rules. If it references capabilities, you update one mapping table.

# Good — routes by capability
capability_map:
  best_coding: anthropic/claude-opus-4
  best_reasoning: anthropic/claude-opus-4
  fast_extraction: ollama/qwen3.5:4b
  balanced: anthropic/claude-sonnet-4

# Bad — model names in routing rules
routing_rules:
  - if: coding AND complex
    model: anthropic/claude-opus-4  # hardcoded, fragile
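
The same indirection, sketched in Python. The capability map mirrors the one above; CAPABILITY_RULES and resolve_model are illustrative names, and example/new-coding-model is a made-up placeholder for whatever replaces today's best coding model.

```python
CAPABILITY_MAP = {
    "best_coding": "anthropic/claude-opus-4",
    "best_reasoning": "anthropic/claude-opus-4",
    "fast_extraction": "ollama/qwen3.5:4b",
    "balanced": "anthropic/claude-sonnet-4",
}

# Routing rules name a capability, never a model (hypothetical rule shape).
CAPABILITY_RULES = {
    ("coding", "complex"): "best_coding",
    ("reasoning", "complex"): "best_reasoning",
    ("extraction", "trivial"): "fast_extraction",
}


def resolve_model(task, complexity, capability_map=CAPABILITY_MAP):
    """Look up the capability for a request, then the model behind it."""
    capability = CAPABILITY_RULES.get((task, complexity), "balanced")
    return capability_map[capability]


# A model upgrade touches one mapping entry and no routing rules.
upgraded = {**CAPABILITY_MAP, "best_coding": "example/new-coding-model"}
```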

4. Over-Optimizing for Cost at the Expense of Quality

It's tempting to push every slider toward “cheaper.” Resist this.

Set a quality floor: define the minimum acceptable quality for each task type, then optimize cost only above that floor. If your extraction tasks need 95% accuracy, and the 4B model delivers 93%, don't route extractions to it — even though it's free. Route to the 7B model that hits 96%, and accept the (still negligible) cost.
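
A quality floor is easy to encode: filter by accuracy first, then minimize cost. In this Python sketch the 93% and 96% figures come from the example above, while the per-token costs, the Sonnet accuracy, and the choice of models behind "4B" and "7B" are placeholder assumptions you would replace with numbers from your own evaluation set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    model: str
    accuracy: float     # measured on your own eval set for this task type
    cost_per_1k: float  # USD per 1k tokens (placeholder values below)


CANDIDATES = [
    Candidate("ollama/qwen3.5:4b", accuracy=0.93, cost_per_1k=0.0),
    Candidate("ollama/deepseek-r1:7b", accuracy=0.96, cost_per_1k=0.001),
    Candidate("anthropic/claude-sonnet-4", accuracy=0.98, cost_per_1k=3.0),
]


def cheapest_above_floor(candidates, floor):
    """Pick the cheapest model whose accuracy meets the floor."""
    eligible = [c for c in candidates if c.accuracy >= floor]
    if not eligible:
        raise ValueError(f"no candidate reaches the quality floor of {floor}")
    # Cheapest first; among equally cheap models prefer the more accurate.
    return min(eligible, key=lambda c: (c.cost_per_1k, -c.accuracy))


choice = cheapest_above_floor(CANDIDATES, floor=0.95)
```

With a 95% floor the free 4B model is filtered out before cost is ever considered, which is the whole point of the floor.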

The advisor pattern should make your system cheaper and better. If you're sacrificing quality for cost, you've over-rotated.

Cost Analysis

Let's run real numbers. Assume a moderately active AI agent platform processing 1 million LLM requests per month. Average request: 500 input tokens, 300 output tokens.

Cost comparison across different model routing strategies

Scenario	Model Mix	Monthly Cost	Quality Score
Everything to Claude Opus	100% frontier	~$15,000	98%
Everything to GPT-4o-mini	100% cheap	~$600	72%
Advisor pattern (mixed routing)	65% local, 25% mid, 10% frontier	~$2,500	94%
Advisor + semantic caching	Same mix, 40% cached	~$1,500	94%

Let's break down the advisor pattern scenario:

Traffic distribution (1M requests):
  650,000 × Tier 0-1 (local, free)        = $0
  250,000 × Tier 2-3 (Sonnet, $3/Mt)      = $600
  100,000 × Tier 4 (Opus, $15/Mt)         = $1,900
  Advisor overhead (1M classifications)     = ~$0 (local model)
                                      Total: ~$2,500

With 40% semantic cache hit rate:
  400,000 cached responses (skip inference) = $0
  390,000 × Tier 0-1 (local, free)         = $0
  150,000 × Tier 2-3 (Sonnet)              = $360
  60,000 × Tier 4 (Opus)                   = $1,140
                                       Total: ~$1,500
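
The cached breakdown follows from scaling every tier by the cache miss rate. A small Python sketch, assuming cache hits are spread evenly across tiers (the breakdown implies this, but your traffic may differ); apply_cache and relative_cost are illustrative names.

```python
def apply_cache(distribution: dict[str, int], hit_rate: float) -> dict[str, int]:
    """Scale each tier's request count by the miss rate; the rest is cached."""
    remaining = {
        tier: round(count * (1 - hit_rate))
        for tier, count in distribution.items()
    }
    remaining["cached"] = sum(distribution.values()) - sum(remaining.values())
    return remaining


def relative_cost(cost: float, baseline: float) -> float:
    """Cost as a fraction of the all-frontier baseline."""
    return cost / baseline


traffic = {"tier_0_1": 650_000, "tier_2_3": 250_000, "tier_4": 100_000}
with_cache = apply_cache(traffic, hit_rate=0.40)
# with_cache has 390,000 / 150,000 / 60,000 requests left, plus 400,000 cached

advisor_share = relative_cost(2_500, 15_000)  # roughly 0.17
cached_share = relative_cost(1_500, 15_000)   # 0.10
```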

The advisor pattern gets you a 94% quality score, against 98% for all-frontier routing, at 17% of the cost. Add semantic caching and the cost drops to 10%.

The quality score isn't uniform. For the 10% of requests that actually need a frontier model, you still get frontier quality. For the 65% that are trivial, you get equivalent or better quality (less over-thinking, faster responses). The 4% quality gap comes from the ~25% of “moderate” requests where a local model is good-enough but not quite as polished as Opus.

For most applications, that's the right trade.

The Decision Tree

For teams evaluating whether the advisor pattern is worth implementing:

Is the advisor pattern worth it for you?

├── Less than 10K requests/month?
│   └── Probably not. Just use Sonnet for everything.
│       The engineering cost exceeds the savings.
│
├── 10K-100K requests/month?
│   └── Maybe. If >50% of requests are simple 
│       (classification, extraction, formatting),
│       the pattern pays for itself in a month.
│
├── 100K-1M requests/month?
│   └── Yes. You're leaving $5,000-12,000/month 
│       on the table without intelligent routing.
│
└── 1M+ requests/month?
    └── This is mandatory. At this scale, the advisor
        pattern is the difference between a viable 
        business and a cloud bill that eats your margin.

What We're Building

Everything described in this article is the design philosophy behind MĀRGA (मार्ग) — the intelligent routing layer we're building at Avyay.

MĀRGA is Sanskrit for “path” — the idea that every request should find its optimal path through your model infrastructure. Not the most expensive path. Not the cheapest. The right one.

What MĀRGA does:

Drop-in OpenAI API replacement. Point your existing code at MĀRGA instead of api.openai.com. No SDK changes. No refactoring. It speaks the OpenAI API format and routes transparently.
Classifies every request using a local advisor model. Task type, complexity, and latency requirements are all determined in <10ms.
Routes to the optimal model from your configured pool. Local Ollama models, cloud APIs, self-hosted endpoints — MĀRGA treats them all as interchangeable targets with different cost/quality/latency profiles.
Caches semantically similar requests so repeated patterns skip inference entirely. Cache hit rates of 40-60% are typical for production agent workloads.
Learns from feedback. When a routed response gets flagged as poor quality, MĀRGA adjusts the routing rules. Over time, the classifier gets better at routing edge cases — without manual tuning.
Falls back gracefully. If a local model is unavailable, requests automatically escalate to the next tier. No failed requests. No manual intervention.

We're running this on the same distributed infrastructure from our first article — three machines, multiple local models, cloud API fallbacks. The advisor pattern is what turns a collection of models into an intelligent system.

Start Here

If you're spending more than $500/month on LLM APIs and more than half your requests are simple tasks, the advisor pattern will pay for itself in the first week.

The implementation is straightforward:

1. Deploy a small local model (Qwen 3.5 4B runs on anything with 4GB RAM)
2. Write a classification prompt (the one in the routing config above works)
3. Build a routing table (start with 3 tiers: local-small, local-large, cloud-frontier)
4. Add a fallback chain (always end with a cloud API)
5. Measure. Adjust thresholds based on actual quality scores.

You don't need our tool to start. A 50-line script that classifies and routes is enough to validate the pattern with your traffic.
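
Here is one way such a script could start, as a hedged Python sketch. It assumes Ollama is serving its generate endpoint on the default local port and that a model tagged qwen3.5:4b has been pulled. The three tier names follow the steps above; build_payload, classify, and pick_tier are illustrative names, and the transport argument exists only so the logic can be exercised without a running server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

PROMPT = (
    "Classify this request. Respond with JSON only.\n"
    "task: coding|reasoning|extraction|classification|summarization|conversation\n"
    "complexity: trivial|simple|moderate|complex\n"
    "latency: fast|normal|relaxed\n\n"
    "Request: {input}"
)


def build_payload(user_input: str, model: str = "qwen3.5:4b") -> dict:
    """Request body for a single, non-streamed, JSON-formatted reply."""
    return {
        "model": model,
        "prompt": PROMPT.format(input=user_input),
        "stream": False,
        "format": "json",
    }


def post_json(url: str, payload: dict, timeout_s: float = 5.0) -> dict:
    """POST a JSON body and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)


def classify(user_input: str, transport=post_json) -> dict:
    """Ask the advisor model for a classification and parse its reply."""
    reply = transport(OLLAMA_URL, build_payload(user_input))
    return json.loads(reply["response"])  # generated text is in "response"


def pick_tier(classification: dict) -> str:
    """Map complexity onto the three starter tiers."""
    complexity = classification.get("complexity")
    if complexity in ("trivial", "simple"):
        return "local-small"
    if complexity == "moderate":
        return "local-large"
    return "cloud-frontier"  # complex or unrecognized: fail upward
```

From here, pick_tier(classify(request)) tells you which client to call. Log the classification next to the final response so you can measure how often the routing was right.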

But if you want the full system — classification, routing, caching, feedback loops, and an OpenAI-compatible API — that's MĀRGA. We're building it in the open.

Follow along at avyay.ai.

This is the second article in Avyay's infrastructure series. The first, “Running 3 LLMs Across 3 Machines for $0/month”, covers the distributed inference setup that MĀRGA routes across. Next up: semantic caching — why your LLM is answering the same question 50 times a day.

Sources & Tools Referenced

Ollama — local model inference server
Tailscale — WireGuard-based mesh VPN
DeepSeek R1 — reasoning model
Devstral — Mistral's coding model
Qwen3 — Alibaba's general-purpose model
OpenClaw — AI agent orchestration platform