Avyay AI — The Imperishable Intelligence Platform

Most “AI platform” architectures you see on conference slides are aspirational fiction. Clean boxes connected by straight arrows, each labeled with a service name that implies a dedicated team of six engineers. The reality behind those slides is usually a monolith with a REST API and a dream.

Our architecture is different — not because it’s cleaner, but because it actually runs. In production. Right now. With two people.

Avyay (अव्यय, Sanskrit for “imperishable”) is a distributed AI platform built from five microservices, three of them with Sanskrit names, each with a specific job. They run across consumer hardware and cloud functions, connected by a Tailscale mesh network and orchestrated by an autonomous build engine that ships code at 3 AM.

This article is the technical deep-dive. No hand-waving. No “we’ll cover that in a future post.” Every component, every routing decision, every trade-off — laid bare.

The Five Pillars: Why Microservices for a Two-Person Company

The conventional wisdom says microservices are for large teams. You need service ownership, on-call rotations, inter-team contracts. A two-person company should build a monolith.

We disagree, and here’s why: AI agents don’t care about your team topology.

When your primary “engineers” are AI coding agents dispatched by an autonomous build engine, the arguments against microservices invert:

Context boundaries help agents. A monolith with 50,000 lines of Go is worse for a coding agent than five services with 5,000 lines each. The agent can load the entire service into context, understand it fully, and make changes without side effects it can’t predict.
Independent deployment means independent testing. Each service has its own CI pipeline, its own integration tests, its own deployment target. An agent can ship RAKṢĀ without risking MĀRGA’s uptime.
Language diversity becomes free. MĀRGA is Go (performance-critical routing). DevOps RAG is Python (ML ecosystem). SIDDHI is TypeScript (rapid prototyping). With agents writing the code, the “we can only hire for one language” constraint disappears.
Failure isolation is worth the complexity. When SIDDHI’s hypothesis engine encounters a rate limit from Reddit’s API, MĀRGA keeps routing LLM requests without a blip.

Here are the five services:

Service	Sanskrit	Language	Purpose
MĀRGA	मार्ग (path)	Go	LLM routing, cost optimization, failover
DevOps RAG	—	Python	Runbook retrieval, incident intelligence
RAKṢĀ	रक्षा (protection)	Python	Code security scanning, SAST/SCA
SIDDHI	सिद्धि (accomplishment)	TypeScript	Autonomous PMF discovery engine
Gateway	—	Go	API gateway, auth, rate limiting

MĀRGA: The LLM Router That Saves You 40% on API Costs

MĀRGA (मार्ग, “path” or “way”) is the first service every LLM request touches once it has passed the Gateway. It’s an enterprise-grade LLM router that sits between your application and every AI provider (OpenAI, Anthropic, local Ollama instances, Azure OpenAI) and makes intelligent routing decisions.

Why Not Just Call the API Directly?

Because “call the API directly” fails in three predictable ways:

Cost explosion. Claude Opus 4 costs $15 per million input tokens. If every request, including simple classification tasks, goes to Opus, you’re burning money. MĀRGA routes simple queries to cheap models (Qwen 4B locally, $0/request) and reserves expensive models for complex reasoning.

Single point of failure. In our experience, OpenAI has a noticeable incident roughly once every two weeks. MĀRGA maintains a provider pool with circuit breakers, automatic failover, and health checks.

Zero visibility. When you call APIs directly from 15 different places in your codebase, you have no idea what you’re spending. MĀRGA provides centralized metrics, traces, and cost attribution, all shipped to Datadog via native APM integration.

The Routing Engine

type RoutingEngine struct {
    CostOptimizer    *CostOptimizer
    ABTestManager    *ABTestManager
    ComplianceFilter *ComplianceFilter
    ProviderPool     *ProviderPool
    CircuitBreakers  map[string]*CircuitBreaker
}

func (r *RoutingEngine) Route(ctx context.Context, req *LLMRequest) (*Provider, error) {
    // Step 1: Compliance filtering (PII detection, data residency)
    if err := r.ComplianceFilter.Check(req); err != nil {
        return nil, fmt.Errorf("compliance: %w", err)
    }
    
    // Step 2: A/B test assignment (if active experiments)
    if provider := r.ABTestManager.Assign(req); provider != nil {
        return provider, nil
    }
    
    // Step 3: Cost-based tier selection
    tier := r.CostOptimizer.SelectTier(req)
    
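    // Step 4: Select healthy provider from tier
    providers := r.ProviderPool.GetByTier(tier)
    for _, p := range providers {
        // A provider with no registered breaker is treated as healthy
        // instead of calling Allow() on a nil *CircuitBreaker.
        cb, ok := r.CircuitBreakers[p.Name]
        if !ok || cb.Allow() {
            return p, nil
        }
    }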
    
    // Step 5: Fallback chain
    return r.ProviderPool.GetFallback()
}

The cost optimizer estimates query complexity from input length, presence of system prompts, and request metadata. Simple queries go to Tier 1 (local models, $0). Medium complexity goes to Tier 2 (Sonnet, Haiku). Complex reasoning goes to Tier 3 (Opus, GPT-4).

In practice, this saves 30-40% on API costs. The key insight: 80% of LLM calls in a typical application are simple enough for a $0 local model.
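To make the tier selection concrete, here is a minimal sketch in Python (the production optimizer is Go). It scores a request on the three signals named above. The thresholds, the `needs_reasoning` flag standing in for request metadata, and the function names are all assumptions for illustration, not MĀRGA’s real heuristics.

```python
from dataclasses import dataclass

# Hypothetical cut-offs; the real thresholds are not stated in this article.
TIER_2_SCORE = 2
TIER_3_SCORE = 4


@dataclass
class LLMRequest:
    prompt: str
    system_prompt: str = ""
    needs_reasoning: bool = False  # assumed stand-in for "request metadata"


def complexity_score(req: LLMRequest) -> int:
    """Combine input length, system prompt presence, and metadata into a score."""
    score = 0
    words = len(req.prompt.split())
    if words > 200:
        score += 1
    if words > 2000:
        score += 2
    if req.system_prompt:
        score += 1
    if req.needs_reasoning:
        score += 3
    return score


def select_tier(req: LLMRequest) -> int:
    """Map the score to Tier 1 (local), Tier 2 (mid-range), or Tier 3 (frontier)."""
    score = complexity_score(req)
    if score >= TIER_3_SCORE:
        return 3
    if score >= TIER_2_SCORE:
        return 2
    return 1
```

A short classification prompt with no system prompt scores 0 and stays on the free local tier. The same prompt flagged for reasoning and given a system prompt scores 4 and is escalated to Tier 3.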

Circuit Breakers and Failover

MĀRGA implements a three-state circuit breaker per provider:

CLOSED (normal) → OPEN (provider failing) → HALF-OPEN (testing recovery)

When a provider returns 5 consecutive errors (or error rate exceeds 50% over a 60-second window), the circuit trips to OPEN. After a configurable cooldown (default: 30 seconds), the circuit moves to HALF-OPEN and sends a single probe request. If it succeeds, the circuit closes. If it fails, it reopens.

This means MĀRGA can survive an OpenAI outage with zero user-visible impact — requests automatically route to Anthropic or local models within seconds.
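The state transitions above can be sketched in a few lines of Python. This is an illustration, not MĀRGA’s Go implementation: it covers the consecutive-error trigger and the single-probe HALF-OPEN state, and leaves out the 50% error-rate window. The class and method names are assumptions.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so tests need not sleep
        self.state = State.CLOSED
        self.consecutive_failures = 0
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow(self) -> bool:
        """Return True if a request may be sent to this provider."""
        if self.state is State.CLOSED:
            return True
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                return False
            # Cooldown elapsed: start testing recovery.
            self.state = State.HALF_OPEN
            self.probe_in_flight = False
        # HALF_OPEN: let exactly one probe request through.
        if self.probe_in_flight:
            return False
        self.probe_in_flight = True
        return True

    def record_success(self) -> None:
        self.state = State.CLOSED
        self.consecutive_failures = 0
        self.probe_in_flight = False

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        failed_probe = self.state is State.HALF_OPEN
        if failed_probe or self.consecutive_failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
            self.probe_in_flight = False
```

The caller checks `allow()` before each request and reports the outcome with `record_success()` or `record_failure()`. A failed probe reopens the circuit and restarts the cooldown.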

Multi-Protocol: Drop-In OpenAI Replacement

from openai import OpenAI

client = OpenAI(
    base_url="http://marga.internal:8080/v1",
    api_key="your-marga-key"
)

# This request gets intelligently routed
response = client.chat.completions.create(
    model="auto",  # Let MĀRGA choose
    messages=[{"role": "user", "content": "Summarize this document..."}]
)

The model="auto" parameter tells MĀRGA to use its routing engine. You can also pin to a specific model and MĀRGA will handle failover within that model class.

DevOps RAG: Turning Write-Only Runbooks Into On-Call Intelligence

Every engineering organization has runbooks. Nobody reads them at 3 AM when production is down.

DevOps RAG is our retrieval-augmented generation system purpose-built for operational knowledge. It ingests runbooks, incident postmortems, deployment guides, and troubleshooting procedures — then makes them queryable in natural language, directly from your coding environment via MCP (Model Context Protocol).

The MCP-First Architecture

Runbooks (Git repos) → Chunking → Embeddings → Vector Store (Pinecone)
                                                      ↓
User Query → Embedding → Similarity Search → Top-K Chunks → LLM → Answer
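The retrieval half of that pipeline can be illustrated with a toy, in-memory version. This sketch assumes a bag-of-words “embedding” and a plain Python list in place of a learned embedding model and Pinecone, so it shows the shape of chunking and top-k similarity search rather than the production code. All function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system calls an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def chunk(text: str, max_words: int = 50) -> list[str]:
    """Split a runbook into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by similarity to the query and keep the best k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

The chunks returned by `top_k` are what would be handed to the LLM as context for the final answer.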

Two design choices make DevOps RAG different:

Git-native ingestion. Runbooks live in Git, not a wiki. Every time a runbook changes (PR merged), a webhook triggers re-indexing. DevOps RAG always has the latest version — unlike a Confluence page last updated in 2023.

MCP-first interface. DevOps RAG has no UI. Zero. Its only interface is MCP, exposing three tools: ask_devops, search_runbooks, and list_sources. Any MCP-compatible agent (Claude Code, Cursor, Codex, OpenClaw) can query your operational knowledge without leaving the IDE.

{
  "mcpServers": {
    "devops-rag": {
      "command": "node",
      "args": ["/path/to/devops-rag-mcp/index.js"],
      "env": {
        "DEVOPS_RAG_URL": "https://devops-rag.avyay.ai"
      }
    }
  }
}

In an AI-native stack, the primary consumer of knowledge systems isn’t a human with a browser — it’s an agent with a task. The agent needs structured, citation-backed answers it can act on, not a search results page.

RAKṢĀ: Security Scanning That Runs on Every Commit

When AI agents write code, they introduce a new class of security risk. An agent doesn’t know that os.system(user_input) is a command injection vulnerability — it just knows the code compiles and the tests pass.

We caught our own coding agents committing hardcoded API keys, fabricating email addresses (a dev@avyay.ai that didn’t exist), and writing SQL queries with string concatenation.

Three Interfaces, One Engine

RAKṢĀ ships in three forms, all backed by the same cloud scanning engine:

1. CLI — For local development. Rich terminal output with color-coded severity tables.

$ raksha scan --severity critical --severity high
🛡️ RAKṢĀ Security Scan
━━━━━━━━━━━━━━━━━━━━━━
Scanning: /home/user/project
Files analyzed: 147 | Time: 3.2s

┌────────┬──────────┬──────────────────────────────────┐
│ Sev    │ File     │ Finding                          │
├────────┼──────────┼──────────────────────────────────┤
│ CRIT   │ auth.py  │ Hardcoded secret in source       │
│ HIGH   │ api.py   │ SQL injection via string concat  │
└────────┴──────────┴──────────────────────────────────┘

2. GitHub Action — Drop-in CI integration. Uploads SARIF to GitHub Code Scanning and fails builds if findings exceed the severity threshold.

- name: Run RAKṢĀ Security Scan
  uses: avyay/raksha-scan-action@v1
  with:
    severity_threshold: high

- name: Upload SARIF to GitHub Security
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: raksha-results.sarif

3. Cloud API — For programmatic access. Send code, get SARIF back. This is what the CLI and GitHub Action use under the hood.

Existing scanners (Snyk, SonarQube) are designed for human review workflows: scan → dashboard → assign → fix. When agents write code, you need agent-oriented workflows: scan → block → report → auto-fix. RAKṢĀ’s SARIF output feeds directly into the agent’s context, so the same agent that wrote the vulnerable code can fix it in the same session.
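As a sketch of the scan → block → report step, the Python below reads a SARIF document, keeps the results at or above a threshold, and turns them into text an agent could act on. The field names (`runs`, `results`, `level`, `ruleId`, `locations`) come from the SARIF 2.1.0 format. The threshold mapping, the rule IDs in the usage example, and the prompt wording are assumptions, not RAKṢĀ’s actual behavior.

```python
# SARIF result levels, ordered from least to most severe.
LEVEL_RANK = {"none": 0, "note": 1, "warning": 2, "error": 3}


def blocking_findings(sarif: dict, threshold: str = "error") -> list[dict]:
    """Collect results whose level is at or above the threshold."""
    findings = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # A result without an explicit level is treated as "warning" here.
            level = result.get("level", "warning")
            if LEVEL_RANK.get(level, 0) < LEVEL_RANK[threshold]:
                continue
            locations = result.get("locations") or [{}]
            uri = (locations[0].get("physicalLocation", {})
                   .get("artifactLocation", {})
                   .get("uri", "<unknown>"))
            findings.append({
                "rule": result.get("ruleId", "<unknown>"),
                "file": uri,
                "message": result.get("message", {}).get("text", ""),
            })
    return findings


def fix_prompt(findings: list[dict]) -> str:
    """Render findings as instructions for the agent that wrote the code."""
    lines = [f"- {f['file']}: [{f['rule']}] {f['message']}" for f in findings]
    return "Fix these findings before committing:\n" + "\n".join(lines)
```

A CI step could fail the build whenever `blocking_findings` returns a non-empty list and pass `fix_prompt(...)` back into the agent’s session.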

SIDDHI: The Autonomous PMF Engine

SIDDHI (सिद्धि — “accomplishment”) is the most ambitious component: an autonomous engine that discovers product-market fit without human intervention.

It runs a continuous five-stage pipeline:

┌─────────┐    ┌─────────┐    ┌──────────┐    ┌─────────┐    ┌─────────┐
│  MINE   │───▶│  SCORE  │───▶│ GENERATE │───▶│  TEST   │───▶│  LEARN  │
│         │    │         │    │          │    │         │    │         │
│ Reddit  │    │ LLM     │    │ Landing  │    │ Micro   │    │Bayesian │
│ HN      │    │ scoring │    │ pages    │    │ ads     │    │ stats   │
│ Twitter │    │ market  │    │ AI copy  │    │ Google  │    │ Cross-  │
│ PH      │    │ sizing  │    │ deploy   │    │ Meta    │    │ learn   │
└─────────┘    └─────────┘    └──────────┘    └─────────┘    └─────────┘

The orchestrator manages this pipeline through a state machine where every hypothesis follows a lifecycle: DISCOVERED → SCORED → GENERATING → TESTING → ANALYZING — then either PROMISING (iterate), VALIDATED (graduate to product), or KILLED (insufficient signal).
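The lifecycle can be written down as an explicit transition table. This Python sketch is illustrative (SIDDHI itself is TypeScript), and it assumes that a PROMISING hypothesis loops back to GENERATING for another iteration. The article only says “iterate”, so that edge is a guess.

```python
from enum import Enum, auto


class Stage(Enum):
    DISCOVERED = auto()
    SCORED = auto()
    GENERATING = auto()
    TESTING = auto()
    ANALYZING = auto()
    PROMISING = auto()
    VALIDATED = auto()
    KILLED = auto()


# Allowed next stages for each stage. VALIDATED and KILLED are terminal.
TRANSITIONS = {
    Stage.DISCOVERED: {Stage.SCORED},
    Stage.SCORED: {Stage.GENERATING},
    Stage.GENERATING: {Stage.TESTING},
    Stage.TESTING: {Stage.ANALYZING},
    Stage.ANALYZING: {Stage.PROMISING, Stage.VALIDATED, Stage.KILLED},
    Stage.PROMISING: {Stage.GENERATING},  # assumption: iterate with new pages
    Stage.VALIDATED: set(),
    Stage.KILLED: set(),
}


class Hypothesis:
    def __init__(self, title: str):
        self.title = title
        self.stage = Stage.DISCOVERED
        self.history = [Stage.DISCOVERED]

    def advance(self, new_stage: Stage) -> None:
        """Move to new_stage, rejecting any transition not in the table."""
        if new_stage not in TRANSITIONS[self.stage]:
            raise ValueError(f"illegal transition {self.stage.name} -> {new_stage.name}")
        self.stage = new_stage
        self.history.append(new_stage)
```

Keeping the table separate from the orchestration code means an illegal jump, such as DISCOVERED straight to TESTING, fails loudly instead of silently spending ad budget.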

A budget governor enforces hard spending limits — per hypothesis ($30 max), per day, per month. SIDDHI routes all its LLM calls through MĀRGA, creating a natural cost hierarchy: scoring uses free local models, copy generation uses Sonnet (~$3/M tokens), and kill/expand decisions use Opus (~$15/M tokens).
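A budget governor of this kind reduces to three running totals and one check before every spend. In the Python sketch below, the $30 per-hypothesis cap comes from the text. The daily and monthly limits, the class name, and the method name are placeholders, since the real values are not given here.

```python
from collections import defaultdict
from datetime import date


class BudgetGovernor:
    def __init__(self, per_hypothesis=30.0, per_day=100.0, per_month=1500.0):
        # per_day and per_month defaults are hypothetical placeholders.
        self.per_hypothesis = per_hypothesis
        self.per_day = per_day
        self.per_month = per_month
        self.by_hypothesis = defaultdict(float)
        self.by_day = defaultdict(float)
        self.by_month = defaultdict(float)

    def try_spend(self, hypothesis_id: str, amount: float, today: date) -> bool:
        """Record the spend and return True only if no limit would be exceeded."""
        month = (today.year, today.month)
        if (self.by_hypothesis[hypothesis_id] + amount > self.per_hypothesis
                or self.by_day[today] + amount > self.per_day
                or self.by_month[month] + amount > self.per_month):
            return False
        self.by_hypothesis[hypothesis_id] += amount
        self.by_day[today] += amount
        self.by_month[month] += amount
        return True
```

The limits are hard because the check happens before the money is committed: a rejected spend leaves all three totals unchanged.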

The Gateway: Auth, Rate Limiting, and the API Surface

The Gateway is the entry point for all external traffic. It’s a Go service that handles authentication (API keys + JWT), rate limiting (per key, per endpoint, per tier), usage tracking, and request routing.

Client Request
      │
      ▼
┌──────────┐
│ Gateway  │
│          │
│ Auth     │──▶ Reject (401)
│ Rate     │──▶ Reject (429)
│ Route    │
│ Proxy    │
└────┬─────┘
     │
     ├── /v1/chat/* ──────▶ MĀRGA
     ├── /v1/scan/* ──────▶ RAKṢĀ
     ├── /v1/ask/*  ──────▶ DevOps RAG
     └── /v1/siddhi/* ───▶ SIDDHI

The Gateway is deliberately thin. Separating it from MĀRGA means: MĀRGA stays focused on routing (not JWT validation), other services get auth for free, and rate limiting is centralized across all services.
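The order of checks in the diagram (auth, then rate limit, then route) can be sketched as follows. This is a Python illustration of the idea, not the Go service. The token-bucket limiter, the upstream names, the 404 for unknown paths, and the use of 200 to mean “would be proxied” are all assumptions.

```python
import time

# Path prefixes from the diagram mapped to hypothetical upstream names.
ROUTES = [
    ("/v1/chat/", "marga"),
    ("/v1/scan/", "raksha"),
    ("/v1/ask/", "devops-rag"),
    ("/v1/siddhi/", "siddhi"),
]


class TokenBucket:
    def __init__(self, capacity, refill_per_second, clock):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def take(self) -> bool:
        """Refill based on elapsed time, then try to consume one token."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Gateway:
    def __init__(self, api_keys, capacity=2, refill_per_second=1.0, clock=time.monotonic):
        self.api_keys = set(api_keys)
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.buckets = {}  # one bucket per API key

    def handle(self, path, api_key):
        """Return (status, upstream) following the auth -> rate -> route order."""
        if api_key not in self.api_keys:
            return 401, None
        if api_key not in self.buckets:
            self.buckets[api_key] = TokenBucket(self.capacity, self.refill_per_second, self.clock)
        if not self.buckets[api_key].take():
            return 429, None
        for prefix, upstream in ROUTES:
            if path.startswith(prefix):
                return 200, upstream
        return 404, None
```

Because the limiter sits in front of the route table, every backend service inherits the same per-key limits without implementing them.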

The Mesh: How Everything Connects

The five services don’t run in a Kubernetes cluster. They run across a mix of cloud functions and consumer hardware, connected by a Tailscale mesh network.

┌─────────────────────────────────────────────────────────────┐
│                 Tailscale Mesh (WireGuard)                   │
│                                                              │
│  ┌──────────────────┐     ┌────────────────────┐            │
│  │ Linux Gateway    │     │ Google Cloud Run    │            │
│  │ ThinkPad X1      │     │ (asia-southeast1)   │            │
│  │                  │     │                     │            │
│  │ • Orchestrator   │     │ • MĀRGA (Go)        │            │
│  │ • Build Engine   │     │ • RAKṢĀ (Python)    │            │
│  │ • Knowledge Graph│     │ • DevOps RAG (Py)   │            │
│  │ • Ollama (Qwen)  │     │ • Gateway (Go)      │            │
│  │ • PostgreSQL     │     │                     │            │
│  └────────┬─────────┘     └──────────┬──────────┘            │
│           │                          │                       │
│           │    ┌─────────────────────┘                       │
│           │    │                                             │
│  ┌────────┴────┴────┐     ┌────────────────────┐            │
│  │ MacBook Pro M1   │     │ MacBook Pro Intel   │            │
│  │ (64GB RAM)       │     │ (16GB RAM)          │            │
│  │                  │     │                     │            │
│  │ • Qwen3 27B     │     │ • DeepSeek R1 7B    │            │
│  │ • Devstral 24B  │     │ • Codex CLI         │            │
│  │ • Build Agent   │     │ • Build Agent       │            │
│  └──────────────────┘     └────────────────────┘            │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Why Tailscale? Three reasons: NAT traversal (MacBooks behind residential NAT), encryption by default (WireGuard, no TLS cert management), and identity-based access (stable MagicDNS hostnames that survive IP changes).

The Autonomous Build Engine

The most unusual part of our architecture isn’t any individual service — it’s how the services get built.

The build engine runs a simple loop: check the queue → if low, generate new tasks by analyzing the codebase → dispatch eligible tasks to idle nodes → collect results → update dependency graph → repeat. Every 30 seconds, forever.
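One pass of that loop might look like the Python sketch below. The task fields (`id`, `status`, `depends_on`, `node`), the low-water mark, and the node names in the usage example are assumptions for illustration. Result collection and the 30-second sleep are left to the caller.

```python
def eligible(task, done_ids):
    """A task can run once it is pending and all its dependencies are done."""
    return task["status"] == "pending" and all(d in done_ids for d in task["depends_on"])


def tick(tasks, idle_nodes, generate_tasks, low_water=2):
    """Run one loop iteration and return the (task_id, node) assignments made."""
    # 1. Check the queue; if it is running low, ask for new tasks.
    pending = [t for t in tasks if t["status"] == "pending"]
    if len(pending) < low_water:
        tasks.extend(generate_tasks())

    # 2. Dispatch eligible tasks to idle nodes, in queue order.
    done_ids = {t["id"] for t in tasks if t["status"] == "done"}
    nodes = list(idle_nodes)
    assignments = []
    for task in tasks:
        if not nodes:
            break
        if eligible(task, done_ids):
            task["status"] = "running"
            task["node"] = nodes.pop(0)
            assignments.append((task["id"], task["node"]))
    return assignments
```

Calling `tick` repeatedly, and marking tasks `done` as agents report back, is enough to walk a dependency graph: a task whose dependency is still running is simply skipped until a later pass.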

The task queue is a JSON file. Not Redis, not Kafka. At our scale (<100 tasks/day), a JSON file is the right choice — human-readable, git-friendly, zero operational overhead.

Each task is executed by a coding agent (Claude Code or Codex CLI) running on a Mac node. The agent receives the task description, codebase context, and instructions to commit when done. A typical day sees 9 tasks shipped across 5 codebases — with zero human context-switching.

What Most People Miss: The Compound Effects

Each service in isolation is useful but not revolutionary. The architecture becomes interesting when you look at the interactions:

MĀRGA + RAKṢĀ = Cost-aware security scanning. RAKṢĀ routes its LLM calls through MĀRGA. Complex findings get analyzed by expensive models. Simple pattern matches get classified by cheap ones. Security scanning costs drop 40%.
DevOps RAG + MĀRGA = Self-documenting deployment. When deployment fails, the on-call agent queries DevOps RAG for runbooks, uses MĀRGA for analysis, and produces remediation steps in seconds.
SIDDHI + MĀRGA = Autonomous market discovery at controlled cost. 80% of SIDDHI’s calls go to free local models. Total cost per hypothesis stays under $25.
Build Engine + All Services = Continuous improvement. The build engine generates tasks for any service, reads codebases, identifies improvements, and dispatches agents — even while we sleep.

The Trade-offs: What We Got Wrong (So Far)

The JSON task queue should have been SQLite from day one. Concurrent writes from multiple agents occasionally cause corruption. SQLite would give us transactions for zero additional complexity.
Cloud Run cold starts matter more than expected. MĀRGA’s cold start is ~1.8 seconds. For a router, that’s too much. We’re evaluating min-instances (~$15/month) to keep one instance warm.
Agent-written code needs architectural review, not just pattern scanning. RAKṢĀ catches known vulnerabilities, but agents write code that’s technically secure but architecturally wrong — session tokens in URL parameters, GET requests for mutations. Pattern scanning can’t catch these.
Tailscale is great until a node goes offline. When a MacBook lid closes, it drops off the mesh. It took two weeks of “why did that task vanish?” debugging to get the retry logic right.
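For the first trade-off above, this is roughly what a SQLite-backed queue could look like, using only Python’s standard `sqlite3` module. It is a sketch of the idea rather than code we run: the table layout and function names are hypothetical. The key point is that claiming a task happens inside one write transaction, so two agents cannot take the same task or leave the queue half-written.

```python
import sqlite3


def open_queue(path=":memory:"):
    # isolation_level=None puts the connection in autocommit mode,
    # so transactions are controlled explicitly with BEGIN/COMMIT.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id INTEGER PRIMARY KEY,"
        " description TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " node TEXT)"
    )
    return conn


def enqueue(conn, description):
    cur = conn.execute("INSERT INTO tasks (description) VALUES (?)", (description,))
    return cur.lastrowid


def claim_next(conn, node):
    """Atomically mark the oldest pending task as running on the given node."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, description FROM tasks"
            " WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE tasks SET status = 'running', node = ? WHERE id = ?",
                (node, row[0]),
            )
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

With a file path instead of `:memory:`, several processes on the same machine can share the queue. SQLite serializes the writers, which is the property the JSON file lacks.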

Common Mistakes When Building Distributed AI Platforms

Putting the router inside the application. Extract routing into a service. Let applications be dumb clients.
Treating security as a post-deploy check. Scan on every commit — before code reaches any environment.
Building RAG as a feature instead of a service. If your RAG is a module inside your app, it can’t be consumed by other services or agents.
Ignoring the cost of AI-writing-AI. The build engine uses LLM calls to generate tasks, which use LLM calls to write code. Without routing, daily spend would be 3-4x higher.
Over-orchestrating. We tried Temporal. We tried Airflow. The JSON queue + polling loop works better because it’s debuggable. When a task fails, you open a JSON file and read what happened.

What’s Next: The Feedback Loop

The next phase is closing the observability loop. Right now, services produce telemetry that goes to Datadog. Humans look at dashboards and decide what to fix.

The goal: agents look at the telemetry and fix things automatically.

Agent writes code → Deploys → MĀRGA routes → Datadog traces →
  Agent reads traces → Identifies issue → Writes fix → Deploys → ...

MĀRGA already has native Datadog APM integration. Every LLM request produces a trace with model, latency, cost, and token counts. The Datadog MCP server makes this data available to coding agents. The pieces for a self-improving platform are in place.

Try It Yourself

MĀRGA — GitHub | docker pull ghcr.io/gaurav21/marga
RAKṢĀ CLI — pip install raksha-cli | GitHub Action
DevOps RAG — Available as an MCP server for Claude Code, Cursor, and OpenClaw
Avyay Platform — avyay.ai | Join the alpha

The architecture isn’t complicated. Five services, a mesh network, and an autonomous build engine. The hard part isn’t building it — it’s resisting the urge to make it more complicated than it needs to be.

Gaurav Sharma is the founder of Avyay (अव्यय). He builds distributed AI systems using consumer hardware and too much coffee. Follow the build at avyay.ai/blog.

The Avyay Architecture: 5 Microservices, Zero Engineers