Case Study · DevOps RAG · May 2026

From 45-Minute MTTR to 15 Minutes with AI-Powered Runbooks

Runbooks scattered across wikis, Google Docs, and Slack threads. Engineers wasting 30 minutes finding the right procedure during incidents. Here’s how we fixed it.

67% MTTR reduction · 45 knowledge chunks · 18 source runbooks · <1s query response

The Challenge: Write-Only Runbooks

Every engineering organization writes runbooks. They live in Confluence, Google Docs, GitHub wikis, Notion pages, and Slack bookmarks. Teams spend weeks writing detailed procedures for every failure scenario.

Then an incident happens at 3 AM, and nobody reads them.

The problem isn’t that runbooks don’t exist. It’s that finding the right runbook under pressure is harder than the incident itself:

  • Scattered knowledge. Runbooks live in 4+ different systems. The Kubernetes rollback procedure is in GitHub. The database failover guide is in Confluence. The PagerDuty escalation matrix is in a Google Sheet.
  • Stale content. The deployment guide was last updated 8 months ago. Since then, the team migrated from Helm to ArgoCD. The runbook is worse than useless — it’s actively misleading.
  • Search doesn’t work. You remember a runbook exists about “that thing where the queue backs up.” Good luck finding it with keyword search when you don’t remember the exact title.
  • Context switching. During an incident, switching from terminal → browser → wiki → search → scroll → read adds 5-10 minutes of friction. Every time.

“Our MTTR was 45 minutes. Of that, 30 minutes was finding and reading the right runbook. Only 15 minutes was actually fixing the problem.”

The Solution: DevOps RAG

DevOps RAG is a retrieval-augmented generation system purpose-built for operational knowledge. It ingests your runbooks from Git repos, wikis, docs, or wherever else they live, splits them into chunks, embeds each chunk with OpenAI's text-embedding-3-small model, and makes the result queryable in natural language.
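To make the chunking step concrete, here is a minimal sketch that splits a Markdown runbook into one chunk per heading. The heading-based strategy, the `Chunk` type, and the `chunk_markdown` name are assumptions for illustration; the pipeline itself is only described as using semantic chunking.

```python
import re
from dataclasses import dataclass

FENCE = "`" * 3  # Markdown code-fence marker


@dataclass
class Chunk:
    source: str  # path of the runbook the chunk came from
    title: str   # heading text, useful later for citations
    text: str    # body of the section


def chunk_markdown(source: str, markdown: str) -> list[Chunk]:
    """Split a Markdown document into one chunk per heading (hypothetical strategy)."""
    chunks: list[Chunk] = []
    title, body = "Preamble", []
    in_fence = False

    def flush() -> None:
        # Only emit sections that contain some non-blank text.
        if any(part.strip() for part in body):
            chunks.append(Chunk(source, title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a code fence are never headings (shell comments start with '#').
        match = None if in_fence else re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()
            title, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Keeping the heading as the chunk title is what makes citations like "Section 3: Emergency Rollback Procedure" possible later on.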

The key difference from generic RAG: DevOps RAG has no UI. Its only interface is MCP (Model Context Protocol), which means any AI-powered coding environment — Claude Code, Cursor, Codex, OpenClaw — can query your operational knowledge without leaving the terminal.

Architecture: Git → Chunks → Embeddings → Answers

┌──────────────────────────────────────────────────────────┐
│                    Ingestion Pipeline                     │
│                                                           │
│  Git Repos ──┐                                            │
│  Markdown ───┤                                            │
│  Confluence ─┤──▶ Chunking ──▶ OpenAI ──▶ Vector Store   │
│  Google Docs ┤    (semantic)    Embeddings  (Pinecone)    │
│  Slack ──────┘                                            │
│                                                           │
│  Webhook: PR merged → re-index affected runbooks          │
└──────────────────────────────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────┐
│                    Query Pipeline                         │
│                                                           │
│  User Query ──▶ Embed ──▶ Similarity Search ──▶ Top-K    │
│                              (cosine, k=5)      Chunks   │
│                                                   │      │
│                                              LLM Context │
│                                                   │      │
│                                         Contextual Answer │
│                                         + Citations       │
└──────────────────────────────────────────────────────────┘
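The query half of the diagram boils down to "score every chunk against the query vector, keep the best k". The sketch below shows that idea as a brute-force search in plain Python. In the deployment described here the vector store does this work, so treat `cosine_similarity`, `top_k`, and the toy 3-dimensional vectors as illustrative assumptions rather than the actual implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 5):
    """Return the k (chunk_id, score) pairs most similar to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: real embeddings would have 1536 dimensions, not 3.
index = [
    ("rollback", [1.0, 0.0, 0.0]),
    ("failover", [0.0, 1.0, 0.0]),
    ("scaling", [1.0, 1.0, 0.0]),
]
results = top_k([1.0, 0.1, 0.0], index, k=2)
print([chunk_id for chunk_id, _ in results])
```

The chunks that come back are what gets passed to the LLM as context, along with their source paths for the citations.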

Two design decisions make this system work in practice:

  1. Git-native ingestion. The canonical copy of every runbook lives in Git, not a wiki. Every PR merge triggers re-indexing, so DevOps RAG always has the latest version. No more “this runbook is from 2023” surprises.
  2. MCP-first interface. The primary consumer of operational knowledge in 2026 isn’t a human with a browser — it’s an AI agent with a task. MCP exposes three tools that any compatible agent can call.

MCP Integration: Three Tools, Zero UI

DevOps RAG exposes exactly three MCP tools. This is intentional — fewer tools means agents use them correctly more often:

1. ask_devops — Ask a Question, Get an Answer

// MCP tool: ask_devops
{
  "question": "How do I rollback a failed Kubernetes deployment?",
  "top_k": 5,
  "verbose": true
}

// Response:
{
  "answer": "To rollback a failed Kubernetes deployment:\n\n1. Check current rollout status:\n   kubectl rollout status deployment/<name>\n\n2. Rollback to previous version:\n   kubectl rollout undo deployment/<name>\n\n3. Verify rollback succeeded:\n   kubectl rollout status deployment/<name>\n\n4. If specific revision needed:\n   kubectl rollout undo deployment/<name> --to-revision=<n>",
  "citations": [
    {
      "source": "runbooks/kubernetes-rollback.md",
      "chunk": "Section 3: Emergency Rollback Procedure",
      "relevance": 0.94
    },
    {
      "source": "runbooks/deployment-guide.md",
      "chunk": "Section 7: Rollback Strategies",
      "relevance": 0.87
    }
  ]
}
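On the consuming side, an agent or a small wrapper script might render that response for a human and drop weak citations. The sketch below assumes the response shape shown above; the `format_answer` helper and the 0.9 cutoff are hypothetical choices, not part of DevOps RAG.

```python
def format_answer(response: dict, min_relevance: float = 0.0) -> str:
    """Render an ask_devops-style response as plain text with its sources."""
    citations = [
        c for c in response.get("citations", [])
        if c["relevance"] >= min_relevance
    ]
    citations.sort(key=lambda c: c["relevance"], reverse=True)

    lines = [response["answer"], "", "Sources:"]
    if not citations:
        lines.append("- (no citation met the relevance threshold)")
    for c in citations:
        lines.append(f"- {c['source']} ({c['chunk']}, relevance {c['relevance']:.2f})")
    return "\n".join(lines)


# Shortened version of the response shown above.
response = {
    "answer": "kubectl rollout undo deployment/<name>",
    "citations": [
        {"source": "runbooks/kubernetes-rollback.md",
         "chunk": "Section 3: Emergency Rollback Procedure", "relevance": 0.94},
        {"source": "runbooks/deployment-guide.md",
         "chunk": "Section 7: Rollback Strategies", "relevance": 0.87},
    ],
}
print(format_answer(response, min_relevance=0.9))
```

Showing the source path next to every answer is what lets the on-call engineer open the runbook itself when something looks off.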

2. search_runbooks — Find Relevant Documents

// MCP tool: search_runbooks
{
  "topic": "database failover"
}

// Response:
{
  "runbooks": [
    {
      "source": "runbooks/postgres-failover.md",
      "title": "PostgreSQL Failover Procedure",
      "relevance": 0.92
    },
    {
      "source": "runbooks/rds-disaster-recovery.md",
      "title": "RDS Multi-AZ Failover",
      "relevance": 0.85
    }
  ]
}

3. list_sources — Inventory Your Knowledge Base

// MCP tool: list_sources
{}

// Response:
{
  "total_chunks": 45,
  "total_sources": 18,
  "sbom_components": 240,
  "sources": [
    "runbooks/kubernetes-rollback.md",
    "runbooks/postgres-failover.md",
    "runbooks/incident-escalation.md",
    "runbooks/deployment-guide.md",
    ...
  ]
}

Setup: 5 Minutes to Queryable Runbooks

Adding DevOps RAG to your coding environment takes one configuration block:

// Claude Code / Cursor / OpenClaw MCP config
{
  "mcpServers": {
    "devops-rag": {
      "command": "node",
      "args": ["/path/to/devops-rag-mcp/index.js"],
      "env": {
        "DEVOPS_RAG_URL": "https://devops-rag.avyay.ai",
        "DEVOPS_RAG_API_KEY": "your-api-key"
      }
    }
  }
}

Or deploy with Docker for self-hosted environments:

# Docker deployment
docker run -d \
  --name devops-rag \
  -p 8080:8080 \
  -e OPENAI_API_KEY=sk-... \
  -e PINECONE_API_KEY=... \
  -v "$(pwd)/runbooks:/app/runbooks" \
  ghcr.io/gaurav21/devops-rag:latest

# Ingest your runbooks
curl -X POST http://localhost:8080/api/ingest \
  -H "Content-Type: application/json" \
  -d '{"source_dir": "/app/runbooks"}'

# Query
curl http://localhost:8080/api/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I scale the worker pool?"}'
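If you would rather call the REST endpoint from a script than from curl, the same request can be built with the Python standard library. This is a sketch under assumptions: `build_ask_request` and `ask` are hypothetical helper names, and the response is assumed to be a JSON object like the `ask_devops` response shown earlier, which the curl examples do not confirm.

```python
import json
import urllib.request
from typing import Optional


def build_ask_request(
    base_url: str, question: str, api_key: Optional[str] = None
) -> urllib.request.Request:
    """Build the POST request for /api/ask without sending it."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Mirrors the Bearer token header used against the managed endpoint.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"question": question}).encode("utf-8")
    url = f"{base_url.rstrip('/')}/api/ask"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def ask(base_url: str, question: str, api_key: Optional[str] = None,
        timeout: float = 10.0) -> dict:
    """Send the question and decode the JSON reply (assumed response shape)."""
    request = build_ask_request(base_url, question, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the Docker container from above running, `ask("http://localhost:8080", "How do I scale the worker pool?")` would send the same request as the final curl command.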

Real Data: 18 Sources, 45 Chunks, Sub-Second Retrieval

Here’s what our production DevOps RAG instance looks like:

| Metric | Value |
| --- | --- |
| Total knowledge chunks | 45 |
| Source runbooks | 18 |
| SBOM components tracked | 240 |
| Average query latency | <800ms |
| Embedding model | OpenAI text-embedding-3-small |
| Vector dimensions | 1536 |
| Similarity metric | Cosine |
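A quick back-of-the-envelope calculation shows how small this index is. The only assumption is that each vector component is stored as a 4-byte float32; the chunk count and dimensions come from the figures above.

```python
CHUNKS = 45
DIMENSIONS = 1536
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as float32

total_floats = CHUNKS * DIMENSIONS
total_bytes = total_floats * BYTES_PER_FLOAT32

print(f"{total_floats} floats, {total_bytes / 1024:.0f} KiB")
# prints "69120 floats, 270 KiB"
```

At this size the raw vectors fit in a few hundred kilobytes, so the similarity search itself is unlikely to be the slow part of a query.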

Example: Incident at 2 AM

Here’s how an incident plays out with DevOps RAG vs without:

| Step | Without DevOps RAG | With DevOps RAG |
| --- | --- | --- |
| Alert fires | Open PagerDuty (2 min) | Open PagerDuty (2 min) |
| Find runbook | Search Confluence, Slack, GitHub (15 min) | ask_devops "queue backing up" (10 sec) |
| Read & understand | Scroll through 20-page doc (10 min) | Get specific steps with context (30 sec) |
| Execute fix | Follow (possibly outdated) steps (15 min) | Follow current, cited steps (12 min) |
| Total MTTR | ~42 min | ~15 min |

The biggest time savings aren’t in the fix itself — they’re in eliminating the search-and-read overhead. Instead of 25 minutes finding and parsing a runbook, the on-call engineer asks one question and gets actionable steps with citations in under a second.
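The totals in the comparison can be checked by adding up the per-step durations, converted to minutes. The dictionary keys are just labels for this sketch; the numbers are the ones from the incident walkthrough.

```python
# Per-step durations in minutes, taken from the comparison above.
without_rag = {"alert": 2.0, "find": 15.0, "read": 10.0, "fix": 15.0}
with_rag = {"alert": 2.0, "find": 10 / 60, "read": 30 / 60, "fix": 12.0}

total_without = sum(without_rag.values())
total_with = sum(with_rag.values())

# Search-and-read overhead is the "find" and "read" steps combined.
overhead_without = without_rag["find"] + without_rag["read"]
overhead_with = with_rag["find"] + with_rag["read"]

print(f"without: {total_without:.1f} min total, {overhead_without:.1f} min overhead")
print(f"with:    {total_with:.1f} min total, {overhead_with:.1f} min overhead")
```

The fix step only shrinks from 15 to 12 minutes; nearly all of the difference comes from the 25 minutes of searching and reading dropping to well under a minute.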


Always Up to Date: Git-Native Ingestion

The #1 failure mode of internal knowledge systems is stale content. DevOps RAG solves this by treating runbooks as code:

# GitHub Actions workflow: re-index when runbook changes are merged to main
# .github/workflows/reindex-runbooks.yml
name: Re-index Runbooks

on:
  push:
    branches: [main]
    paths:
      - 'runbooks/**'

jobs:
  reindex:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Trigger re-indexing
        run: |
          # --fail makes the step fail when the API returns an HTTP error
          curl --fail -X POST https://devops-rag.avyay.ai/api/ingest \
            -H "Authorization: Bearer ${{ secrets.DEVOPS_RAG_KEY }}" \
            -H "Content-Type: application/json" \
            -d '{"source_dir": "runbooks/", "force": true}'

When an engineer updates a runbook (via PR, reviewed and merged like any other code change), this workflow triggers re-indexing automatically. The vector embeddings are refreshed within minutes. No manual sync. No “remember to update the wiki.”
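The workflow above re-ingests the whole `runbooks/` directory. If you wanted to re-embed only the runbooks that actually changed, one common approach is to compare content hashes against the previous run. The sketch below is a hypothetical illustration of that idea, not how DevOps RAG is known to work; `hash_runbooks` and `diff_manifests` are made-up names.

```python
import hashlib
from pathlib import Path


def hash_runbooks(root: Path) -> dict[str, str]:
    """Map each Markdown file under root to the SHA-256 of its contents."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*.md"))
    }


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (changed_or_added, removed) paths between two manifests."""
    changed = sorted(path for path, digest in new.items() if old.get(path) != digest)
    removed = sorted(path for path in old if path not in new)
    return changed, removed
```

Changed files would be re-chunked and re-embedded, and chunks belonging to removed files would be deleted from the vector store so stale procedures stop showing up in answers.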


The Results: 67% MTTR Reduction

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Mean time to resolution | 45 min | 15 min | -67% |
| Time finding runbooks | 25 min | <1 min | -96% |
| Runbook coverage | ~60% (unknown gaps) | 100% (auditable) | +40 pts |
| Runbook freshness | Months (manual updates) | Minutes (auto-reindex) | Real-time |
| Knowledge accessibility | Browser + search | Terminal + natural language | |
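For anyone checking the percentages, they follow from the before and after values with the usual relative-change formula. Treating "<1 min" as an upper bound of 1 minute is an assumption made for this calculation.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


mttr_change = percent_change(45, 15)      # mean time to resolution
find_change = percent_change(25, 1)       # "<1 min" taken as 1 min (assumption)
coverage_points = 100 - 60                # coverage change in percentage points

print(f"MTTR: {mttr_change:.0f}%")
print(f"Finding runbooks: {find_change:.0f}%")
print(f"Coverage: +{coverage_points} percentage points")
```

Coverage is reported in percentage points because it is a difference between two percentages rather than a relative change.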

The Compound Effect

The MTTR improvement is the headline number, but the real value compounds over time:

  • New engineers onboard faster. Instead of “go read the wiki,” they ask questions and get answers. The learning curve for operational knowledge drops from weeks to days.
  • Runbooks actually get written. When runbooks are consumed by AI (not humans scrolling), there’s less pressure for perfect formatting and more emphasis on accurate content. Engineers write more because the friction is lower.
  • Agents handle routine incidents. When MĀRGA detects a provider outage, the on-call agent queries DevOps RAG for the failover procedure and executes it autonomously. Human intervention is needed only for novel incidents.
  • SBOM tracking comes free. With 240 components tracked, DevOps RAG also serves as an inventory of your software supply chain — queryable with the same natural language interface.

Get Started with DevOps RAG

# Option 1: Docker (self-hosted)
docker pull ghcr.io/gaurav21/devops-rag:latest
docker run -p 8080:8080 \
  -e OPENAI_API_KEY=sk-... \
  -v "$(pwd)/runbooks:/app/runbooks" \
  ghcr.io/gaurav21/devops-rag:latest

# Option 2: MCP server (for Claude Code / Cursor)
npm install -g @avyay/devops-rag-mcp

# Option 3: REST API
curl https://devops-rag.avyay.ai/api/ask \
  -H "Authorization: Bearer your-key" \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I rollback a deployment?"}'

  • MCP Server: Available for Claude Code, Cursor, and OpenClaw
  • Documentation: docs.avyay.ai/devops-rag
  • REST API: Docker deployment or managed endpoint

Gaurav Sharma is the founder of Avyay (अव्यय). DevOps RAG is part of the Avyay platform’s operational intelligence layer. Read the full architecture at avyay.ai/blog/avyay-architecture.
