Self-Healing Systems — When Code Fixes Itself — Avyay AI

📚 Series: Building Autonomous Software

Autonomous Software Architecture — Beyond Traditional Programming
Self-Healing Systems — When Code Fixes Itself (You are here)
Adaptive Algorithms — AI That Improves AI
Scaling Autonomous Systems — Lessons from 300+ Auto-Builds

Most software treats failure like an exception. Something goes wrong, an alert fires, a human investigates, a fix is deployed. The recovery loop is measured in minutes at best, hours typically, days often.

Self-healing systems treat failure like a feature. Not something to be avoided, but something to be designed for. The system expects things to break. The architecture assumes components will fail. And the recovery path is automated, tested, and continuously improved — by the system itself.

At Avyay, our autonomous build engine runs 24/7 across distributed Mac nodes connected via Tailscale. Nodes go offline. Network connections drop. Generated code has bugs. LLM APIs return garbage. Memory leaks accumulate. All of these happen regularly. None of them stop the system.

Here's how we built self-healing into every layer.

The Five Layers of Self-Healing

Self-healing isn't a single mechanism. It's a stack of increasingly sophisticated recovery strategies, each handling a different failure class:

Layer	Failure Class	Recovery Strategy	Recovery Time
L1: Retry	Transient errors (network, API rate limits)	Exponential backoff with jitter	1-30 seconds
L2: Redirect	Node failure, resource exhaustion	Reroute task to healthy node	10-60 seconds
L3: Rewrite	Code generation failure, test failures	Analyze error, regenerate with modified strategy	2-15 minutes
L4: Restructure	Systemic failure, repeated pattern	Modify task decomposition, change approach	15-60 minutes
L5: Escalate	Unknown failure, safety boundary	Alert human, reduce autonomy, preserve state	Human-dependent

The layers are tried in order. Most failures resolve at L1 or L2. Over the last 30 days the system reached L5 for only about 2% of failures, and that percentage is shrinking as the feedback loop captures more patterns.
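
The "tried in order" rule can be sketched as a simple ladder. This is a minimal illustration, not the production dispatcher: the `Layer` shape, the `heal` function, and the choice to treat a throwing layer as "could not recover" are all assumptions, and the handlers are synchronous only to keep the example short.

```typescript
// Hypothetical shapes for illustration; not the production interfaces.
type LayerName = 'L1' | 'L2' | 'L3' | 'L4' | 'L5';

interface Layer {
  name: LayerName;
  // Returns true if this layer recovered the failure.
  // A real handler would almost certainly be async.
  tryRecover(failure: Error): boolean;
}

interface HealResult {
  recovered: boolean;
  resolvedAt: LayerName | null;
  tried: LayerName[];
}

// Try each layer in order and stop at the first one that recovers.
function heal(failure: Error, layers: Layer[]): HealResult {
  const tried: LayerName[] = [];
  for (const layer of layers) {
    tried.push(layer.name);
    try {
      if (layer.tryRecover(failure)) {
        return { recovered: true, resolvedAt: layer.name, tried };
      }
    } catch {
      // Assumption: a layer that throws simply could not recover; move on.
    }
  }
  return { recovered: false, resolvedAt: null, tried };
}
```

Recording which layers were tried is what makes per-layer statistics possible later on.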

Layer 1: Intelligent Retry — Not Just "Try Again"

Every distributed system has retries. Most implement them badly. The standard approach — exponential backoff with a fixed multiplier — ignores why the failure happened. Our retry layer analyzes the error before deciding how to retry.

interface RetryContext {
  error: Error;
  attempt: number;
  elapsed: number;
  errorHistory: ErrorSignature[];  // Recent errors of this type
}

class IntelligentRetry {
  async shouldRetry(ctx: RetryContext): Promise<RetryDecision> {
    const classification = this.classifyError(ctx.error);
    
    switch (classification) {
      case 'rate_limit': {
        // Parse Retry-After header if present
        const retryAfter = this.parseRetryAfter(ctx.error);
        return {
          retry: true,
          delay: retryAfter || this.exponentialDelay(ctx.attempt),
          strategy: 'wait_and_retry',
          note: 'Rate limited — respecting server backoff',
        };
      }
        
      case 'transient_network': {
        // Check if this node has had multiple network errors recently
        const nodeErrorRate = this.recentErrorRate(ctx, '5m');
        if (nodeErrorRate > 0.3) {
          return {
            retry: false,
            escalate: 'L2',  // Node might be unhealthy, so redirect
            note: `Node error rate ${(nodeErrorRate * 100).toFixed(0)}% — escalating`,
          };
        }
        return {
          retry: true,
          delay: Math.min(1000 * Math.pow(2, ctx.attempt) + jitter(500), 30000),
          strategy: 'exponential_backoff',
        };
      }
        
      case 'llm_context_overflow':
        // Don't retry with same context — truncate and retry
        return {
          retry: true,
          delay: 0,
          strategy: 'modify_input',
          modification: 'truncate_context_50_percent',
          note: 'Context too large — retrying with truncated input',
        };
        
      case 'llm_refusal':
        // Model refused the prompt — reframe, don't retry verbatim
        return {
          retry: true,
          delay: 0,
          strategy: 'reframe_prompt',
          note: 'Model refused — reframing task description',
        };
        
      case 'deterministic_error':
        // This will fail every time — don't waste retries
        return {
          retry: false,
          escalate: 'L3',  // Needs code rewrite, not retry
          note: 'Deterministic failure — escalating to rewrite',
        };
        
      default:
        return ctx.attempt < 3
          ? { retry: true, delay: 2000 * ctx.attempt, strategy: 'generic_backoff' }
          : { retry: false, escalate: 'L2' };
    }
  }
  
  private classifyError(error: Error): string {
    // Pattern matching on error signatures
    if (error.message.includes('429') || error.message.includes('rate limit'))
      return 'rate_limit';
    if (error.message.includes('ECONNRESET') || error.message.includes('ETIMEDOUT'))
      return 'transient_network';
    if (error.message.includes('context_length') || error.message.includes('max_tokens'))
      return 'llm_context_overflow';
    if (error.message.includes('content_policy') || error.message.includes('refused'))
      return 'llm_refusal';
    if (error.message.includes('SyntaxError') || error.message.includes('TypeError'))
      return 'deterministic_error';
    return 'unknown';
  }
}

The key difference from standard retries: the retry strategy changes based on the error classification. A rate limit gets patient waiting. A context overflow gets input truncation. A deterministic error skips retries entirely and escalates. This alone reduced our wasted retry attempts by 67%.
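
The class above calls `jitter`, `exponentialDelay`, and `parseRetryAfter` without showing them. The sketch below is one plausible shape for those helpers, not the production code. The signatures are assumptions: here `parseRetryAfter` takes the raw header value rather than the `Error`, and the random source and clock are injectable so the functions can be tested deterministically.

```typescript
// Hypothetical helpers; signatures are assumptions, not the production code.

// Random jitter in [0, maxMs). `rand` is injectable for deterministic tests.
function jitter(maxMs: number, rand: () => number = Math.random): number {
  return Math.floor(rand() * maxMs);
}

// Exponential backoff: base * 2^attempt plus up to 500 ms of jitter, capped.
function exponentialDelay(
  attempt: number,
  baseMs = 1000,
  capMs = 30_000,
  rand: () => number = Math.random,
): number {
  return Math.min(baseMs * Math.pow(2, attempt) + jitter(500, rand), capMs);
}

// A Retry-After header holds either a number of seconds or an HTTP date.
// Returns the delay in milliseconds, or undefined if the value is unusable.
function parseRetryAfter(
  value: string | undefined,
  nowMs: number = Date.now(),
): number | undefined {
  if (value === undefined) return undefined;
  const trimmed = value.trim();
  if (trimmed === '') return undefined;
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return undefined;
  return Math.max(0, dateMs - nowMs);  // A date in the past means "retry now"
}
```

Because this version returns `undefined` for a missing header, a caller could combine the two with `parseRetryAfter(header) ?? exponentialDelay(attempt)`, so that a legitimate `Retry-After: 0` is not mistaken for an absent header.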

Layer 2: Task Redirection — The Hot Swap

When a node goes down mid-task, the traditional approach is to wait for it to come back or alert someone. Our approach: detect the failure, assess what was lost, and reroute to a healthy node within 60 seconds.

// Node health monitoring — continuous heartbeat
class NodeHealthMonitor {
  private readonly HEARTBEAT_INTERVAL = 10_000;  // 10 seconds
  private readonly FAILURE_THRESHOLD = 3;         // 3 missed heartbeats = dead
  
  async onHeartbeatMissed(nodeId: string, missedCount: number) {
    if (missedCount < this.FAILURE_THRESHOLD) {
      // Might be a blip — don't overreact
      this.metrics.record('heartbeat_miss', { nodeId, count: missedCount });
      return;
    }
    
    // Node is probably down. Begin failover.
    const activeTasks = await this.getActiveTasks(nodeId);
    
    for (const task of activeTasks) {
      // Assess recoverability
      const checkpoint = await this.getLastCheckpoint(task.id);
      const progressLost = this.estimateProgressLoss(task, checkpoint);
      
      if (progressLost < 0.2) {
        // Less than 20% progress lost — resume from checkpoint
        await this.redirectTask(task, {
          strategy: 'resume_from_checkpoint',
          checkpoint: checkpoint.id,
          targetNode: await this.selectHealthiestNode(task),
        });
      } else if (progressLost < 0.5) {
        // 20-50% lost — restart with warm context
        await this.redirectTask(task, {
          strategy: 'warm_restart',
          context: await this.buildWarmContext(task, checkpoint),
          targetNode: await this.selectHealthiestNode(task),
        });
      } else {
        // More than 50% lost — full restart but with lessons learned
        await this.redirectTask(task, {
          strategy: 'full_restart_with_hints',
          hints: await this.extractHints(task, checkpoint),
          targetNode: await this.selectHealthiestNode(task),
        });
      }
    }
    
    // Mark node as unhealthy with recovery window
    await this.markUnhealthy(nodeId, {
      reason: 'heartbeat_timeout',
      recoveryCheck: Date.now() + 5 * 60_000,  // Re-check in 5 minutes
    });
  }
}

The "warm restart" strategy is particularly effective. When a task is 30% complete and the node dies, we don't start from scratch. We extract the partial context — what files were generated, what decisions were made, what tests passed — and provide that context to the new node. It's like handing off a partially-completed puzzle with a note about which pieces you've already placed.

Real-world impact: in the last month, we've had 23 node failures. Average recovery time: 34 seconds. Zero required human intervention.
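
To make the three redirect strategies concrete, here is a minimal sketch of how progress loss might be estimated and how a warm context could be assembled. Everything about the `Checkpoint` shape is an assumption, as is the definition of "progress lost" as the fraction of total planned steps completed after the last checkpoint. Only the 20% and 50% cut-offs and the strategy names come from the code above.

```typescript
// Hypothetical shapes; field names are assumptions for illustration.
interface Checkpoint {
  id: string;
  stepsCompleted: number;
  filesGenerated: string[];
  decisions: string[];
  testsPassed: string[];
}

type RedirectStrategy =
  | 'resume_from_checkpoint'
  | 'warm_restart'
  | 'full_restart_with_hints';

// Assumed definition: steps done since the last checkpoint / total planned steps.
function estimateProgressLoss(
  stepsAtFailure: number,
  checkpoint: Checkpoint | null,
  totalSteps: number,
): number {
  if (totalSteps <= 0) return 0;
  const saved = checkpoint ? checkpoint.stepsCompleted : 0;
  const lost = Math.max(0, stepsAtFailure - saved);
  return Math.min(1, lost / totalSteps);
}

// Same thresholds as the failover code above.
function pickStrategy(progressLost: number): RedirectStrategy {
  if (progressLost < 0.2) return 'resume_from_checkpoint';
  if (progressLost < 0.5) return 'warm_restart';
  return 'full_restart_with_hints';
}

// The "note about which pieces you've already placed".
function buildWarmContext(checkpoint: Checkpoint): string {
  return [
    `Resuming after checkpoint ${checkpoint.id}.`,
    `Files already generated: ${checkpoint.filesGenerated.join(', ') || 'none'}`,
    `Decisions already made: ${checkpoint.decisions.join('; ') || 'none'}`,
    `Tests already passing: ${checkpoint.testsPassed.join(', ') || 'none'}`,
  ].join('\n');
}
```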

Layer 3: The Self-Rewriting Build

This is where it gets interesting. When generated code fails tests, the system doesn't just retry the generation. It diagnoses the failure and rewrites with a corrective strategy.

Here's a real example from our build log:

BUILD LOG — Task T-1847: "Add rate limiting to MĀRGA /route endpoint"
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[03:17:22] Attempt 1: Generated rate limiter using in-memory token bucket
[03:17:45] Running tests... 3/7 passed
[03:17:46] FAILURES:
  ✗ test_concurrent_requests: Race condition — counter not atomic
  ✗ test_distributed_rate_limit: Each instance has separate counter
  ✗ test_persistence_across_restarts: In-memory state lost
  ✗ test_redis_fallback: Redis client not initialized

[03:17:47] L3 Diagnosis:
  Root cause: In-memory implementation doesn't meet distributed requirements
  Required: Distributed rate limiter with atomic operations
  Missing: Redis integration, atomic counters, fallback handling
  Confidence: 0.89

[03:17:48] Attempt 2: Regenerating with constraints:
  - MUST use Redis for distributed state
  - MUST use INCR for atomic counting
  - MUST handle Redis unavailability (degrade to in-memory)
  - MUST include TTL-based window expiration

[03:18:14] Running tests... 6/7 passed
[03:18:15] FAILURES:
  ✗ test_sliding_window_accuracy: Fixed window, not sliding window

[03:18:16] L3 Diagnosis (incremental):
  Root cause: Implementation uses fixed window, tests expect sliding window
  Delta: Replace fixed window with sliding window log algorithm
  Confidence: 0.94

[03:18:17] Attempt 3: Patching (not full regeneration — preserving working code)
  - Replacing window implementation only
  - Keeping Redis integration, fallback, TTL handling

[03:18:38] Running tests... 7/7 passed ✓
[03:18:39] Quality score: 0.87
[03:18:40] Deploying to staging...

Total time: 78 seconds (3 attempts, 2 rewrites)
Human involvement: none

Notice the progression: Attempt 1 fails broadly. Attempt 2 fixes the fundamental approach but misses a detail. Attempt 3 patches rather than regenerating from scratch, preserving the working code and fixing only what's broken. The system learns to be surgical, not scorched-earth.

The diagnosis engine that drives this:

class FailureDiagnostics {
  async diagnose(task: Task, testResults: TestResult[]): Promise<Diagnosis> {
    const failures = testResults.filter(t => !t.passed);
    
    // Classify failures by type
    const classified = failures.map(f => ({
      test: f.name,
      error: f.error,
      category: this.categorize(f),
      // Is this a design problem or an implementation problem?
      scope: this.assessScope(f, task),
    }));
    
    // Check if failures share a root cause
    const clusters = this.clusterFailures(classified);
    
    if (clusters.length === 1) {
      // All failures stem from one root cause — fix it
      return {
        rootCause: clusters[0].commonCause,
        strategy: 'targeted_fix',
        constraints: this.deriveConstraints(clusters[0]),
        preserveWorking: true,
        confidence: 0.9,
      };
    }
    
    if (clusters.length <= 3) {
      // Multiple independent issues — iterative fixing
      return {
        rootCause: 'multiple_independent_issues',
        strategy: 'regenerate_with_constraints',
        constraints: clusters.flatMap(c => this.deriveConstraints(c)),
        preserveWorking: false,  // Too many issues to patch
        confidence: 0.75,
      };
    }
    
    // Widespread failure — fundamental approach is wrong
    return {
      rootCause: 'architectural_mismatch',
      strategy: 'full_regeneration',
      constraints: this.deriveArchitecturalConstraints(classified),
      preserveWorking: false,
      confidence: 0.6,
      escalateIf: 'still_failing_after_regeneration',
    };
  }
  
  private categorize(failure: TestResult): string {
    const msg = failure.error.toLowerCase();
    if (msg.includes('race') || msg.includes('concurrent') || msg.includes('atomic'))
      return 'concurrency';
    if (msg.includes('not found') || msg.includes('undefined') || msg.includes('null'))
      return 'missing_implementation';
    if (msg.includes('timeout') || msg.includes('connection'))
      return 'integration';
    if (msg.includes('assert') || msg.includes('expected'))
      return 'logic_error';
    return 'unknown';
  }
}
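
The `clusterFailures` step is not shown above. As a rough sketch, assuming that failures sharing a category share a root cause (a much cruder rule than a real diagnosis engine would likely use), it could look like the following. The `chooseStrategy` function restates the cluster-count thresholds from `diagnose` and adds a guard for the zero-failure case, which the thresholds alone would route to regeneration.

```typescript
// Hypothetical shapes mirroring the fields used in the class above.
interface ClassifiedFailure {
  test: string;
  error: string;
  category: string;
}

interface FailureCluster {
  commonCause: string;
  failures: ClassifiedFailure[];
}

type FixStrategy =
  | 'none'
  | 'targeted_fix'
  | 'regenerate_with_constraints'
  | 'full_regeneration';

// Assumption: one category = one root cause. Groups keep first-seen order.
function clusterFailures(failures: ClassifiedFailure[]): FailureCluster[] {
  const byCategory = new Map<string, ClassifiedFailure[]>();
  for (const failure of failures) {
    const group = byCategory.get(failure.category) ?? [];
    group.push(failure);
    byCategory.set(failure.category, group);
  }
  return Array.from(byCategory, ([commonCause, grouped]) => ({
    commonCause,
    failures: grouped,
  }));
}

// Same thresholds as diagnose(), plus an explicit empty case.
function chooseStrategy(clusterCount: number): FixStrategy {
  if (clusterCount === 0) return 'none';
  if (clusterCount === 1) return 'targeted_fix';
  if (clusterCount <= 3) return 'regenerate_with_constraints';
  return 'full_regeneration';
}
```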

Layer 4: Structural Adaptation

Sometimes the problem isn't the code — it's the task. If a task fails three times at L3 (code rewrite), the system escalates to L4: it questions whether the task itself is well-defined.

Here's a real example. A task was defined as: "Add comprehensive logging to RAKṢĀ scan pipeline." It failed three times because "comprehensive logging" is ambiguous. Each regeneration added different logging, and the tests (which expected specific log formats) kept failing.

L4 intervention:

[04:42:17] L4 Structural Adaptation triggered for T-1923
  
  Original task: "Add comprehensive logging to RAKṢĀ scan pipeline"
  L3 attempts: 3 (all failed)
  Failure pattern: Ambiguous specification → inconsistent implementation
  
  L4 action: DECOMPOSE
  Splitting into subtasks with explicit acceptance criteria:
  
  T-1923a: "Add structured JSON logging to RAKṢĀ scan initiation"
    Acceptance: Log entry with {scanId, target, ruleCount, timestamp}
    Format: JSON to stdout
    
  T-1923b: "Add progress logging to RAKṢĀ rule evaluation loop"
    Acceptance: Log entry per rule with {scanId, ruleId, result, durationMs}
    Format: JSON to stdout, max 1 log per rule
    
  T-1923c: "Add summary logging to RAKṢĀ scan completion"
    Acceptance: Log entry with {scanId, totalRules, passed, failed, duration}
    Format: JSON to stdout
    
  T-1923d: "Add error logging with context to RAKṢĀ failure paths"
    Acceptance: Log entry with {scanId, error, stackTrace, lastSuccessfulStep}
    Format: JSON to stderr
  
  Result: All 4 subtasks completed on first attempt (L1 success)
  Total time: 12 minutes
  Lesson captured: "Avoid 'comprehensive' or 'complete' in task descriptions —
    always specify exact outputs and formats"

The lesson capture is critical. The system doesn't just fix the immediate problem — it updates its task generation heuristics to avoid creating ambiguous tasks in the future. This is how the system's first-attempt success rate climbed from 54% to 72% over six weeks.
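
A captured lesson like the one in the log could be enforced as a lint on new task specs. The sketch below is hypothetical: the `TaskSpec` shape and `lintTaskSpec` are invented for illustration, and the vague-term list contains only the two words named in the lesson.

```typescript
// Hypothetical task spec; field names are assumptions.
interface TaskSpec {
  description: string;
  acceptance?: string;  // e.g. "Log entry with {scanId, target, ruleCount, timestamp}"
  format?: string;      // e.g. "JSON to stdout"
}

interface SpecIssue {
  kind: 'vague_term' | 'missing_acceptance' | 'missing_format';
  detail: string;
}

function isBlank(value: string | undefined): boolean {
  return value === undefined || value.trim() === '';
}

// Flags vague wording and missing acceptance criteria or output format.
function lintTaskSpec(
  spec: TaskSpec,
  vagueTerms: string[] = ['comprehensive', 'complete'],
): SpecIssue[] {
  const issues: SpecIssue[] = [];
  // Whole-word match, so "completion" does not trigger "complete".
  const words = spec.description.toLowerCase().split(/[^a-z]+/);
  for (const term of vagueTerms) {
    if (words.indexOf(term) !== -1) {
      issues.push({ kind: 'vague_term', detail: term });
    }
  }
  if (isBlank(spec.acceptance)) {
    issues.push({ kind: 'missing_acceptance', detail: 'no acceptance criteria' });
  }
  if (isBlank(spec.format)) {
    issues.push({ kind: 'missing_format', detail: 'no output format' });
  }
  return issues;
}
```

Run against the original T-1923 description, this flags the vague term plus the missing acceptance criteria and format. The decomposed subtasks pass cleanly.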

Circuit Breakers That Evolve

Traditional circuit breakers have three states: closed (normal), open (failing, reject all), half-open (testing recovery). The thresholds are set by a human at deployment time and never change.
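
For reference, a fixed-threshold breaker of the traditional kind can be sketched in a few lines. This is a generic baseline, not code from our engine. The class name, method names, and defaults are illustrative, and the clock is injectable so the state transitions can be tested without waiting.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

// Classic three-state breaker with thresholds fixed at construction time.
class BasicCircuitBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,
    private readonly recoveryTimeoutMs = 30_000,
    private readonly now: () => number = Date.now,
  ) {}

  // Whether a call may go through right now.
  allowRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.recoveryTimeoutMs) {
      this.state = 'half-open';  // Timeout elapsed: allow probe requests
    }
    return this.state !== 'open';
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe reopens immediately; otherwise wait for the threshold.
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```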

Our circuit breakers have dynamic thresholds that adjust based on observed behavior:

import { EventEmitter } from 'node:events';

class AdaptiveCircuitBreaker extends EventEmitter {
  private failureThreshold: number;
  private recoveryTimeout: number;
  private readonly history: CircuitEvent[] = [];
  
  constructor(
    private readonly name: string,
    initialThreshold = 5,
    initialRecoveryMs = 30_000,
  ) {
    super();  // emit() below comes from EventEmitter
    this.failureThreshold = initialThreshold;
    this.recoveryTimeout = initialRecoveryMs;
  }
  
  async onStateChange(from: State, to: State, context: CircuitContext) {
    this.history.push({ from, to, timestamp: Date.now(), context });
    
    if (to === 'open') {
      // Circuit just opened — analyze why
      const recentTrips = this.history
        .filter(e => e.to === 'open')
        .slice(-10);
      
      if (recentTrips.length >= 3) {
        const avgTimeBetweenTrips = this.averageInterval(recentTrips);
        
        if (avgTimeBetweenTrips < 5 * 60_000) {
          // Tripping every <5 minutes — something is fundamentally wrong
          // Increase recovery timeout to stop thrashing
          this.recoveryTimeout = Math.min(
            this.recoveryTimeout * 2,
            5 * 60_000  // Cap at 5 minutes
          );
          this.emit('threshold_adjusted', {
            reason: 'rapid_tripping',
            newRecoveryTimeout: this.recoveryTimeout,
          });
        }
      }
    }
    
    if (from === 'half-open' && to === 'closed') {
      // Successful recovery: if it was quick, tripping is cheap, so we can trip sooner
      const lastOpenDuration = this.lastStateDuration('open');
      
      if (lastOpenDuration < this.recoveryTimeout * 0.3) {
        // Recovered much faster than expected — lower threshold
        this.failureThreshold = Math.max(
          this.failureThreshold - 1,
          2  // Never go below 2
        );
        this.emit('threshold_adjusted', {
          reason: 'fast_recovery',
          newThreshold: this.failureThreshold,
        });
      }
    }
    
    if (from === 'half-open' && to === 'open') {
      // Recovery attempt failed — increase threshold
      this.failureThreshold = Math.min(
        this.failureThreshold + 1,
        15  // Cap at 15
      );
      this.recoveryTimeout = Math.min(
        this.recoveryTimeout * 1.5,
        5 * 60_000
      );
    }
  }
}

The result: our circuit breakers' mean time to recover has decreased by 41% over three months because the thresholds have tuned themselves to the actual failure patterns of each service.
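
The adaptive breaker relies on an `averageInterval` helper that is not shown. A minimal sketch of what it might compute follows, together with the timeout-growth rule pulled out as a pure function. Both function shapes are assumptions. Only the doubling factor and the 5-minute cap come from the class above.

```typescript
interface TripEvent {
  timestamp: number;  // ms since epoch
}

// Mean gap in ms between consecutive events.
// Returns Infinity for fewer than two events, so "rapid tripping" is never inferred.
function averageInterval(events: TripEvent[]): number {
  if (events.length < 2) return Infinity;
  const stamps = events.map(e => e.timestamp);
  const span = Math.max(...stamps) - Math.min(...stamps);
  return span / (stamps.length - 1);
}

// Grow the recovery timeout by `factor`, never exceeding `capMs`.
function nextRecoveryTimeout(
  currentMs: number,
  factor = 2,
  capMs = 5 * 60_000,
): number {
  return Math.min(currentMs * factor, capMs);
}
```

Starting from the default 30 seconds, repeated rapid trips would step the timeout through 60, 120, and 240 seconds before it settles at the 300-second cap.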

Health Checks That Write Themselves

One of our more unexpected innovations: the system writes its own new health checks from the failure patterns it observes.

When the system encounters a failure mode it hasn't seen before, it adds a new health check to catch that failure earlier next time:

// After a memory leak crashed node-2 at 03:17
// The system generated this health check automatically:

{
  "check_id": "auto-hc-2847",
  "generated_from": "incident-2026-05-14-0317",
  "description": "Memory growth rate check — detect leaks before OOM",
  "target": "all_nodes",
  "interval": "60s",
  "check": {
    "type": "rate_of_change",
    "metric": "process.memory.rss",
    "window": "10m",
    "threshold": {
      "warn": "50MB_per_10min",  // Memory growing fast
      "critical": "200MB_per_10min"  // Almost certainly a leak
    }
  },
  "action_on_warn": "increase_gc_frequency",
  "action_on_critical": "drain_and_restart_node",
  "auto_generated": true,
  "confidence": 0.82,
  "approved_by": "system",  // Below risk threshold for human approval
  "effective_since": "2026-05-14T03:45:00Z"
}

We started with 12 hand-written health checks. The system has since generated 34 additional ones. Of those 34, we reviewed and kept 28 (82% useful). The 6 we removed were either redundant or had false positive rates above 10%.
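
To show how a `rate_of_change` check like the one above could be evaluated, here is a small sketch. The sample shape, the function names, and the choice to treat the thresholds as inclusive are assumptions. The 10-minute window and the 50 MB and 200 MB thresholds are taken from the generated check.

```typescript
interface MemorySample {
  timestampMs: number;
  rssBytes: number;
}

type CheckLevel = 'ok' | 'warn' | 'critical';

const MB = 1024 * 1024;
const TEN_MINUTES_MS = 10 * 60_000;

// Memory growth in MB, normalized to a per-10-minute rate, using only
// samples that fall inside the window ending at the newest sample.
function growthPer10Min(samples: MemorySample[], windowMs = TEN_MINUTES_MS): number {
  if (samples.length < 2) return 0;
  const latest = samples.reduce((a, b) => (b.timestampMs > a.timestampMs ? b : a));
  const inWindow = samples.filter(s => latest.timestampMs - s.timestampMs <= windowMs);
  const earliest = inWindow.reduce((a, b) => (b.timestampMs < a.timestampMs ? b : a));
  const elapsedMs = latest.timestampMs - earliest.timestampMs;
  if (elapsedMs <= 0) return 0;
  const grownMb = (latest.rssBytes - earliest.rssBytes) / MB;
  return grownMb * (TEN_MINUTES_MS / elapsedMs);
}

// Assumption: thresholds are inclusive (>=).
function evaluateMemoryCheck(
  samples: MemorySample[],
  warnMb = 50,
  criticalMb = 200,
): CheckLevel {
  const rate = growthPer10Min(samples);
  if (rate >= criticalMb) return 'critical';
  if (rate >= warnMb) return 'warn';
  return 'ok';
}
```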

The Recovery Budget: When Healing Costs Too Much

Self-healing sounds great until you realize it has a cost. Each retry consumes compute. Each code regeneration burns LLM tokens. Each task redirect delays downstream work. Without limits, a self-healing system can spend more resources recovering from failures than doing actual work.

We implement a recovery budget — a time and cost cap on how much the system will spend trying to fix a single failure:

interface RecoveryBudget {
  maxAttempts: number;           // Attempts allowed within this layer
  maxTimeMs: number;             // Wall-clock time limit
  maxTokenCost: number;          // LLM token budget for regeneration
  maxCascadeDepth: number;       // How many downstream tasks can be affected
}

// Dynamic adjustment based on task value.
// Kept outside the interface so plain object literals satisfy RecoveryBudget.
function adjustedBudget(base: RecoveryBudget, task: Task): RecoveryBudget {
  const importance = task.priority / 10;  // 0-1 scale
  const downstream = task.dependents.length;
  const multiplier = 1 + (importance * 0.5) + (downstream * 0.3);
  
  return {
    maxAttempts: Math.ceil(base.maxAttempts * multiplier),
    maxTimeMs: Math.ceil(base.maxTimeMs * multiplier),
    maxTokenCost: Math.ceil(base.maxTokenCost * multiplier),
    maxCascadeDepth: base.maxCascadeDepth,  // Safety limit, never adjusted
  };
}

// Default budgets by layer
const DEFAULT_BUDGETS: Record<string, RecoveryBudget> = {
  L1: { maxAttempts: 5, maxTimeMs: 60_000, maxTokenCost: 0, maxCascadeDepth: 0 },
  L2: { maxAttempts: 3, maxTimeMs: 120_000, maxTokenCost: 0, maxCascadeDepth: 2 },
  L3: { maxAttempts: 3, maxTimeMs: 900_000, maxTokenCost: 50_000, maxCascadeDepth: 3 },
  L4: { maxAttempts: 2, maxTimeMs: 3_600_000, maxTokenCost: 200_000, maxCascadeDepth: 5 },
};

The recovery budget has saved us from runaway healing loops twice. In one case, a misconfigured test fixture caused every code generation to fail, and without the budget, the system would have burned through hundreds of dollars in LLM tokens retrying a fundamentally impossible task.
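
Enforcement is the part that stops a runaway loop: before each attempt, compare what has been spent against the budget. The sketch below is a hypothetical illustration of that check. The `Spend` shape, the verdict names, and the exact comparison operators are assumptions. The multiplier formula is the one from the budget code above.

```typescript
interface Budget {
  maxAttempts: number;
  maxTimeMs: number;
  maxTokenCost: number;
}

// What recovery has consumed so far for one failure (hypothetical shape).
interface Spend {
  attempts: number;
  elapsedMs: number;
  tokens: number;
}

type BudgetVerdict =
  | 'continue'
  | 'attempts_exhausted'
  | 'time_exhausted'
  | 'tokens_exhausted';

// Decide whether another recovery attempt is allowed.
function checkBudget(budget: Budget, spend: Spend): BudgetVerdict {
  if (spend.attempts >= budget.maxAttempts) return 'attempts_exhausted';
  if (spend.elapsedMs >= budget.maxTimeMs) return 'time_exhausted';
  // Strictly greater, so layers with a zero token budget can still run
  // as long as they spend no tokens.
  if (spend.tokens > budget.maxTokenCost) return 'tokens_exhausted';
  return 'continue';
}

// Same formula as the budget adjustment above.
function budgetMultiplier(priority: number, dependents: number): number {
  return 1 + (priority / 10) * 0.5 + dependents * 0.3;
}
```

As a worked example, a priority-10 task with two dependents gets a multiplier of 1 + 0.5 + 0.6 = 2.1, so the L3 default of 3 attempts becomes `Math.ceil(3 * 2.1)`, which is 7.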

Production Numbers: Self-Healing in Practice

Metric	Value	Trend
Failures auto-healed (last 30 days)	847	—
Healed at L1 (retry)	62%	↑ from 48% (better error classification)
Healed at L2 (redirect)	18%	↓ from 24% (fewer node failures)
Healed at L3 (rewrite)	14%	Stable
Healed at L4 (restructure)	4%	↓ from 8% (better task specs)
Escalated to human (L5)	2%	↓ from 12% in month 1
Mean time to recovery (auto)	47 seconds	↓ from 3.2 minutes
Recovery budget exceeded	0.8%	Stable (correctly escalated)
Auto-generated health checks	34 (28 kept)	82% retention rate

The most important number: human escalations dropped from 12% to 2% over six weeks. That's the self-healing system learning to handle edge cases that previously required us to wake up at 3 AM.

The Philosophical Shift

Building self-healing systems changed how we think about software reliability. The old model: prevent all failures. The new model: make failure cheap.

When recovery is fast and automatic, the calculus changes. You stop over-engineering prevention and start investing in detection and recovery. You stop writing defensive code that handles every edge case and start writing adaptive code that handles the common case and heals through the rest.

The most reliable systems aren't the ones that never fail. They're the ones that fail so gracefully you never notice.

In Part 3: Adaptive Algorithms, we'll explore how these recovery mechanisms feed into a system that improves its own performance over time — AI that literally improves AI.

All examples are from production systems running at Avyay as of May 2026. Build logs have been lightly edited for clarity. The self-healing framework is part of our build engine and will be open-sourced as a standalone library later this year.
