- Autonomous Software Architecture — Beyond Traditional Programming
- Self-Healing Systems — When Code Fixes Itself (You are here)
- Adaptive Algorithms — AI That Improves AI
- Scaling Autonomous Systems — Lessons from 300+ Auto-Builds
Most software treats failure like an exception. Something goes wrong, an alert fires, a human investigates, a fix is deployed. The recovery loop is measured in minutes at best, hours typically, days often.
Self-healing systems treat failure like a feature. Not something to be avoided, but something to be designed for. The system expects things to break. The architecture assumes components will fail. And the recovery path is automated, tested, and continuously improved — by the system itself.
At Avyay, our autonomous build engine runs 24/7 across distributed Mac nodes connected via Tailscale. Nodes go offline. Network connections drop. Generated code has bugs. LLM APIs return garbage. Memory leaks accumulate. All of these happen regularly. None of them stop the system.
Here's how we built self-healing into every layer.
The Five Layers of Self-Healing
Self-healing isn't a single mechanism. It's a stack of increasingly sophisticated recovery strategies, each handling a different failure class:
| Layer | Failure Class | Recovery Strategy | Recovery Time |
|---|---|---|---|
| L1: Retry | Transient errors (network, API rate limits) | Exponential backoff with jitter | 1-30 seconds |
| L2: Redirect | Node failure, resource exhaustion | Reroute task to healthy node | 10-60 seconds |
| L3: Rewrite | Code generation failure, test failures | Analyze error, regenerate with modified strategy | 2-15 minutes |
| L4: Restructure | Systemic failure, repeated pattern | Modify task decomposition, change approach | 15-60 minutes |
| L5: Escalate | Unknown failure, safety boundary | Alert human, reduce autonomy, preserve state | Human-dependent |
The layers are tried in order. Most failures resolve at L1 or L2. The system only reaches L5 about 2% of the time (see the production numbers below), and that percentage is shrinking as the feedback loop captures more patterns.
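As a rough illustration of "tried in order", the dispatcher below walks the layers and stops at the first one that recovers. It is a minimal sketch, not our production code: `Layer`, `LayerOutcome`, `LayerHandler` and `runRecovery` are hypothetical names, and the optional `escalateTo` hint is an assumption that models a layer asking to skip ahead (for example L1 handing a deterministic error straight to L3).

```typescript
// Illustrative only: Layer, LayerOutcome, LayerHandler and runRecovery are
// hypothetical names, not the production API.
type Layer = "L1" | "L2" | "L3" | "L4" | "L5";

interface LayerOutcome {
  recovered: boolean;
  escalateTo?: Layer; // optional hint to skip ahead, e.g. L1 -> L3
}

type LayerHandler = (failure: Error) => Promise<LayerOutcome>;

const LAYER_ORDER: Layer[] = ["L1", "L2", "L3", "L4", "L5"];

// Walk the layers in order and stop at the first one that recovers.
async function runRecovery(
  failure: Error,
  handlers: Record<Layer, LayerHandler>,
): Promise<{ resolvedAt: Layer | null; trail: Layer[] }> {
  const trail: Layer[] = [];
  let index = 0;
  while (index < LAYER_ORDER.length) {
    const layer = LAYER_ORDER[index];
    const outcome = await handlers[layer](failure);
    trail.push(layer);
    if (outcome.recovered) {
      return { resolvedAt: layer, trail };
    }
    const requested = outcome.escalateTo
      ? LAYER_ORDER.indexOf(outcome.escalateTo)
      : -1;
    // Only ever move forward, so a handler cannot send us in circles.
    index = requested > index ? requested : index + 1;
  }
  return { resolvedAt: null, trail };
}
```

The forward-only index is the one design choice worth noting: a misbehaving handler can skip layers but can never loop back to a cheaper one.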
Layer 1: Intelligent Retry — Not Just "Try Again"
Every distributed system has retries. Most implement them badly. The standard approach — exponential backoff with a fixed multiplier — ignores why the failure happened. Our retry layer analyzes the error before deciding how to retry.
interface RetryContext {
error: Error;
attempt: number;
elapsed: number;
errorHistory: ErrorSignature[]; // Recent errors of this type
}
class IntelligentRetry {
async shouldRetry(ctx: RetryContext): Promise<RetryDecision> {
const classification = this.classifyError(ctx.error);
switch (classification) {
case 'rate_limit':
// Parse Retry-After header if present
const retryAfter = this.parseRetryAfter(ctx.error);
return {
retry: true,
delay: retryAfter || this.exponentialDelay(ctx.attempt),
strategy: 'wait_and_retry',
note: 'Rate limited — respecting server backoff',
};
case 'transient_network':
// Check if this node has had multiple network errors recently
const nodeErrorRate = this.recentErrorRate(ctx, '5m');
if (nodeErrorRate > 0.3) {
return {
retry: false,
escalate: 'L2', // Node might be unhealthy — redirect
note: `Node error rate ${(nodeErrorRate * 100).toFixed(0)}% — escalating`,
};
}
return {
retry: true,
delay: Math.min(1000 * Math.pow(2, ctx.attempt) + jitter(500), 30000),
strategy: 'exponential_backoff',
};
case 'llm_context_overflow':
// Don't retry with same context — truncate and retry
return {
retry: true,
delay: 0,
strategy: 'modify_input',
modification: 'truncate_context_50_percent',
note: 'Context too large — retrying with truncated input',
};
case 'llm_refusal':
// Model refused the prompt — reframe, don't retry verbatim
return {
retry: true,
delay: 0,
strategy: 'reframe_prompt',
note: 'Model refused — reframing task description',
};
case 'deterministic_error':
// This will fail every time — don't waste retries
return {
retry: false,
escalate: 'L3', // Needs code rewrite, not retry
note: 'Deterministic failure — escalating to rewrite',
};
default:
return ctx.attempt < 3
? { retry: true, delay: 2000 * ctx.attempt, strategy: 'generic_backoff' }
: { retry: false, escalate: 'L2' };
}
}
private classifyError(error: Error): string {
// Pattern matching on error signatures
if (error.message.includes('429') || error.message.includes('rate limit'))
return 'rate_limit';
if (error.message.includes('ECONNRESET') || error.message.includes('ETIMEDOUT'))
return 'transient_network';
if (error.message.includes('context_length') || error.message.includes('max_tokens'))
return 'llm_context_overflow';
if (error.message.includes('content_policy') || error.message.includes('refused'))
return 'llm_refusal';
if (error.message.includes('SyntaxError') || error.message.includes('TypeError'))
return 'deterministic_error';
return 'unknown';
}
}
The key difference from standard retries: the retry strategy changes based on the error classification. A rate limit gets patient waiting. A context overflow gets input truncation. A deterministic error skips retries entirely and escalates. This alone reduced our wasted retry attempts by 67%.
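The class above calls `jitter`, `exponentialDelay` and `parseRetryAfter` without showing them. The sketch below is one plausible shape for those helpers, not the production versions. The signatures are assumptions: in particular, this `parseRetryAfter` takes the raw header value rather than the error object, and the injectable `random` and `nowMs` parameters exist only to make the functions testable.

```typescript
// Hypothetical helper implementations; signatures are assumptions.

// Random delay in [0, maxMs) so simultaneous retries do not line up.
function jitter(maxMs: number, random: () => number = Math.random): number {
  return Math.floor(random() * maxMs);
}

// baseMs, 2x, 4x, ... plus up to 500 ms of jitter, capped at capMs.
function exponentialDelay(
  attempt: number,
  baseMs = 1000,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  return Math.min(baseMs * 2 ** attempt + jitter(500, random), capMs);
}

// Retry-After is either a number of seconds or an HTTP date.
// Returns the wait in milliseconds, or undefined if it cannot be parsed.
function parseRetryAfter(
  headerValue: string | undefined,
  nowMs: number = Date.now(),
): number | undefined {
  if (!headerValue) return undefined;
  const trimmed = headerValue.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return undefined;
  return Math.max(dateMs - nowMs, 0);
}
```

Returning `undefined` for an unparseable header lets the caller fall back to `exponentialDelay`, which matches the `retryAfter || this.exponentialDelay(ctx.attempt)` pattern in the class.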
Layer 2: Task Redirection — The Hot Swap
When a node goes down mid-task, the traditional approach is to wait for it to come back or alert someone. Our approach: detect the failure, assess what was lost, and reroute to a healthy node within 60 seconds.
// Node health monitoring — continuous heartbeat
class NodeHealthMonitor {
private readonly HEARTBEAT_INTERVAL = 10_000; // 10 seconds
private readonly FAILURE_THRESHOLD = 3; // 3 missed heartbeats = dead
async onHeartbeatMissed(nodeId: string, missedCount: number) {
if (missedCount < this.FAILURE_THRESHOLD) {
// Might be a blip — don't overreact
this.metrics.record('heartbeat_miss', { nodeId, count: missedCount });
return;
}
// Node is probably down. Begin failover.
const activeTasks = await this.getActiveTasks(nodeId);
for (const task of activeTasks) {
// Assess recoverability
const checkpoint = await this.getLastCheckpoint(task.id);
const progressLost = this.estimateProgressLoss(task, checkpoint);
if (progressLost < 0.2) {
// Less than 20% progress lost — resume from checkpoint
await this.redirectTask(task, {
strategy: 'resume_from_checkpoint',
checkpoint: checkpoint.id,
targetNode: await this.selectHealthiestNode(task),
});
} else if (progressLost < 0.5) {
// 20-50% lost — restart with warm context
await this.redirectTask(task, {
strategy: 'warm_restart',
context: await this.buildWarmContext(task, checkpoint),
targetNode: await this.selectHealthiestNode(task),
});
} else {
// More than 50% lost — full restart but with lessons learned
await this.redirectTask(task, {
strategy: 'full_restart_with_hints',
hints: await this.extractHints(task, checkpoint),
targetNode: await this.selectHealthiestNode(task),
});
}
}
// Mark node as unhealthy with recovery window
await this.markUnhealthy(nodeId, {
reason: 'heartbeat_timeout',
recoveryCheck: Date.now() + 5 * 60_000, // Re-check in 5 minutes
});
}
}
The "warm restart" strategy is particularly effective. When a task is 30% complete and the node dies, we don't start from scratch. We extract the partial context (which files were generated, which decisions were made, which tests passed) and provide it to the new node. It's like handing off a partially completed puzzle with a note about which pieces you've already placed.
Real-world impact: in the last month, we've had 23 node failures. Average recovery time: 34 seconds. Zero required human intervention.
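Two helpers carry most of the weight in the failover path: `estimateProgressLoss` and `selectHealthiestNode`. Neither is shown above, so here is a minimal sketch of how they might work. Everything in it is an assumption for illustration: the `TaskProgress` and `NodeStatus` shapes, the definition of loss as un-checkpointed steps over total steps, and the load score that weights CPU and queue depth.

```typescript
// Hypothetical shapes and scoring; the real monitor may differ.
interface TaskProgress {
  totalSteps: number;
  completedSteps: number; // steps finished when the node died
  checkpointedSteps: number; // steps captured by the last checkpoint
}

interface NodeStatus {
  id: string;
  healthy: boolean;
  cpuLoad: number; // 0..1
  activeTasks: number;
}

// Fraction of the whole task that was done but never checkpointed.
function estimateProgressLoss(p: TaskProgress): number {
  if (p.totalSteps <= 0) return 0;
  const lost = Math.max(p.completedSteps - p.checkpointedSteps, 0);
  return Math.min(lost / p.totalSteps, 1);
}

// Pick the healthy node with the lowest load score (lower is better).
function selectHealthiestNode(nodes: NodeStatus[]): NodeStatus | undefined {
  const score = (n: NodeStatus): number => n.cpuLoad + 0.1 * n.activeTasks;
  let best: NodeStatus | undefined;
  for (const node of nodes) {
    if (!node.healthy) continue;
    if (best === undefined || score(node) < score(best)) best = node;
  }
  return best;
}
```

Under this definition, frequent checkpoints directly shrink the loss estimate, which pushes more failovers into the cheap `resume_from_checkpoint` branch.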
Layer 3: The Self-Rewriting Build
This is where it gets interesting. When generated code fails tests, the system doesn't just retry the generation. It diagnoses the failure and rewrites with a corrective strategy.
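To make "rewrites with a corrective strategy" concrete, the sketch below turns categorized test failures into explicit MUST constraints and appends them to the regeneration prompt. It is an assumption-heavy illustration: the `CONSTRAINT_HINTS` table and the `deriveConstraints` and `buildRegenerationPrompt` functions are hypothetical, and a real diagnosis step would presumably derive constraints from the failure details rather than from a fixed lookup.

```typescript
// Hypothetical category-to-constraint mapping, for illustration only.
type FailureCategory =
  | "concurrency"
  | "missing_implementation"
  | "integration"
  | "logic_error";

interface CategorizedFailure {
  test: string;
  category: FailureCategory;
}

const CONSTRAINT_HINTS: Record<FailureCategory, string> = {
  concurrency: "MUST use atomic operations for shared state",
  missing_implementation: "MUST implement every symbol the tests reference",
  integration: "MUST handle unavailable dependencies with a fallback",
  logic_error: "MUST match the exact behaviour the assertions describe",
};

// One constraint per failing category, listing the tests it should fix.
function deriveConstraints(failures: CategorizedFailure[]): string[] {
  const categories = Object.keys(CONSTRAINT_HINTS) as FailureCategory[];
  return categories
    .map((category) => ({
      category,
      tests: failures.filter((f) => f.category === category).map((f) => f.test),
    }))
    .filter((group) => group.tests.length > 0)
    .map((group) => `${CONSTRAINT_HINTS[group.category]} (failing: ${group.tests.join(", ")})`);
}

// Append the constraints to the original task description.
function buildRegenerationPrompt(task: string, constraints: string[]): string {
  return [`Task: ${task}`, "Constraints:", ...constraints.map((c) => `- ${c}`)].join("\n");
}
```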
Here's a real example from our build log:
BUILD LOG — Task T-1847: "Add rate limiting to MĀRGA /route endpoint"
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[03:17:22] Attempt 1: Generated rate limiter using in-memory token bucket
[03:17:45] Running tests... 3/7 passed
[03:17:46] FAILURES:
  ✗ test_concurrent_requests: Race condition — counter not atomic
  ✗ test_distributed_rate_limit: Each instance has separate counter
  ✗ test_persistence_across_restarts: In-memory state lost
  ✗ test_redis_fallback: Redis client not initialized
[03:17:47] L3 Diagnosis:
  Root cause: In-memory implementation doesn't meet distributed requirements
  Required: Distributed rate limiter with atomic operations
  Missing: Redis integration, atomic counters, fallback handling
  Confidence: 0.89
[03:17:48] Attempt 2: Regenerating with constraints:
  - MUST use Redis for distributed state
  - MUST use INCR for atomic counting
  - MUST handle Redis unavailability (degrade to in-memory)
  - MUST include TTL-based window expiration
[03:18:14] Running tests... 6/7 passed
[03:18:15] FAILURES:
  ✗ test_sliding_window_accuracy: Fixed window, not sliding window
[03:18:16] L3 Diagnosis (incremental):
  Root cause: Implementation uses fixed window, tests expect sliding window
  Delta: Replace fixed window with sliding window log algorithm
  Confidence: 0.94
[03:18:17] Attempt 3: Patching (not full regeneration — preserving working code)
  - Replacing window implementation only
  - Keeping Redis integration, fallback, TTL handling
[03:18:38] Running tests... 7/7 passed ✓
[03:18:39] Quality score: 0.87
[03:18:40] Deploying to staging...
Total time: 78 seconds (3 attempts, 2 rewrites)
Human involvement: none
Notice the progression: Attempt 1 fails broadly. Attempt 2 fixes the fundamental approach but misses a detail. Attempt 3 patches rather than regenerating from scratch, preserving the working code and fixing only what's broken. The system learns to be surgical, not scorched-earth.
The diagnosis engine that drives this:
class FailureDiagnostics {
async diagnose(task: Task, testResults: TestResult[]): Promise<Diagnosis> {
const failures = testResults.filter(t => !t.passed);
// Classify failures by type
const classified = failures.map(f => ({
test: f.name,
error: f.error,
category: this.categorize(f),
// Is this a design problem or an implementation problem?
scope: this.assessScope(f, task),
}));
// Check if failures share a root cause
const clusters = this.clusterFailures(classified);
if (clusters.length === 1) {
// All failures stem from one root cause — fix it
return {
rootCause: clusters[0].commonCause,
strategy: 'targeted_fix',
constraints: this.deriveConstraints(clusters[0]),
preserveWorking: true,
confidence: 0.9,
};
}
if (clusters.length <= 3) {
// Multiple independent issues — iterative fixing
return {
rootCause: 'multiple_independent_issues',
strategy: 'regenerate_with_constraints',
constraints: clusters.flatMap(c => this.deriveConstraints(c)),
preserveWorking: false, // Too many issues to patch
confidence: 0.75,
};
}
// Widespread failure — fundamental approach is wrong
return {
rootCause: 'architectural_mismatch',
strategy: 'full_regeneration',
constraints: this.deriveArchitecturalConstraints(classified),
preserveWorking: false,
confidence: 0.6,
escalateIf: 'still_failing_after_regeneration',
};
}
private categorize(failure: TestResult): string {
const msg = failure.error.toLowerCase();
if (msg.includes('race') || msg.includes('concurrent') || msg.includes('atomic'))
return 'concurrency';
if (msg.includes('not found') || msg.includes('undefined') || msg.includes('null'))
return 'missing_implementation';
if (msg.includes('timeout') || msg.includes('connection'))
return 'integration';
if (msg.includes('assert') || msg.includes('expected'))
return 'logic_error';
return 'unknown';
}
}
Layer 4: Structural Adaptation
Sometimes the problem isn't the code — it's the task. If a task fails three times at L3 (code rewrite), the system escalates to L4: it questions whether the task itself is well-defined.
Here's a real example. A task was defined as: "Add comprehensive logging to RAKṢĀ scan pipeline." It failed three times because "comprehensive logging" is ambiguous. Each regeneration added different logging, and the tests (which expected specific log formats) kept failing.
L4 intervention:
[04:42:17] L4 Structural Adaptation triggered for T-1923
Original task: "Add comprehensive logging to RAKṢĀ scan pipeline"
L3 attempts: 3 (all failed)
Failure pattern: Ambiguous specification → inconsistent implementation
L4 action: DECOMPOSE
Splitting into subtasks with explicit acceptance criteria:
T-1923a: "Add structured JSON logging to RAKṢĀ scan initiation"
Acceptance: Log entry with {scanId, target, ruleCount, timestamp}
Format: JSON to stdout
T-1923b: "Add progress logging to RAKṢĀ rule evaluation loop"
Acceptance: Log entry per rule with {scanId, ruleId, result, durationMs}
Format: JSON to stdout, max 1 log per rule
T-1923c: "Add summary logging to RAKṢĀ scan completion"
Acceptance: Log entry with {scanId, totalRules, passed, failed, duration}
Format: JSON to stdout
T-1923d: "Add error logging with context to RAKṢĀ failure paths"
Acceptance: Log entry with {scanId, error, stackTrace, lastSuccessfulStep}
Format: JSON to stderr
Result: All 4 subtasks completed on first attempt (L1 success)
Total time: 12 minutes
Lesson captured: "Avoid 'comprehensive' or 'complete' in task descriptions —
always specify exact outputs and formats"
The lesson capture is critical. The system doesn't just fix the immediate problem; it also updates its task generation heuristics to avoid creating ambiguous tasks in the future. This is how the system's first-attempt success rate climbed from 54% to 72% over six weeks.
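One simple way a captured lesson could feed back into task generation is as a lint rule that runs over every new task description. The sketch below assumes exactly that. `Lesson` and `LessonStore` are hypothetical names, and matching with a regular expression is a deliberate simplification of whatever heuristics the real system uses.

```typescript
// Hypothetical lesson store: each lesson is a pattern plus advice.
interface Lesson {
  pattern: RegExp; // must not use the "g" flag, since test() would become stateful
  advice: string;
}

class LessonStore {
  private readonly lessons: Lesson[] = [];

  capture(lesson: Lesson): void {
    this.lessons.push(lesson);
  }

  // Return the advice of every lesson the description violates.
  review(taskDescription: string): string[] {
    return this.lessons
      .filter((lesson) => lesson.pattern.test(taskDescription))
      .map((lesson) => lesson.advice);
  }
}

const store = new LessonStore();
store.capture({
  pattern: /\b(comprehensive|complete)\b/i,
  advice: "Specify exact outputs and formats instead of 'comprehensive' or 'complete'",
});

// The original T-1923 wording is flagged; the T-1923a wording is not.
const flagged = store.review("Add comprehensive logging to RAKṢĀ scan pipeline");
const clean = store.review("Add structured JSON logging to RAKṢĀ scan initiation");
```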
Circuit Breakers That Evolve
Traditional circuit breakers have three states: closed (normal), open (failing, reject all), half-open (testing recovery). The thresholds are set by a human at deployment time and never change.
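For reference, a conventional fixed-threshold breaker can be sketched in a few dozen lines. This is a generic illustration of the three-state pattern, not our implementation: `BasicCircuitBreaker` and its parameters are hypothetical, and the injectable `now` clock is there only so the behaviour can be tested without waiting.

```typescript
type State = "closed" | "open" | "half-open";

// Generic three-state breaker with thresholds fixed at construction time.
class BasicCircuitBreaker {
  private state: State = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly recoveryTimeoutMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold = 5,
    recoveryTimeoutMs = 30_000,
    now: () => number = Date.now,
  ) {
    this.failureThreshold = failureThreshold;
    this.recoveryTimeoutMs = recoveryTimeoutMs;
    this.now = now;
  }

  // An open circuit becomes half-open once the recovery timeout has elapsed.
  getState(): State {
    if (this.state === "open" && this.now() - this.openedAt >= this.recoveryTimeoutMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    const state = this.getState();
    if (state === "open") throw new Error("circuit open");
    try {
      const result = await fn();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed probe in half-open re-opens immediately.
      if (state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```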
Our circuit breakers have dynamic thresholds that adjust based on observed behavior:
class AdaptiveCircuitBreaker {
private failureThreshold: number;
private recoveryTimeout: number;
private readonly history: CircuitEvent[] = [];
constructor(
private readonly name: string,
initialThreshold = 5,
initialRecoveryMs = 30_000,
) {
this.failureThreshold = initialThreshold;
this.recoveryTimeout = initialRecoveryMs;
}
async onStateChange(from: State, to: State, context: CircuitContext) {
this.history.push({ from, to, timestamp: Date.now(), context });
if (to === 'open') {
// Circuit just opened — analyze why
const recentTrips = this.history
.filter(e => e.to === 'open')
.slice(-10);
if (recentTrips.length >= 3) {
const avgTimeBetweenTrips = this.averageInterval(recentTrips);
if (avgTimeBetweenTrips < 5 * 60_000) {
// Tripping every <5 minutes — something is fundamentally wrong
// Increase recovery timeout to stop thrashing
this.recoveryTimeout = Math.min(
this.recoveryTimeout * 2,
5 * 60_000 // Cap at 5 minutes
);
this.emit('threshold_adjusted', {
reason: 'rapid_tripping',
newRecoveryTimeout: this.recoveryTimeout,
});
}
}
}
if (from === 'half-open' && to === 'closed') {
// Successful recovery: how long did the circuit actually stay open?
const lastOpenDuration = this.lastStateDuration('open');
if (lastOpenDuration < this.recoveryTimeout * 0.3) {
// Recovered much faster than expected — lower threshold
this.failureThreshold = Math.max(
this.failureThreshold - 1,
2 // Never go below 2
);
this.emit('threshold_adjusted', {
reason: 'fast_recovery',
newThreshold: this.failureThreshold,
});
}
}
if (from === 'half-open' && to === 'open') {
// Recovery attempt failed — increase threshold
this.failureThreshold = Math.min(
this.failureThreshold + 1,
15 // Cap at 15
);
this.recoveryTimeout = Math.min(
this.recoveryTimeout * 1.5,
5 * 60_000
);
}
}
}
The result: our circuit breakers' mean time to recover has decreased by 41% over three months because the thresholds have tuned themselves to the actual failure patterns of each service.
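The class relies on an `averageInterval` helper that isn't shown. One plausible reading, sketched below under that assumption, is the mean gap between consecutive trip timestamps. `nextRecoveryTimeout` is a hypothetical pure-function restatement of the doubling-with-cap rule from the class, pulled out so the rule can be checked in isolation.

```typescript
interface TimedEvent {
  timestamp: number; // milliseconds since epoch
}

// Mean gap between consecutive events; Infinity when there are fewer than two.
function averageInterval(events: TimedEvent[]): number {
  if (events.length < 2) return Infinity;
  const times = events.map((e) => e.timestamp).sort((a, b) => a - b);
  let total = 0;
  for (let i = 1; i < times.length; i++) {
    total += times[i] - times[i - 1];
  }
  return total / (times.length - 1);
}

// Double the recovery timeout, capped at five minutes.
function nextRecoveryTimeout(currentMs: number, capMs = 5 * 60_000): number {
  return Math.min(currentMs * 2, capMs);
}
```

Returning `Infinity` for a single event means the "tripping every <5 minutes" comparison is simply false until there is enough history to judge.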
Health Checks That Write Themselves
One of our more unexpected innovations: the system generates new health checks from the failure patterns it observes.
When the system encounters a failure mode it hasn't seen before, it adds a new health check to catch that failure earlier next time:
// After a memory leak crashed node-2 at 03:17
// The system generated this health check automatically:
{
"check_id": "auto-hc-2847",
"generated_from": "incident-2026-05-14-0317",
"description": "Memory growth rate check — detect leaks before OOM",
"target": "all_nodes",
"interval": "60s",
"check": {
"type": "rate_of_change",
"metric": "process.memory.rss",
"window": "10m",
"threshold": {
"warn": "50MB_per_10min", // Memory growing fast
"critical": "200MB_per_10min" // Almost certainly a leak
}
},
"action_on_warn": "increase_gc_frequency",
"action_on_critical": "drain_and_restart_node",
"auto_generated": true,
"confidence": 0.82,
"approved_by": "system", // Below risk threshold for human approval
"effective_since": "2026-05-14T03:45:00Z"
}
We started with 12 hand-written health checks. The system has since generated 34 additional ones. Of those 34, we reviewed and kept 28 (82% useful). The 6 we removed were either redundant or had false positive rates above 10%.
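A `rate_of_change` check like the one above could be evaluated roughly as follows. This is a sketch under stated assumptions: the `Sample` shape and `evaluateMemoryGrowth` function are hypothetical, growth is measured as last-minus-first RSS within the window, and the 50 MB and 200 MB defaults mirror the thresholds in the generated check.

```typescript
interface Sample {
  timestampMs: number;
  rssBytes: number;
}

type Severity = "ok" | "warn" | "critical";

const MB = 1024 * 1024;

// Compare RSS growth inside the trailing window against the thresholds.
function evaluateMemoryGrowth(
  samples: Sample[],
  nowMs: number,
  windowMs = 10 * 60_000,
  warnMB = 50,
  criticalMB = 200,
): Severity {
  const inWindow = samples
    .filter((s) => s.timestampMs >= nowMs - windowMs && s.timestampMs <= nowMs)
    .sort((a, b) => a.timestampMs - b.timestampMs);
  if (inWindow.length < 2) return "ok"; // not enough data to judge
  const growthMB = (inWindow[inWindow.length - 1].rssBytes - inWindow[0].rssBytes) / MB;
  if (growthMB >= criticalMB) return "critical";
  if (growthMB >= warnMB) return "warn";
  return "ok";
}
```

Measuring first-to-last growth is the simplest choice; a production check would more likely fit a slope across all samples so that one garbage-collection dip cannot mask a leak.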
The Recovery Budget: When Healing Costs Too Much
Self-healing sounds great until you realize it has a cost. Each retry consumes compute. Each code regeneration burns LLM tokens. Each task redirect delays downstream work. Without limits, a self-healing system can spend more resources recovering from failures than doing actual work.
We implement a recovery budget — a time and cost cap on how much the system will spend trying to fix a single failure:
interface RecoveryBudget {
  maxAttempts: number; // Total attempts across all layers
  maxTimeMs: number; // Wall-clock time limit
  maxTokenCost: number; // LLM token budget for regeneration
  maxCascadeDepth: number; // How many downstream tasks can be affected
}
// Dynamic adjustment based on task value.
// Written as a standalone function because a TypeScript interface
// cannot contain a method body.
function adjustedBudget(base: RecoveryBudget, task: Task): RecoveryBudget {
  const importance = task.priority / 10; // 0-1 scale
  const downstream = task.dependents.length;
  const multiplier = 1 + (importance * 0.5) + (downstream * 0.3);
  return {
    maxAttempts: Math.ceil(base.maxAttempts * multiplier),
    maxTimeMs: Math.ceil(base.maxTimeMs * multiplier),
    maxTokenCost: Math.ceil(base.maxTokenCost * multiplier),
    maxCascadeDepth: base.maxCascadeDepth, // Safety limit: never adjust
  };
}
// Default budgets by layer
const DEFAULT_BUDGETS: Record<string, RecoveryBudget> = {
L1: { maxAttempts: 5, maxTimeMs: 60_000, maxTokenCost: 0, maxCascadeDepth: 0 },
L2: { maxAttempts: 3, maxTimeMs: 120_000, maxTokenCost: 0, maxCascadeDepth: 2 },
L3: { maxAttempts: 3, maxTimeMs: 900_000, maxTokenCost: 50_000, maxCascadeDepth: 3 },
L4: { maxAttempts: 2, maxTimeMs: 3_600_000, maxTokenCost: 200_000, maxCascadeDepth: 5 },
};
The recovery budget has saved us from runaway healing loops twice. In one case, a misconfigured test fixture caused every code generation to fail, and without the budget, the system would have burned through hundreds of dollars in LLM tokens retrying a fundamentally impossible task.
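Enforcing the budget comes down to comparing running usage against the caps before each new attempt. The sketch below shows one way to do that. `RecoveryUsage`, `budgetViolations` and `nextStep` are hypothetical names, and the choice of `>=` for attempts and time versus `>` for tokens and cascade depth is an assumption (it lets a zero-token budget such as L1's pass while nothing has been spent).

```typescript
interface RecoveryBudget {
  maxAttempts: number;
  maxTimeMs: number;
  maxTokenCost: number;
  maxCascadeDepth: number;
}

// Hypothetical running totals for one failure's recovery so far.
interface RecoveryUsage {
  attempts: number;
  elapsedMs: number;
  tokensSpent: number;
  cascadeDepth: number;
}

// List every cap that would be broken by starting another attempt.
function budgetViolations(budget: RecoveryBudget, usage: RecoveryUsage): string[] {
  const violations: string[] = [];
  if (usage.attempts >= budget.maxAttempts) violations.push("attempts");
  if (usage.elapsedMs >= budget.maxTimeMs) violations.push("time");
  if (usage.tokensSpent > budget.maxTokenCost) violations.push("tokens");
  if (usage.cascadeDepth > budget.maxCascadeDepth) violations.push("cascade");
  return violations;
}

// Retry only while every cap still has headroom; otherwise hand off upward.
function nextStep(budget: RecoveryBudget, usage: RecoveryUsage): "retry" | "escalate" {
  return budgetViolations(budget, usage).length === 0 ? "retry" : "escalate";
}
```

Returning the list of violated caps, rather than a bare boolean, gives the escalation message something specific to report.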
Production Numbers: Self-Healing in Practice
| Metric | Value | Trend |
|---|---|---|
| Failures auto-healed (last 30 days) | 847 | — |
| Healed at L1 (retry) | 62% | ↑ from 48% (better error classification) |
| Healed at L2 (redirect) | 18% | ↓ from 24% (fewer node failures) |
| Healed at L3 (rewrite) | 14% | Stable |
| Healed at L4 (restructure) | 4% | ↓ from 8% (better task specs) |
| Escalated to human (L5) | 2% | ↓ from 12% in month 1 |
| Mean time to recovery (auto) | 47 seconds | ↓ from 3.2 minutes |
| Recovery budget exceeded | 0.8% | Stable (correctly escalated) |
| Auto-generated health checks | 34 (28 kept) | 82% retention rate |
The most important number: human escalations dropped from 12% to 2% over six weeks. That's the self-healing system learning to handle edge cases that previously required us to wake up at 3 AM.
The Philosophical Shift
Building self-healing systems changed how we think about software reliability. The old model: prevent all failures. The new model: make failure cheap.
When recovery is fast and automatic, the calculus changes. You stop over-engineering prevention and start investing in detection and recovery. You stop writing defensive code that handles every edge case and start writing adaptive code that handles the common case and heals through the rest.
The most reliable systems aren't the ones that never fail. They're the ones that fail so gracefully you never notice.
In Part 3: Adaptive Algorithms, we'll explore how these recovery mechanisms feed into a system that improves its own performance over time — AI that literally improves AI.
All examples are from production systems running at Avyay as of May 2026. Build logs have been lightly edited for clarity. The self-healing framework is part of our build engine and will be open-sourced as a standalone library later this year.