- Autonomous Software Architecture — Beyond Traditional Programming
- Self-Healing Systems — When Code Fixes Itself
- Adaptive Algorithms — AI That Improves AI (You are here)
- Scaling Autonomous Systems — Lessons from 300+ Auto-Builds
Most AI systems are static. You train a model, deploy it, and it performs at exactly the same level until someone trains a new version. The model doesn't learn from the requests it processes. It doesn't adapt to changing conditions. It doesn't get better at the specific workload it's handling. It's frozen at deployment time.
This is a profound waste. Every request your system handles is a learning opportunity. Every success tells you what works. Every failure tells you what doesn't. But traditional architectures throw this information away — they process the request and move on.
At Avyay, we build systems that capture every signal and use it to improve in real time. MĀRGA's LLM router gets smarter with every API call. Our build engine generates better tasks every week. RAKṢĀ's scanner learns which code patterns are actually dangerous vs. merely suspicious.
This isn't machine learning in the traditional sense — we're not training neural networks on production data. It's something more practical: algorithmic adaptation through feedback signals. Systems that tune their own parameters, adjust their own strategies, and rewrite their own heuristics based on observed outcomes.
Case Study: MĀRGA's Self-Optimizing Router
MĀRGA routes LLM requests across multiple providers (OpenAI, Anthropic, Google, open-source models) to optimize for cost, latency, and quality. The naive approach: write a static routing table that maps request types to providers. The problem: provider performance changes constantly. OpenAI's latency spikes during peak hours. Anthropic's quality varies by prompt type. Open-source models excel at some tasks and fail at others.
MĀRGA's router uses a multi-armed bandit algorithm with contextual features to continuously learn the best provider for each request type:
// MĀRGA's adaptive routing — simplified from production
class AdaptiveRouter {
// Provider performance model — updated with every request
private providerModels: Map<string, ProviderModel> = new Map();
// Exploration rate: how often we try non-optimal providers to learn
private epsilon = 0.15; // 15% exploration, 85% exploitation
async route(request: LLMRequest): Promise<RoutingDecision> {
const features = this.extractFeatures(request);
// features: {taskType, complexity, tokenBudget, latencyBudget, qualityMin}
if (Math.random() < this.epsilon) {
// EXPLORE: Try a random provider to gather data
const provider = this.randomEligibleProvider(features);
return { provider, reason: 'exploration', confidence: 0 };
}
// EXPLOIT: Pick the best provider based on learned model
const scores = this.scoreProviders(features);
// scores = [
// { provider: 'openai/gpt-4o', score: 0.87, breakdown: {...} },
// { provider: 'anthropic/claude-sonnet', score: 0.92, breakdown: {...} },
// { provider: 'google/gemini-pro', score: 0.71, breakdown: {...} },
// ]
const best = scores.sort((a, b) => b.score - a.score)[0];
return {
provider: best.provider,
reason: 'exploitation',
confidence: best.score,
alternatives: scores.slice(1, 3), // For fallback
};
}
private scoreProviders(features: RequestFeatures): ProviderScore[] {
return Array.from(this.providerModels.entries()).map(([provider, model]) => {
// Each dimension is scored 0-1 based on historical performance
const latencyScore = model.predictLatency(features) < features.latencyBudget
? 1.0 - (model.predictLatency(features) / features.latencyBudget)
: 0; // Over budget = zero score
const qualityScore = model.predictQuality(features);
const costScore = 1.0 - (model.predictCost(features) / this.maxCost);
const reliabilityScore = model.getRecentUptime('1h');
// Weighted combination — weights are ALSO adaptive
const weights = this.currentWeights(features);
// Default: {latency: 0.2, quality: 0.4, cost: 0.3, reliability: 0.1}
// But shifts during incidents (reliability weight increases)
// Or during cost pressure (cost weight increases)
return {
provider,
score: (
latencyScore * weights.latency +
qualityScore * weights.quality +
costScore * weights.cost +
reliabilityScore * weights.reliability
),
breakdown: { latencyScore, qualityScore, costScore, reliabilityScore },
};
});
}
}But the routing is only half the story. The real magic is in how the model updates itself after every request:
// After every LLM request completes, MĀRGA records the outcome
async onRequestComplete(result: RequestResult) {
const model = this.providerModels.get(result.provider);
// Update latency model (exponential moving average)
model.latencyEMA.update(result.latencyMs, {
taskType: result.features.taskType,
tokenCount: result.tokenCount,
timeOfDay: new Date().getHours(),
});
// Update quality model (based on downstream success)
// Quality isn't just "did the API return 200?" — it's
// "did the generated output actually work?"
if (result.downstreamOutcome) {
model.qualityEMA.update(
result.downstreamOutcome.score,
result.features
);
}
// Update cost model
model.costModel.update({
inputTokens: result.inputTokens,
outputTokens: result.outputTokens,
actualCost: result.cost,
});
// Adaptive exploration rate
// More data → less exploration needed
const totalSamples = model.totalRequests;
this.epsilon = Math.max(
0.05, // Minimum 5% exploration (never stop learning)
0.3 * Math.exp(-totalSamples / 1000) // Decay from 30% to 5%
);
// Detect regime change — if recent performance deviates significantly
// from the model, increase exploration to re-learn
if (model.recentDeviation('1h') > 2.0) { // 2 std deviations
this.epsilon = Math.min(this.epsilon * 2, 0.4);
this.emit('regime_change_detected', {
provider: result.provider,
deviation: model.recentDeviation('1h'),
newEpsilon: this.epsilon,
});
}
}The regime change detection is critical. When a provider suddenly changes behavior — maybe OpenAI deploys a new model version, or Anthropic changes their rate limits — the system detects the statistical anomaly and automatically increases exploration to re-learn the new landscape. No human needs to notice and update a config file.
MĀRGA's Adaptation in Numbers
| Metric | Day 1 (Static Config) | Week 4 (Adapted) | Improvement |
|---|---|---|---|
| Cost per 1K requests | $4.20 | $0.79 | -81% |
| p50 latency | 1,240ms | 340ms | -73% |
| p99 latency | 8,400ms | 2,100ms | -75% |
| Quality score (avg) | 0.82 | 0.91 | +11% |
| Requests to cheapest-adequate provider | 34% | 78% | +129% |
| Regime changes detected | — | 7 | All auto-adapted |
The "requests to cheapest-adequate provider" metric is the clearest signal of adaptation. On day one, we were over-routing to expensive models. By week four, the system learned that 78% of requests could be handled by cheaper models without quality loss. No human tuned this. The algorithm found it.
The Build Engine's Learning Loop
MĀRGA adapts how it routes. Our build engine adapts how it generates work. Every completed task — success or failure — teaches the engine to write better task specifications, estimate durations more accurately, and decompose problems more effectively.
Prompt Evolution
When the build engine generates a task for an AI coding agent, it writes a prompt. The quality of that prompt directly determines the quality of the output. Our prompts evolve based on results:
// Prompt template with adaptive sections
class PromptEvolver {
private templates: Map<string, PromptTemplate> = new Map();
private successPatterns: Pattern[] = [];
private failurePatterns: Pattern[] = [];
async generatePrompt(task: TaskSpec): Promise<string> {
const template = this.templates.get(task.type) || this.defaultTemplate;
// Inject lessons learned from similar past tasks
const relevantLessons = this.findRelevantLessons(task, 5);
// Example lessons:
// - "When generating API handlers, always include error response types"
// - "PostgreSQL migrations must include both up and down directions"
// - "Test files should import from the package, not relative paths"
// Inject anti-patterns to avoid
const antiPatterns = this.findRelevantAntiPatterns(task, 3);
// Example: "Do NOT use synchronous file I/O in request handlers"
// Adjust specificity based on task type's historical success rate
const successRate = this.getSuccessRate(task.type, '7d');
const specificityLevel = successRate < 0.6
? 'very_specific' // Low success → be extremely precise
: successRate < 0.8
? 'specific' // Medium success → standard detail
: 'concise'; // High success → trust the agent, be brief
return template.render({
task,
lessons: relevantLessons,
antiPatterns,
specificityLevel,
exampleOutput: successRate < 0.7
? this.findBestExample(task.type) // Include example for struggling types
: null,
});
}
// After task completion, update the knowledge base
async onTaskComplete(task: Task, result: TaskResult) {
if (result.success && result.qualityScore > 0.85) {
// High-quality success — extract what worked
this.successPatterns.push({
taskType: task.type,
promptFeatures: this.extractPromptFeatures(task.prompt),
qualityScore: result.qualityScore,
timestamp: Date.now(),
});
// Was this type previously struggling? If success rate just crossed 0.8,
// the prompt evolution is working — record which changes helped
const previousRate = this.getSuccessRate(task.type, '14d');
const recentRate = this.getSuccessRate(task.type, '3d');
if (previousRate < 0.7 && recentRate > 0.8) {
this.recordBreakthrough(task.type, {
whatChanged: this.diffPromptTemplates(task.type, '14d'),
improvement: recentRate - previousRate,
});
}
}
if (!result.success) {
// Failure — extract what went wrong and create a lesson
const lesson = await this.analyzeFailure(task, result);
this.failurePatterns.push({
taskType: task.type,
failureMode: lesson.mode,
lesson: lesson.text,
timestamp: Date.now(),
});
}
}
}The specificity adaptation is subtle but powerful. When the system notices a task type has a low success rate, it automatically makes prompts more detailed — including examples, anti-patterns, and explicit constraints. As the success rate improves (because the more detailed prompts work), it gradually reduces specificity to save tokens. It's a closed-loop optimization on prompt engineering itself.
Duration Estimation
Accurate duration estimation matters for scheduling. If we overestimate, nodes sit idle. If we underestimate, the critical path lengthens because downstream tasks are dispatched too early.
Our duration estimator started with rough heuristics. It now predicts task duration with ±18% accuracy (down from ±45% in week one):
// Duration estimation — adaptive regression
class DurationEstimator {
private history: CompletedTask[] = [];
private featureWeights: Map<string, number> = new Map([
['base', 600], // 10 minutes base
['complexity', 120], // +2 min per complexity unit
['fileCount', 30], // +30s per file touched
['testCount', 45], // +45s per test expected
['dependencyCount', 60], // +1 min per dependency
['hasDatabase', 300], // +5 min if database changes
['isNewFile', 180], // +3 min for new file creation
]);
estimate(task: TaskSpec): DurationEstimate {
const features = this.extractFeatures(task);
let estimate = this.featureWeights.get('base')!;
for (const [feature, value] of Object.entries(features)) {
const weight = this.featureWeights.get(feature) || 0;
estimate += weight * (typeof value === 'boolean' ? (value ? 1 : 0) : value);
}
// Apply node-specific modifier (some nodes are faster)
const nodeModifier = this.getNodeSpeedModifier(task.targetNode);
estimate *= nodeModifier;
// Calculate confidence interval from prediction error history
const recentErrors = this.getRecentErrors(task.type, 20);
const stdDev = this.standardDeviation(recentErrors);
return {
expected: Math.round(estimate),
low: Math.round(estimate - 1.5 * stdDev),
high: Math.round(estimate + 1.5 * stdDev),
confidence: 1 / (1 + stdDev / estimate),
};
}
// Update weights after task completion using gradient-free optimization
onComplete(task: Task, actualDuration: number) {
const predicted = this.estimate(task);
const error = actualDuration - predicted.expected;
const features = this.extractFeatures(task);
// Simple online gradient descent on feature weights
const learningRate = 0.01;
for (const [feature, value] of Object.entries(features)) {
const numValue = typeof value === 'boolean' ? (value ? 1 : 0) : value;
if (numValue === 0) continue;
const currentWeight = this.featureWeights.get(feature) || 0;
const gradient = error * numValue;
const newWeight = currentWeight + learningRate * gradient;
// Clamp to prevent wild swings
this.featureWeights.set(feature, Math.max(0, Math.min(newWeight, 3600)));
}
this.history.push({ ...task, actualDuration, predictedDuration: predicted.expected });
}
}
// Week 1 vs Week 6 comparison:
// Feature | Week 1 Weight | Week 6 Weight | Δ
// base | 600s | 420s | Tasks are faster than expected
// complexity | 120s | 180s | Complexity matters more than we thought
// hasDatabase | 300s | 540s | DB tasks take way longer
// testCount | 45s | 25s | Tests are faster than assumed
// isNewFile | 180s | 90s | Agent is good at new filesRAKṢĀ's Learning Scanner
RAKṢĀ scans code for security vulnerabilities. The challenge: balancing sensitivity (catching real issues) with specificity (not drowning developers in false positives). Traditional scanners use static rule sets. RAKṢĀ's rules adapt based on outcomes.
// RAKṢĀ's adaptive rule scoring
class AdaptiveScanner {
private ruleScores: Map<string, RuleScore> = new Map();
async scan(code: string, context: ScanContext): Promise<Finding[]> {
const rawFindings = await this.runAllRules(code, context);
// Score each finding by the rule's historical accuracy
return rawFindings
.map(f => ({
...f,
adjustedSeverity: this.adjustSeverity(f),
falsePositiveProbability: this.estimateFPRate(f.ruleId, context),
}))
.filter(f => f.falsePositiveProbability < 0.7) // Drop likely FPs
.sort((a, b) => b.adjustedSeverity - a.adjustedSeverity);
}
private adjustSeverity(finding: Finding): number {
const score = this.ruleScores.get(finding.ruleId);
if (!score) return finding.severity; // No history — use default
// Adjust severity based on true positive rate
// Rules with high TP rate get boosted; low TP rate get demoted
return finding.severity * score.truePositiveRate;
}
// Developer marks finding as true/false positive
async onFeedback(findingId: string, isTruePositive: boolean) {
const finding = await this.getFinding(findingId);
const score = this.ruleScores.get(finding.ruleId) || {
truePositives: 0,
falsePositives: 0,
truePositiveRate: 0.5, // Start neutral
};
if (isTruePositive) {
score.truePositives++;
} else {
score.falsePositives++;
}
score.truePositiveRate = score.truePositives /
(score.truePositives + score.falsePositives);
this.ruleScores.set(finding.ruleId, score);
// If a rule's TP rate drops below 20%, propose rule deprecation
if (score.truePositiveRate < 0.2 &&
(score.truePositives + score.falsePositives) > 20) {
this.emit('rule_deprecation_proposed', {
ruleId: finding.ruleId,
tpRate: score.truePositiveRate,
sampleSize: score.truePositives + score.falsePositives,
});
}
}
}The results: RAKṢĀ's false positive rate dropped from 34% to 12% over its first month of adaptive scanning. It deprecated 3 rules that were generating noise and boosted 7 rules that were consistently finding real issues.
The Meta-Adaptation Layer
Here's where it gets recursive. The adaptation parameters themselves are adaptive. The learning rate, exploration rate, feedback decay — all of these are meta-parameters that the system tunes based on how well the adaptation is working.
// Meta-adaptation: tuning the tuners
class MetaAdaptation {
// Track how well each adaptation mechanism is performing
private adaptationEffectiveness: Map<string, EffectivenessMetric> = new Map();
async evaluateAndAdjust() {
// Run daily — assess whether adaptations are actually helping
for (const [mechanism, metric] of this.adaptationEffectiveness) {
const improvement = metric.recentImprovement('7d');
const previousImprovement = metric.recentImprovement('14d', '7d');
if (improvement > previousImprovement * 1.1) {
// Adaptation is accelerating — the current learning rate is good
// Maybe even increase it slightly
this.adjustLearningRate(mechanism, 1.05);
} else if (improvement < previousImprovement * 0.5) {
// Adaptation is stalling — try increasing exploration
this.adjustExplorationRate(mechanism, 1.2);
} else if (improvement < 0) {
// Adaptation is making things worse — reduce learning rate
this.adjustLearningRate(mechanism, 0.8);
// Also log for human review — something unusual is happening
this.alert('adaptation_regression', mechanism, improvement);
}
}
}
}
// Meta-metrics after 6 weeks:
// ┌─────────────────────────┬──────────────┬──────────────┐
// │ Adaptation Mechanism │ Learning Rate│ Effectiveness│
// ├─────────────────────────┼──────────────┼──────────────┤
// │ MĀRGA routing weights │ 0.008 → 0.012│ +81% cost ↓ │
// │ Duration estimation │ 0.01 → 0.007 │ ±18% accuracy│
// │ Prompt specificity │ 0.05 → 0.03 │ +18% success │
// │ RAKṢĀ rule scores │ 0.1 → 0.15 │ -22pp FP rate│
// │ Node capability scores │ 0.02 → 0.025 │ +12% util. │
// └─────────────────────────┴──────────────┴──────────────┘The Explore-Exploit Dilemma in Practice
Every adaptive system faces the same fundamental tension: exploit what you know works, or explore alternatives that might work better. Pure exploitation is fragile — you stop learning and miss improvements. Pure exploration is wasteful — you spend resources on experiments instead of work.
We solve this differently in each system:
| System | Exploration Strategy | Exploration Budget | Rationale |
|---|---|---|---|
| MĀRGA Router | ε-greedy with decay | 5-15% of requests | Every request is cheap to experiment with |
| Build Engine Tasks | Thompson sampling | ~10% of task generations | Tasks are expensive; sample from probability distributions |
| RAKṢĀ Rules | UCB1 (upper confidence bound) | New rules start with exploration bonus | Favor rules with uncertain performance (need data) |
| Node Scheduling | Periodic forced exploration | 1 task per node per hour | Nodes change over time; keep estimates fresh |
Thompson sampling for the build engine was a breakthrough. Instead of deterministically picking the best-known approach, we sample from the probability distribution of each approach's success. Approaches with uncertain estimates (wide distributions) naturally get explored more, while approaches with confident high estimates get exploited. It's mathematically optimal and empirically effective.
Avoiding Adaptation Pitfalls
Building adaptive systems is full of traps. We fell into several:
Pitfall 1: Adaptation Oscillation
Early on, MĀRGA would oscillate between two providers. It would learn that Provider A is best, route everything there, overwhelm Provider A's rate limits, observe degradation, switch to Provider B, repeat. The fix: incorporate load as a feature in the routing decision, not just observed performance.
Pitfall 2: Feedback Delay
Some feedback signals arrive late. A code generation might pass all tests immediately but cause a subtle bug that surfaces days later. We handle this with provisional scores— initial quality assessments that get updated when late feedback arrives. A task's quality score isn't final for 7 days.
Pitfall 3: Concept Drift
The world changes. A prompt strategy that worked in April might not work in May because the underlying LLM models have been updated. We use sliding windows for all adaptation — only the last N data points (or N days) influence the model. Old data is discounted, not discarded, so the system can detect when old patterns re-emerge.
An adaptive system that doesn't forget is just a system that averages everything and learns nothing.
The Compounding Effect
The real power of adaptive algorithms isn't any single optimization. It's the compounding. MĀRGA routes better → build engine gets better code → better code produces better feedback → feedback improves MĀRGA's quality estimates → MĀRGA routes even better. Each improvement amplifies the others.
Over six weeks, the compounding effect has produced results that no single optimization could achieve:
- LLM costs: Down 81% (from $4.20 to $0.79 per 1K requests)
- First-attempt task success: Up from 54% to 72%
- Median build time: Down from 24 minutes to 14 minutes
- False positive security findings: Down from 34% to 12%
- Human interventions: Down from 12 per week to 3.4
None of these improvements required a code change. The system improved itself, using the architecture we built and the data it generated. This is what "AI that improves AI" actually looks like — not some science fiction AGI, but well-designed feedback loops running at machine speed.
In the final part of this series, Scaling Autonomous Systems, we'll cover what happens when you need to take these adaptive systems from 3 nodes to 30 — the operational challenges, monitoring strategies, and hard-won lessons from our 300+ autonomous builds.
All performance numbers are from production systems at Avyay as of May 2026. MĀRGA's routing algorithm is proprietary, but we plan to publish a paper on the adaptive routing approach later this year. Code examples are simplified from production for clarity.