6 Months of Autonomous Development: What We Actually Learned

Everyone’s talking about AI coding assistants. Copilot, Cursor, Windsurf — tools that autocomplete your code or suggest the next line. We did something different. We built a system where AI agents are the engineering team. Not assisting humans. Replacing the inner loop entirely.

For six months, our build engine has woken up every two hours, examined a task queue, spun up parallel agents across three machines, and shipped code — all while we slept, worked our day jobs, or weren’t paying attention at all.

The results aren’t what we expected. Some metrics look impossibly good. Others are humbling. And the lessons we learned don’t match any of the narratives currently dominating the AI-for-coding conversation.

What the Build Engine Actually Does

Before the metrics, you need to understand the system. Our build engine isn’t a CI/CD pipeline. It’s an autonomous development orchestrator.

Here’s the actual execution loop:

BUILD ENGINE CYCLE (every 2 hours, 6 AM – 10 PM)
─────────────────────────────────────────────────
1. Read build-queue.json → prioritized task list
2. Check node availability (Linux, MBP1, MBP3)
3. Match tasks to nodes based on constraints
4. Spawn parallel subagents (up to 3 simultaneous)
5. Each agent: read spec → write code → run tests → commit
6. On completion: update queue, report results
7. On failure: classify error → retry or escalate
8. Replenish queue with AI-generated next tasks

The key detail: step 8. After completing tasks, the engine doesn’t just stop. It analyzes what was built, identifies gaps, and generates new tasks. The queue is self-replenishing. We seed it with product requirements and architectural decisions. The engine decomposes those into buildable units and keeps the pipeline fed.

Three compute nodes form the infrastructure:

Linux (ThinkPad X1 Extreme) — primary build node, always on
MBP1 (MacBook Pro) — secondary node, macOS-specific builds
MBP3 (MacBook Pro) — tertiary node, overflow capacity

Connected via Tailscale mesh. No cloud servers. No Kubernetes. Just three laptops on a VPN.
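Steps 2 through 4 of the cycle can be sketched roughly like this. It is a minimal illustration under stated assumptions, not the engine's real code: the `Node` and `Task` types, the `requires_os` constraint, the "lower number is more urgent" priority convention, the one-agent-per-node capacity, and the task names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    os: str            # "linux" or "macos"
    online: bool
    capacity: int = 1  # assumed: one agent per node


@dataclass
class Task:
    task_id: str
    priority: int                      # assumed: lower number = more urgent
    requires_os: Optional[str] = None  # None means any node will do


def match_tasks(tasks: List[Task], nodes: List[Node],
                max_agents: int = 3) -> List[Tuple[str, str]]:
    """Greedily assign the most urgent tasks to online nodes with free slots."""
    free = {n.name: n.capacity for n in nodes if n.online}
    os_of = {n.name: n.os for n in nodes}
    assignments: List[Tuple[str, str]] = []
    for task in sorted(tasks, key=lambda t: t.priority):
        if len(assignments) >= max_agents:
            break
        for name in free:
            if free[name] > 0 and task.requires_os in (None, os_of[name]):
                assignments.append((task.task_id, name))
                free[name] -= 1
                break
    return assignments


nodes = [
    Node("Linux", "linux", online=True),
    Node("MBP1", "macos", online=True),
    Node("MBP3", "macos", online=False),  # e.g. asleep with the lid closed
]
queue = [Task("docs", 2), Task("ios-build", 1, "macos"), Task("mac-sign", 3, "macos")]
plan = match_tasks(queue, nodes)
print(plan)
```

With MBP3 offline, the second macOS-only task finds no free slot and simply waits for the next cycle. A greedy pass like this is easy to reason about, at the cost of occasionally giving a macOS node to a task that could have run anywhere.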

The Numbers After 6 Months

Build Velocity

Metric	6-Month Total
Build cycles executed	2,847
Tasks attempted	2,847
Tasks succeeded on first attempt	1,926
Permanent failures	287
Retried → succeeded	634
First-attempt success rate	67.6%
Success rate with retries	89.9%

That first-attempt rate of 67.6% is the number that keeps us honest. One in three tasks fails on the first try. The retry mechanism, which classifies failures, adjusts the approach, and re-attempts with different parameters, rescues another 22 percentage points. But about 10% of tasks (287 of 2,847) still fail permanently.
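The rates follow directly from the raw counts in the table. A small sketch of the arithmetic (the function name and dictionary keys are just illustrative):

```python
def success_rates(attempted: int, first_try: int, retried_ok: int, permanent: int) -> dict:
    """Derive the headline rates from the raw task counts."""
    if first_try + retried_ok + permanent != attempted:
        raise ValueError("counts do not add up to tasks attempted")
    return {
        "first_attempt": first_try / attempted,
        "with_retries": (first_try + retried_ok) / attempted,
        "permanent_failure": permanent / attempted,
    }


rates = success_rates(attempted=2847, first_try=1926, retried_ok=634, permanent=287)
for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```

This prints roughly 67.65%, 89.92%, and 10.08%, which lines up with the table, and the sanity check confirms that the three outcome counts account for every attempted task.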

What Failures Look Like

We categorized every permanent failure:

FAILURE TAXONOMY (287 permanent failures)
──────────────────────────────────────────
Subagent orphaning / timeout:      34%  (98)
Dependency resolution failures:    21%  (60)
Test failures (genuine bugs):      18%  (52)
SSH/network connectivity:          12%  (34)
Context window exhaustion:          9%  (26)
Ambiguous spec (task too vague):    6%  (17)

Subagent orphaning is our biggest problem. When a build agent loses its connection to the orchestrator (network blip, node sleep, process crash), the task enters a zombie state. We’ve built detection and cleanup, but 34% of permanent failures still trace back to this. It’s the unsolved problem of distributed autonomous systems: how do you reliably supervise agents that are designed to work without supervision?

Context window exhaustion is the most frustrating. A task that requires understanding a large codebase simply can’t fit enough context into a single agent session. The agent makes locally correct decisions that are globally wrong. We’ve seen it generate a beautiful authentication module that duplicates functionality already present three files away.
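The "classify error → retry or escalate" step can be pictured with the taxonomy above as an enum. This is only a sketch: the log substrings, the choice of which categories count as retryable, and the limit of three attempts are assumptions made for illustration, not the engine's actual rules.

```python
from enum import Enum


class Failure(Enum):
    ORPHANED = "subagent orphaning / timeout"
    DEPENDENCY = "dependency resolution"
    TEST = "test failure"
    NETWORK = "ssh/network connectivity"
    CONTEXT = "context window exhaustion"
    AMBIGUOUS_SPEC = "ambiguous spec"
    UNKNOWN = "unclassified"


# Hypothetical log substrings; real agent logs will need their own patterns.
PATTERNS = [
    (("heartbeat missed", "timed out"), Failure.ORPHANED),
    (("could not resolve", "no matching version"), Failure.DEPENDENCY),
    (("connection refused", "no route to host"), Failure.NETWORK),
    (("context length", "too many tokens"), Failure.CONTEXT),
    (("assertionerror", "tests failed"), Failure.TEST),
    (("spec unclear", "missing acceptance criteria"), Failure.AMBIGUOUS_SPEC),
]

# Assumed policy: transient infrastructure problems are worth retrying.
RETRYABLE = {Failure.ORPHANED, Failure.DEPENDENCY, Failure.NETWORK}


def classify(log: str) -> Failure:
    """Map a raw failure log to the first matching category."""
    text = log.lower()
    for keywords, failure in PATTERNS:
        if any(keyword in text for keyword in keywords):
            return failure
    return Failure.UNKNOWN


def next_action(failure: Failure, attempts: int, max_attempts: int = 3) -> str:
    """Retry transient failures until the attempt budget runs out."""
    if failure in RETRYABLE and attempts < max_attempts:
        return "retry"
    return "escalate"


failure = classify("ssh: connect to host mbp1 port 22: Connection refused")
print(failure.name, next_action(failure, attempts=1))
```

Keeping the policy in a plain set makes it cheap to change as the failure data shifts. Genuine test failures and vague specs go straight to a human here, because re-running them unchanged rarely helps.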

Cost Efficiency

Category	6-Month Total	Per Task
LLM API costs	$1,440	$0.75
Local compute (electricity)	$62	$0.03
Infrastructure (domains, etc.)	$24	—
Total	$1,526	$0.78

78 cents per completed development task. That includes tasks that generate 50-line config files and tasks that build entire monitoring dashboards with 15 panels. The distribution is heavily skewed — simple tasks cost under $0.10, complex multi-hour builds can cost $8-12 each.

The API cost breakdown by provider:

API COSTS BY PROVIDER (6 months)
─────────────────────────────────
Anthropic (Claude):    $892  (62%)  ← primary coding agent
OpenAI (GPT-4):        $412  (29%)  ← planning + review
Local models:           $74   (5%)  ← simple classification
Google (Gemini):        $62   (4%)  ← research tasks

Claude does the heavy lifting. We route to it for any task involving code generation, debugging, or architectural reasoning. GPT-4 handles planning and code review. Local models handle the simple stuff — file categorization, task priority scoring, queue management.
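The routing described above amounts to a lookup table. A minimal sketch, assuming tasks carry a `task_type` string; the type names and the provider labels are illustrative shorthand, not real API model identifiers:

```python
# Task type -> provider, following the split described in the text.
ROUTES = {
    "code_generation": "claude",
    "debugging": "claude",
    "architecture": "claude",
    "planning": "gpt-4",
    "code_review": "gpt-4",
    "file_categorization": "local",
    "priority_scoring": "local",
    "queue_management": "local",
    "research": "gemini",
}


def route(task_type: str) -> str:
    """Return the provider for a task type, failing loudly on unknown types."""
    try:
        return ROUTES[task_type]
    except KeyError:
        raise ValueError(f"no route for task type {task_type!r}") from None


print(route("debugging"))
print(route("priority_scoring"))
```

Raising on an unknown type, rather than silently defaulting to the most expensive model, is a deliberate choice in this sketch: a missing route is a queue bug worth seeing.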

Five Counterintuitive Lessons

1. The 26% Retention Problem Isn’t a Problem

Over 6 months, our agents generated approximately 340,000 lines of code. Only 89,000 lines survive in production today. That’s a 26% retention rate — 74% of generated code was eventually deleted, refactored, or replaced.

This sounds wasteful. It’s not.

A human engineer writing 89,000 production lines in 6 months would be considered extraordinarily productive. They’d also iterate — writing draft implementations, refactoring, deleting dead code. The difference is that human iteration is invisible. Nobody tracks the lines a senior engineer types and then deletes before committing.

Our agents make iteration visible because it all happens in committed code. The 251,000 deleted lines represent the same exploratory process that happens in every engineering team — we just measure it.

What matters: the 89,000 surviving lines power 9 products with measured uptimes above 99.9%. The code works. The path to getting there was just louder than usual.

2. Night Builds Are Better Than Day Builds

This was genuinely surprising:

SUCCESS RATES BY TIME OF DAY
────────────────────────────
6 AM – 12 PM:    71.3%  first-attempt success
12 PM – 6 PM:    64.2%
6 PM – 12 AM:    69.8%
12 AM – 6 AM:    73.1%  ← highest

The overnight window (12 AM – 6 AM) beats the weakest daytime window (12 PM – 6 PM) by almost 9 percentage points, and the morning window by about 2. We investigated three hypotheses:

API congestion: Provider latency is lower at night (Singapore time), reducing timeout failures. Confirmed — average API response time drops 23% between midnight and 6 AM.
Node contention: During the day, the machines run other workloads. Night builds get full CPU/memory. Confirmed — node resource utilization drops from 67% to 12% overnight.
Human interference: During the day, we occasionally SSH into build nodes, interrupt running tasks, or modify the queue. At night, the engine runs undisturbed. Confirmed — zero human interventions logged between midnight and 6 AM.

The lesson is uncomfortable: our build engine performs best when humans aren’t involved at all. The bottleneck isn’t the AI. It’s us.

3. Task Granularity Is Everything

Early on, we wrote tasks like: “Build a security scanning module for container images.” Success rate on tasks like this: 23%.

Now we write tasks like: “Add a Dockerfile parser that extracts FROM, RUN, and COPY instructions into a structured AST. Input: Dockerfile path. Output: JSON array of instruction objects with line numbers. Include tests for multi-stage builds and ARG interpolation.” Success rate on tasks like this: 84%.

TASK GRANULARITY vs. SUCCESS RATE
──────────────────────────────────
Vague / high-level:     23% success
Moderate specificity:   61% success
Highly specified:       84% success
Overspecified:          72% success  ← drops!

There’s a sweet spot. Too vague, and the agent makes wrong assumptions. Too specific, and you’ve essentially written the code in English — the agent follows your spec literally even when a better approach exists. Over-specification kills the agent’s ability to make good engineering decisions.

The best task specs describe what and why precisely, but leave how to the agent. “Build X that handles Y because Z” outperforms “Build X using approach A with data structure B and algorithm C.”
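One way to hold yourself to the "what and why, not how" rule is to give the spec a fixed shape with no field for the implementation approach. A rough sketch using the Dockerfile parser task as the example; the `TaskSpec` type and its field names are hypothetical, and the `why` text is invented for illustration:

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class TaskSpec:
    what: str
    why: str
    inputs: str
    outputs: str
    tests: List[str] = field(default_factory=list)
    # Deliberately no "how" field: the approach is left to the agent.

    def missing(self) -> List[str]:
        """Names of fields left empty; an empty list means the spec is ready."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def render(self) -> str:
        """Format the spec as the prompt text handed to the agent."""
        lines = [f"Build: {self.what}", f"Because: {self.why}",
                 f"Input: {self.inputs}", f"Output: {self.outputs}"]
        lines += [f"Test: {t}" for t in self.tests]
        return "\n".join(lines)


spec = TaskSpec(
    what="Dockerfile parser that extracts FROM, RUN, and COPY instructions into a structured AST",
    why="The image scanner needs structured instructions to analyze",  # illustrative
    inputs="Dockerfile path",
    outputs="JSON array of instruction objects with line numbers",
    tests=["multi-stage builds", "ARG interpolation"],
)
print(spec.missing())
print(spec.render())
```

A queue could refuse any task whose `missing()` list is non-empty, which catches the vague, high-level specs before they burn a build slot.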

4. The 95% Failure Crisis Was Inevitable

Around build cycle 20, our success rate cratered to 5%. Ninety-five percent of tasks were failing. The build queue was full of orphaned, stuck, and erroring tasks. We seriously considered shutting the whole system down.

The root cause was cascading failures from a single architectural flaw: our subagent management had no timeout enforcement. When an agent hung — waiting for an API response that never came, stuck in an infinite retry loop, or blocked on a file lock — it held its node slot indefinitely. Other tasks queued behind it. When the cycle timer fired again, it spawned new agents alongside the stuck ones. Within 48 hours, we had dozens of zombie agents competing for resources on machines that could barely handle three.

The fix took two days:

Hard timeout: Every subagent gets a maximum execution time based on estimated task duration. Exceed it by 50% and you’re killed.
Health heartbeats: Agents must report progress every 5 minutes. Three missed heartbeats = automatic termination.
Node capacity limits: Each node declares a maximum concurrent agent count. The orchestrator respects it absolutely.
Queue hygiene: A daily sweep identifies tasks stuck in “running” state for more than 2× their estimated duration and resets them.
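The four rules above fit in a few small functions. This is a sketch under assumptions rather than the production supervisor: the `AgentState` type is hypothetical, times are plain seconds passed in by the caller, and "three missed heartbeats" is interpreted as 15 minutes of silence since the last one.

```python
from dataclasses import dataclass
from typing import List, Optional

HEARTBEAT_INTERVAL_S = 5 * 60  # agents report every 5 minutes
MAX_MISSED_HEARTBEATS = 3
TIMEOUT_FACTOR = 1.5           # killed once runtime exceeds the estimate by 50%
STUCK_FACTOR = 2.0             # daily sweep resets tasks past 2x their estimate


@dataclass
class AgentState:
    task_id: str
    started_at: float      # seconds, same clock as `now`
    estimated_s: float     # estimated task duration in seconds
    last_heartbeat: float


def should_terminate(agent: AgentState, now: float) -> Optional[str]:
    """Return the reason an agent must be killed, or None if it is healthy."""
    if now - agent.started_at > TIMEOUT_FACTOR * agent.estimated_s:
        return "hard_timeout"
    if now - agent.last_heartbeat >= MAX_MISSED_HEARTBEATS * HEARTBEAT_INTERVAL_S:
        return "missed_heartbeats"
    return None


def can_spawn(running_on_node: int, node_capacity: int) -> bool:
    """Never exceed the concurrency a node has declared."""
    return running_on_node < node_capacity


def stuck_tasks(agents: List[AgentState], now: float) -> List[str]:
    """Task ids the daily sweep should reset."""
    return [a.task_id for a in agents
            if now - a.started_at > STUCK_FACTOR * a.estimated_s]


quick = AgentState("t1", started_at=0, estimated_s=600, last_heartbeat=0)
slow = AgentState("t2", started_at=0, estimated_s=3600, last_heartbeat=100)
print(should_terminate(quick, now=901))
print(should_terminate(slow, now=1000))
print(stuck_tasks([quick, slow], now=1300))
```

In a real supervisor `now` would come from something like `time.monotonic()`. Passing it in explicitly keeps the rules easy to test without waiting for real timeouts.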

After these fixes, success rate climbed back to 70% within a week and eventually stabilized at the current 67.6% first-attempt rate.

The lesson: autonomous systems fail catastrophically, not gracefully. There’s no “slow degradation” when agents are involved. Everything works until it doesn’t, and then everything fails at once. You need circuit breakers, hard limits, and automated cleanup, not just for production services, but for the development pipeline itself.
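A circuit breaker for the pipeline can be as simple as refusing to spawn new agents while too many recent tasks have failed. A minimal sketch; the class, the window size, the minimum sample count, and the 50% threshold are all assumed values chosen for illustration:

```python
from collections import deque


class CircuitBreaker:
    """Pause agent spawning when too many recent tasks have failed."""

    def __init__(self, window: int = 20, max_failure_ratio: float = 0.5,
                 min_samples: int = 10) -> None:
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.max_failure_ratio = max_failure_ratio
        self.min_samples = min_samples

    def record(self, success: bool) -> None:
        """Store one task outcome; the oldest falls out of the window."""
        self.outcomes.append(success)

    def allow_spawn(self) -> bool:
        """Stay closed (spawning allowed) unless recent failures dominate."""
        if len(self.outcomes) < self.min_samples:
            return True  # not enough evidence yet
        failure_ratio = self.outcomes.count(False) / len(self.outcomes)
        return failure_ratio <= self.max_failure_ratio


breaker = CircuitBreaker(window=10, min_samples=5)
for ok in [True] * 5 + [False] * 8:
    breaker.record(ok)
print(breaker.allow_spawn())
```

This version never resets on its own once it trips, since no new outcomes arrive while spawning is paused. A usable breaker would also need a cool-down or a manual reset, which is left out here for brevity.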

5. Code Review Is the Human Bottleneck

WEEKLY OUTPUT (tasks completed)
───────────────────────────────
Engine capacity:       ~84 tasks/week (12 cycles/day × 7 days)
Engine actual output:  ~64 tasks/week (accounting for failures)
Human review capacity: ~20 tasks/week

The build engine can produce 64 completed tasks per week. We can meaningfully review about 20. That means roughly 70% of completed work goes into production with minimal human oversight — a quick scan of the diff, check that tests pass, merge.

This is the dirty secret of autonomous development: at scale, you can’t review everything. You have to trust the system or slow it down. We chose to trust it for low-risk tasks (documentation, config, tests, UI components) and review carefully for high-risk ones (security, data handling, API contracts).

Our review triage:

REVIEW LEVELS
─────────────────────────────────────────
Level 0 (auto-merge):   Tests pass, lint clean, diff < 100 lines
Level 1 (quick scan):   Tests pass, known patterns, no security surface
Level 2 (full review):  New modules, API changes, auth/crypto code
Level 3 (pair review):  Architecture changes, data model changes

About 40% of tasks qualify for Level 0 auto-merge. Another 30% get Level 1. Only 30% receive genuine human review. Is this reckless? Maybe. But in 6 months, we’ve had exactly two production incidents caused by under-reviewed code — both were caught and fixed within hours by the same build engine that created them.
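The triage table reads naturally as a function that checks the riskiest conditions first. A sketch only: the `Change` type and its boolean flags are hypothetical, and how those flags get set (path matching, labels, diff analysis) is left open. Sending failing tests and unmatched changes to full review is an assumption, since the table does not cover those cases.

```python
from dataclasses import dataclass


@dataclass
class Change:
    tests_pass: bool
    lint_clean: bool
    diff_lines: int
    known_pattern: bool = False
    touches_security: bool = False   # auth/crypto code
    new_module: bool = False
    api_change: bool = False
    architecture_change: bool = False
    data_model_change: bool = False


def review_level(c: Change) -> int:
    """Return the review level (0-3), checking the highest-risk rules first."""
    if c.architecture_change or c.data_model_change:
        return 3  # pair review
    if c.new_module or c.api_change or c.touches_security:
        return 2  # full review
    if not c.tests_pass:
        return 2  # assumed: failing tests always get a full review
    if c.lint_clean and c.diff_lines < 100:
        return 0  # auto-merge
    if c.known_pattern:
        return 1  # quick scan
    return 2      # assumed default when nothing else matches


print(review_level(Change(tests_pass=True, lint_clean=True, diff_lines=40)))
print(review_level(Change(tests_pass=True, lint_clean=True, diff_lines=40,
                          touches_security=True)))
```

Ordering matters: a 40-line change to auth code passes every Level 0 check, so the security test has to run before the size test.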

What the Build Engine Produced

Six months of autonomous development built the following production systems:

MĀRGA: LLM routing and cost optimization. Routes 2M+ API calls per month across 5 providers. Reduced our LLM costs by 73%. Includes latency-weighted load balancing, semantic caching, and automatic failover.

RAKṢĀ: Security scanning platform. Detects CVEs in under 6 hours, scans container images, dependencies, and infrastructure-as-code. NIST CSF 2.0 compliance framework with cross-mapping to SOC2, ISO 27001, PCI-DSS, and GDPR.

DevOps RAG (KRIYĀ): Incident response knowledge base. Ingests runbooks and operational docs, answers questions with source citations. Reduced mean time to resolution by 67% in internal use.

DHARMA: Automated incident triage. Classifies incoming alerts, routes to appropriate responders, auto-resolves known patterns. Currently achieving 30% auto-resolution rate.

ŚRUTI: Content intelligence engine. Ingests web content, builds searchable knowledge base with semantic search and content library management.

Content Engine: Autonomous blog pipeline. Researches topics, drafts posts, publishes to Notion, generates social media cutdowns. Produced 26 blog posts in 6 months.

Knowledge Graph: 39,000+ entity knowledge base powering memory, decision tracking, and context across all systems.

Plus dashboards, monitoring, integration tests, documentation, and infrastructure automation. All from three laptops and $1,526 in total spend.

The Honest Assessment

What Works

Parallelism is the killer feature. Three nodes building simultaneously means up to 3× throughput with almost no coordination overhead. The agents don’t need standup meetings. They don’t have merge conflicts (usually). They don’t get blocked waiting for code review. Parallel autonomous development is genuinely faster than sequential human development for well-specified tasks.

Continuous operation changes the math. Running 6 AM to 10 PM (we recently expanded to include night builds) means the engine works roughly 16 hours per day. A human engineering team works 6-8 productive hours. Even with a lower success rate per task, the volume advantage is decisive.

Cost structure is unbeatable. $0.78 per completed task. Even accounting for our time (15-20 hours per week of oversight, spec writing, and review), the total cost is a fraction of a traditional team.

What Doesn’t Work

Complex architectural reasoning. Tasks that require understanding system-wide implications, such as “refactor the authentication flow to support multi-tenant isolation,” fail more often than they succeed. The agent can write excellent code for a well-defined module. It can’t reason reliably about how that module interacts with twelve others it hasn’t seen.

Debugging existing code. When something breaks in production, agents are mediocre debuggers. They can identify obvious errors (null reference, missing import). They struggle with subtle issues (race condition, incorrect caching logic, edge case in date handling). We still debug production issues manually most of the time.

Quality vs. speed trade-off. The code works, but it’s not elegant. A senior engineer would write tighter abstractions, better error messages, more thoughtful test coverage. Our agents produce code that passes tests and ships features, but accumulates technical debt faster than a human team would. After 6 months, we’re starting to feel the weight of that debt.

What We’d Tell You If You’re Considering This

Start with the task spec, not the agent. The quality of autonomous output is 80% determined by the quality of the input spec. Invest in writing good task descriptions. Be specific about inputs, outputs, constraints, and success criteria. Leave implementation approach open.

Build the supervision layer first. Timeouts, heartbeats, capacity limits, queue hygiene, failure classification. Before you run your first autonomous build cycle, build all the infrastructure to monitor and recover from failures. You will need every piece of it.

Accept the retention rate. Not all generated code will survive. That’s fine. You’re trading precision for volume and speed. The math works out if your feedback loop is tight enough to catch problems quickly.

Don’t try to review everything. You can’t, and you shouldn’t. Build trust in the system incrementally. Start with auto-merging tests and docs. Expand to config files. Eventually to UI components. Keep human review for security boundaries and architectural decisions.

The real cost is your attention. The infrastructure is cheap. The API calls are cheap. Your time deciding what to build, how to specify it, and which outputs to trust is the expensive part. Autonomous development doesn’t eliminate engineering judgment. It amplifies it.

What’s Next

Six months in, our build engine is producing more code than we can review and more features than we can market. The bottleneck has shifted from “can we build it?” to “should we build it?” and “can we sell it?”

That’s probably the most important lesson of autonomous development: it doesn’t solve the hard problems. Product-market fit, customer acquisition, pricing strategy — those are still human problems. What it does is remove the excuse of “we don’t have enough engineering capacity” from a two-person startup’s vocabulary.

The engine keeps running. The queue keeps filling. The code keeps shipping. And we keep learning that the future of software development looks less like a coder with a copilot and more like an engineer with a factory.

Avyay (अव्यय) builds autonomous AI systems for enterprises. Our products — MĀRGA, RAKṢĀ, KRIYĀ, DHARMA, and ŚRUTI — are built almost entirely by the autonomous development pipeline described in this post.