The standard startup build cycle looks like this: a founder or engineer wakes up, opens their laptop, picks a task from a Kanban board, works on it for a few hours, pushes code, and closes their laptop. Progress happens during business hours. Nights and weekends are dead time. If your team is in one timezone, you’re shipping during maybe 40 of the 168 hours in a week. That’s a 24% duty cycle.
We decided that was unacceptable.
At Avyay, we run an autonomous build engine — a system that continuously generates, schedules, and executes development tasks across multiple Mac nodes, 24 hours a day, 7 days a week. It doesn’t wait for standup. It doesn’t need sprint planning. It doesn’t stop building because it’s 2 AM in Singapore.
This isn’t a CI/CD pipeline. CI/CD triggers on human commits. Our system triggers on the absence of work — when the queue empties, the engine generates new tasks, prioritizes them by dependency graph, and dispatches them to whichever node is free.
The result: a two-person company that ships like a twenty-person team.
Why Continuous Pipelines Beat Wave-Based Builds
Most engineering organizations operate in waves. Two-week sprints. Weekly deployments. Quarterly planning cycles. The wave model assumes that humans need to batch work for coordination — you plan together, build together, review together, ship together.
AI agents don’t have this constraint. They don’t need to sync their calendars. They don’t lose context between sessions. They don’t need a standup to know what happened yesterday — they read the state file.
Wave-based builds have three structural problems that continuous pipelines eliminate:
1. Queue starvation between waves. In a sprint model, the first two days are planning and ramp-up. The last two days are stabilization and review. That leaves maybe 60% of the sprint for actual building. A continuous pipeline has zero ramp-up time — the moment a task completes, the next one starts.
2. Dependency serialization. In sprint planning, you identify dependencies and sequence them within the sprint. But what if task C depends on task A, which depends on a research spike that might take 20 minutes or 4 hours? Sprint planning can’t handle this uncertainty, so teams either over-allocate buffer time or hit blockers mid-sprint. A continuous pipeline resolves dependencies dynamically: when A finishes, C becomes eligible, and the scheduler dispatches it immediately.
3. Human context-switching costs. An engineer switching between tasks loses 15-25 minutes of productive time per switch. Over a sprint, this adds up to hours of wasted cognitive energy. AI agents have zero context-switching cost — they load the full task context from disk, execute, and move on.
The continuous model treats the build process like a production system, not a project. Production systems don’t take breaks. They don’t have sprints. They process work as it arrives, 24/7.
The Architecture: Orchestrator, Queue, and Nodes
The build engine has three components: a central orchestrator, a persistent task queue, and a fleet of execution nodes. Here’s how they fit together.
The Orchestrator
The orchestrator is the brain. It runs on a Linux server (a ThinkPad X1 Extreme, because we’re scrappy) and manages the entire lifecycle of every task — from generation through execution to verification.

    import asyncio
    from typing import List

    # TaskQueue, ExecutionNode, DependencyGraph, and AutonomousTaskGenerator
    # are the engine's own classes, defined elsewhere.

    class BuildOrchestrator:
        def __init__(self, queue: TaskQueue, nodes: List[ExecutionNode]):
            self.queue = queue
            self.nodes = nodes
            self.dependency_graph = DependencyGraph()
            self.task_generator = AutonomousTaskGenerator()

        async def run_forever(self):
            """The main loop. It literally never stops."""
            while True:
                # Phase 1: Check if queue needs replenishing
                if self.queue.eligible_count() < len(self.nodes):
                    # generate() takes only the project state, which
                    # already carries the completed-task history.
                    new_tasks = await self.task_generator.generate(
                        context=self.get_project_state()
                    )
                    self.queue.enqueue(new_tasks)

                # Phase 2: Dispatch eligible tasks to free nodes
                for node in self.nodes:
                    if node.is_idle():
                        task = self.queue.next_eligible(
                            node_capabilities=node.capabilities,
                            dependency_graph=self.dependency_graph
                        )
                        if task:
                            await node.dispatch(task)

                # Phase 3: Collect results from completed tasks
                for node in self.nodes:
                    if node.has_result():
                        result = await node.collect()
                        self.queue.mark_complete(result.task_id, result)
                        self.dependency_graph.resolve(result.task_id)

                await asyncio.sleep(30)  # Poll every 30 seconds

Three things to notice about this loop:
First, it’s pull-based in effect, not push-based. The orchestrator never hands a task to a busy node; a node gets work only once it reports idle. This is critical for fault tolerance. If a node crashes, its task returns to the queue. No work is lost.
Second, the queue replenishment is autonomous. When the queue runs low, the task generator creates new work based on the current state of the project. It reads the codebase, identifies gaps, and produces concrete, executable tasks. More on this below.
Third, the dependency graph is dynamic. Tasks declare their dependencies at creation time, but the graph updates as tasks complete. A task that was blocked at 10 PM might become eligible at 10:05 PM when its dependency finishes on another node.
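The loop calls `self.dependency_graph.resolve(...)` without showing the graph itself. The following is a minimal sketch of how such a graph could behave, assuming tasks are identified by string IDs. Apart from `resolve`, every name here (`add_task`, `eligible`, the internal dictionaries) is an illustrative assumption rather than the engine's actual interface.

```python
from collections import defaultdict


class DependencyGraph:
    """Tracks which tasks are still waiting on unfinished dependencies."""

    def __init__(self):
        self.waiting_on = {}                # task_id -> set of unfinished deps
        self.dependents = defaultdict(set)  # dep_id -> tasks waiting on it
        self.completed = set()

    def add_task(self, task_id, depends_on=()):
        # Dependencies that already finished never block the new task.
        pending = {d for d in depends_on if d not in self.completed}
        self.waiting_on[task_id] = pending
        for dep in pending:
            self.dependents[dep].add(task_id)

    def resolve(self, task_id):
        """Mark task_id complete and return the tasks it just unblocked."""
        self.completed.add(task_id)
        self.waiting_on.pop(task_id, None)
        unblocked = []
        for child in sorted(self.dependents.pop(task_id, ())):
            if child not in self.waiting_on:
                continue
            self.waiting_on[child].discard(task_id)
            if not self.waiting_on[child]:
                unblocked.append(child)
        return unblocked

    def eligible(self):
        """Tasks with no unfinished dependencies, in sorted order."""
        return sorted(t for t, deps in self.waiting_on.items() if not deps)


graph = DependencyGraph()
graph.add_task("research-spike")
graph.add_task("task-a", depends_on=["research-spike"])
graph.add_task("task-c", depends_on=["task-a"])

print(graph.eligible())                 # only the spike can run at first
print(graph.resolve("research-spike"))  # finishing it unblocks task-a
```

Because `resolve` returns the newly unblocked tasks, a scheduler could dispatch them on the very next poll instead of rescanning the whole queue.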
The Task Queue
The task queue is a JSON file. Not Redis. Not Kafka. Not a Postgres table. A JSON file.

    {
      "tasks": {
        "marketing-agent-research": {
          "status": "done",
          "completedAt": "2026-05-11T06:05:00+08:00",
          "title": "Marketing Agent — Research & Architecture",
          "deliverable": "plans/marketing-agent-spec.md",
          "sections": [
            "Channel Research & Analysis (7 channels, 3 tiers)",
            "Viral Launch Playbooks (PH, HN, Twitter/X, LinkedIn, Reddit, IH)",
            "Content Format Templates (7 platform-specific templates)",
            "Optimal Posting Schedules (weekly + launch day)",
            "Lead Capture & Attribution (UTM strategy, scoring model)",
            "Reusable Marketing Pipeline Architecture (4-stage pipeline)",
            "Autonomous Agent System Design (7 agents + orchestrator)",
            "Implementation Roadmap (6 phases)"
          ],
          "wordCount": 6500,
          "notes": "Comprehensive spec covering channels, playbooks, templates..."
        },
        "blog-article-3": {
          "status": "done",
          "completedAt": "2026-05-10T19:17:00+08:00",
          "title": "Your RAG Pipeline Is Just a Search Engine with Extra Steps",
          "slug": "rag-is-not-search",
          "url": "https://avyay.ai/blog/rag-is-not-search",
          "product": "Second Brain (VIDYĀ)",
          "wordCount": 3200,
          "readTime": "18 min"
        }
      }
    }

Why a JSON file? Three reasons:
Debuggability. When something goes wrong, you open the file and read it. No query language. No connection strings. No “what’s in the queue?” mystery. The entire state of the build engine is human-readable at all times.
Atomic commits. The queue file lives in a git repository. Every state change — task added, task started, task completed — is a git commit. You have full history. You can rewind. You can branch the queue for experiments. You can diff two queue states to see exactly what changed.
Simplicity. We’re running 5-10 concurrent tasks across 2-3 nodes. At this scale, reading and rewriting a small JSON file is cheaper than a round trip to a database server. The orchestrator reads the file, updates it, and writes it back. The whole operation takes a few milliseconds at most.
This won’t scale to 10,000 tasks. We don’t need it to. The build engine generates tasks just-in-time — the queue rarely has more than 20 items. If you’re building a system that needs to scale to 10,000 concurrent tasks, you have different problems and different resources. Don’t over-engineer for hypothetical load.
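The read, update, and write-back cycle can be sketched as follows. This is an illustration under stated assumptions, not the engine's actual code: the function name `update_task` is hypothetical, and the temp-file-plus-`os.replace` step is one common way to make sure a crash mid-write never leaves a truncated queue file. Committing the file to git after each call is left out.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def update_task(queue_path, task_id, **fields):
    """Read the queue file, update one task, and write it back atomically."""
    with open(queue_path, encoding="utf-8") as f:
        queue = json.load(f)

    task = queue["tasks"].setdefault(task_id, {})
    task.update(fields)
    if fields.get("status") == "done":
        task.setdefault("completedAt", datetime.now(timezone.utc).isoformat())

    # Write to a temp file in the same directory, then swap it into place.
    # os.replace is atomic, so readers see the old file or the new one.
    directory = os.path.dirname(os.path.abspath(queue_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(queue, f, indent=2, ensure_ascii=False)
        os.replace(tmp_path, queue_path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return queue


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "queue.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"tasks": {"blog-article-3": {"status": "queued"}}}, f)
    updated = update_task(path, "blog-article-3", status="done", wordCount=3200)
    print(updated["tasks"]["blog-article-3"]["status"])
```

With a single orchestrator process as the only writer, no file locking is needed; multiple writers would change that.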
The Execution Nodes
Each node is a Mac — MacBook Pros running macOS, connected to the orchestrator over a secure tunnel (Tailscale, in our case). Why Macs?
1. Xcode builds require macOS. If you’re building iOS apps, you need Macs. There’s no workaround. Apple’s license agreement prohibits running macOS in VMs on non-Apple hardware, and cross-compilation for iOS is a dead end for anything non-trivial.
2. Apple Silicon is absurdly efficient. An M-series MacBook Pro draws 20-30W under sustained load. A comparable cloud instance costs $0.50-2.00/hour. Our nodes run 24/7 at home, on residential power. The economics are hard to beat.
3. AI workloads love unified memory. Apple Silicon’s unified memory architecture lets the GPU address most of system RAM, so a 32GB MacBook Pro can run LLMs that would otherwise need a discrete GPU with a similar amount of VRAM on an x86 machine. Since our build agents use LLMs for code generation and analysis, this matters.
Each node runs a lightweight agent that accepts tasks, executes them in an isolated environment, and reports results:

    class ExecutionNode:
        def __init__(self, node_id: str, capabilities: NodeCapabilities, agent):
            self.node_id = node_id
            self.capabilities = capabilities
            self.agent = agent  # the AI agent that performs the work
            self.current_task = None

        async def execute(self, task: BuildTask) -> TaskResult:
            """Execute a task in isolation."""
            self.current_task = task
            try:
                # Create isolated workspace
                workspace = await self.create_workspace(task)

                # Load task-specific context
                context = await self.load_context(
                    task.dependencies_output,
                    task.reference_files,
                    task.instructions
                )

                # Execute via AI agent
                result = await self.agent.run(
                    system_prompt=task.to_system_prompt(),
                    context=context,
                    workspace=workspace,
                    timeout=task.estimated_duration * 2  # 2x buffer
                )

                # Verify deliverables exist
                for deliverable in task.expected_deliverables:
                    if not workspace.exists(deliverable):
                        return TaskResult(
                            status="failed",
                            error=f"Missing deliverable: {deliverable}"
                        )

                # Quality check
                quality = await self.verify_quality(result, task)
                if quality.score < task.min_quality_threshold:
                    return TaskResult(
                        status="retry",
                        quality_report=quality
                    )

                return TaskResult(status="done", output=result)
            finally:
                # Free the node on every exit path, including failed and retry
                self.current_task = None

Notice the quality verification step. The node doesn’t just execute; it verifies. Every task has expected deliverables and a minimum quality threshold. If the output doesn’t meet the threshold, the task goes back to the queue with a quality report, and the next attempt uses that report as additional context. Tasks get better on retry, not worse.
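The retry-with-feedback idea can be sketched in a few lines. Everything below is illustrative: `Attempt`, `RetryingTask`, `run_with_retries`, and the cap of three attempts are assumptions made for the example, and `fake_agent` stands in for a real agent call.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    score: float
    feedback: str


@dataclass
class RetryingTask:
    instructions: str
    min_quality_threshold: float
    max_attempts: int = 3                        # assumed cap, not from the post
    history: list = field(default_factory=list)  # earlier quality reports

    def build_context(self):
        """Base instructions plus feedback from every failed attempt so far."""
        notes = [f"Attempt {i + 1} scored {a.score:.2f}: {a.feedback}"
                 for i, a in enumerate(self.history)]
        return "\n".join([self.instructions, *notes])


def run_with_retries(task, execute):
    """Call execute(context) until the score clears the threshold."""
    for _ in range(task.max_attempts):
        attempt = execute(task.build_context())
        if attempt.score >= task.min_quality_threshold:
            return "done", attempt
        task.history.append(attempt)  # feed the report into the next try
    return "failed", task.history[-1]


def fake_agent(context):
    # Stand-in for a real agent: does better once it has seen the feedback.
    if "missing error handling" in context:
        return Attempt(score=0.9, feedback="ok")
    return Attempt(score=0.4, feedback="missing error handling")


task = RetryingTask("Build the login API endpoint", min_quality_threshold=0.8)
status, final = run_with_retries(task, fake_agent)
print(status, final.score, len(task.history))
```

The key design point is that the quality report travels with the task, so whichever node picks up the retry sees what went wrong last time.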
Dependency Resolution: The Hard Part
Task scheduling is easy when tasks are independent. Run them all in parallel, collect results, done. The hard part is when tasks depend on each other — and in real-world product development, almost everything depends on something.
Our dependency graph handles four types of relationships:
Hard Dependencies
Task B cannot start until Task A produces a specific artifact. Example: “Build the API routes” cannot start until “Design the API schema” produces api-schema.yaml.

    {
      "task": "build-api-routes",
      "depends_on": {
        "design-api-schema": {
          "type": "hard",
          "requires_artifact": "api-schema.yaml"
        }
      }
    }

Hard dependencies are binary: either the artifact exists or it doesn’t. The scheduler checks the filesystem, not the task status. This is important: if the dependency task failed but somehow produced the artifact (partial completion), the dependent task can still proceed. We care about artifacts, not ceremonies.
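A filesystem-based check for the spec above might look like the sketch below. The function name `hard_dependencies_met` and the `artifact_root` parameter are assumptions for illustration; the spec layout follows the JSON shown earlier.

```python
import tempfile
from pathlib import Path


def hard_dependencies_met(task_spec, artifact_root):
    """True when every hard dependency's artifact exists on disk."""
    root = Path(artifact_root)
    for dep in task_spec.get("depends_on", {}).values():
        if dep.get("type") != "hard":
            continue  # soft and other types are handled elsewhere
        if not (root / dep["requires_artifact"]).is_file():
            return False
    return True


spec = {
    "task": "build-api-routes",
    "depends_on": {
        "design-api-schema": {
            "type": "hard",
            "requires_artifact": "api-schema.yaml",
        }
    },
}

with tempfile.TemporaryDirectory() as root:
    print(hard_dependencies_met(spec, root))  # artifact not produced yet
    (Path(root) / "api-schema.yaml").write_text("openapi: 3.0.0\n")
    print(hard_dependencies_met(spec, root))  # artifact now on disk
```

Note that the check never looks at the upstream task's status, which is exactly what lets a partially failed task still unblock its dependents.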
Soft Dependencies
Task B benefits from Task A’s output but can proceed without it. Example: “Write user documentation” benefits from “Build the UI” (so screenshots are available), but can proceed with placeholder descriptions.

    {
      "task": "write-user-docs",
      "depends_on": {
        "build-ui": {
          "type": "soft",
          "timeout": "2h",
          "fallback": "proceed_with_placeholders"
        }
      }
    }

Soft dependencies have a timeout. If the dependency doesn’t resolve within the timeout, the dependent task starts anyway with a fallback strategy. This prevents pipeline stalls from non-critical dependencies.
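One way to evaluate a soft dependency is sketched below. The `"2h"` format comes from the spec above; support for `m` and `s` suffixes, and the function names `parse_timeout` and `soft_dependency_decision`, are assumptions added for the example.

```python
import re
from datetime import datetime, timedelta, timezone

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_timeout(text):
    """Turn strings like '2h', '30m', or '45s' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported timeout: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def soft_dependency_decision(dep, dependency_done, waiting_since, now):
    """Return 'proceed', 'wait', or the configured fallback strategy."""
    if dependency_done:
        return "proceed"
    if now - waiting_since >= parse_timeout(dep["timeout"]):
        return dep["fallback"]
    return "wait"


dep = {"type": "soft", "timeout": "2h", "fallback": "proceed_with_placeholders"}
queued_at = datetime(2026, 5, 11, 6, 0, tzinfo=timezone.utc)

print(soft_dependency_decision(dep, False, queued_at, queued_at + timedelta(minutes=30)))
print(soft_dependency_decision(dep, False, queued_at, queued_at + timedelta(hours=3)))
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the decision easy to test.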
Resource Dependencies
Two tasks can’t run simultaneously because they need the same resource. Example: two tasks both need to modify the same configuration file, or both need exclusive access to a test database.

    class ResourceManager:
        def __init__(self):
            self.locks = {}  # resource_id -> task_id

        def can_schedule(self, task: BuildTask) -> bool:
            for resource in task.exclusive_resources:
                if resource in self.locks:
                    return False
            return True

        def acquire(self, task: BuildTask):
            for resource in task.exclusive_resources:
                self.locks[resource] = task.task_id

        def release(self, task: BuildTask):
            for resource in task.exclusive_resources:
                if self.locks.get(resource) == task.task_id:
                    del self.locks[resource]

Resource dependencies are the sneakiest source of bugs. Two AI agents trying to modify the same file simultaneously will produce conflicts that are expensive to resolve. The resource manager prevents this by serializing access to shared resources.
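In the class above, the check and the acquire are separate calls, so a caller has to remember to do both. A variant that folds them into a single all-or-nothing step is sketched here. `ResourceLocks` and `try_acquire` are hypothetical names, and this is an alternative shape for the same idea rather than the engine's own implementation.

```python
class ResourceLocks:
    """Variant of the resource manager with an all-or-nothing acquire."""

    def __init__(self):
        self.locks = {}  # resource_id -> task_id

    def try_acquire(self, task_id, resources):
        """Lock every resource for task_id, or none if any is taken."""
        if any(self.locks.get(r, task_id) != task_id for r in resources):
            return False
        for r in resources:
            self.locks[r] = task_id
        return True

    def release(self, task_id):
        """Drop every lock held by task_id."""
        self.locks = {r: t for r, t in self.locks.items() if t != task_id}


locks = ResourceLocks()
print(locks.try_acquire("task-a", ["config/app.yaml", "test-db"]))
print(locks.try_acquire("task-b", ["test-db", "README.md"]))  # test-db is held
locks.release("task-a")
print(locks.try_acquire("task-b", ["test-db", "README.md"]))
```

Because a failed attempt locks nothing, a blocked task never holds one resource while waiting for another, which rules out the simplest form of deadlock.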
Ordering Dependencies
Task B should run after Task A for logical coherence, but there’s no data dependency. Example: “Write the blog post about Feature X” should run after “Build Feature X” — not because the blog post needs any artifact from the build, but because writing about a feature you haven’t built yet produces vaporware.
These four dependency types cover 95% of real-world scheduling needs. The remaining 5% are genuinely weird edge cases that we handle by letting the task generator add explicit notes in the task description.
Autonomous Task Generation: The Engine Within the Engine
This is the part that makes the system truly autonomous. When the queue runs low, the task generator creates new tasks. Not from a backlog. Not from Jira tickets. From analysis of the current project state.
Here’s how it works:

    from typing import List

    class AutonomousTaskGenerator:
        async def generate(self, context: ProjectState) -> List[BuildTask]:
            """Generate new tasks based on project state analysis."""
            # 1. Scan codebase for gaps
            gaps = await self.analyze_codebase(context.repo_path)
            # Missing tests, incomplete docs, TODO comments,
            # unused imports, dead code, missing error handling

            # 2. Check product roadmap
            roadmap_items = await self.parse_roadmap(
                context.roadmap_file
            )
            next_milestone = roadmap_items.next_incomplete()

            # 3. Review recent completions for follow-up work
            recent = context.completed_tasks[-10:]
            followups = await self.identify_followups(recent)
            # A completed API might need docs, tests,
            # monitoring, and a blog post

            # 4. Score and prioritize
            # Only the next incomplete milestone competes with gaps
            # and follow-ups; later roadmap items wait their turn.
            candidates = gaps + [next_milestone] + followups
            scored = await self.score_candidates(
                candidates,
                context.strategic_priorities,
                context.resource_availability
            )

            # 5. Generate concrete task specs
            tasks = []
            for candidate in scored[:10]:  # Top 10 candidates
                task = await self.specialize(candidate)
                task.dependencies = self.infer_dependencies(
                    task, context.dependency_graph
                )
                task.estimated_duration = self.estimate_duration(
                    task, context.historical_durations
                )
                tasks.append(task)
            return tasks

The task generator isn’t a prompt that says “come up with things to do.” It’s a structured analysis pipeline that:
- Reads the actual codebase — not a description of it, the code itself
- Identifies concrete gaps — missing tests, incomplete implementations, documentation debt
- Cross-references the roadmap — what should we be building next?
- Generates follow-up work from recently completed tasks — if you built an API, you probably need tests, docs, and monitoring
- Produces fully specified tasks with dependencies, deliverables, and time estimates
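The scoring step is the least obvious part of that pipeline, so here is a hedged sketch of what a ranking function could look like. The `Candidate` fields, the source weights, and the size penalty are all assumptions chosen for illustration; a real scorer would be tuned to the project.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    title: str
    source: str         # "gap", "roadmap", or "followup"
    priority_tags: set  # strategic themes this work supports
    est_hours: float


def score(candidate, strategic_priorities, source_weights):
    """Higher is better: favor strategic fit, penalize oversized tasks."""
    fit = len(candidate.priority_tags & strategic_priorities)
    size_penalty = max(0.0, candidate.est_hours - 3.0)  # hours beyond 3
    return source_weights.get(candidate.source, 1.0) + 2.0 * fit - size_penalty


def top_candidates(candidates, strategic_priorities, source_weights, limit=10):
    """Rank candidates by score and keep the best `limit`."""
    ranked = sorted(
        candidates,
        key=lambda c: score(c, strategic_priorities, source_weights),
        reverse=True,
    )
    return ranked[:limit]


candidates = [
    Candidate("Build the entire authentication system", "roadmap", {"product"}, 12.0),
    Candidate("Add missing tests for auth middleware", "gap", {"quality"}, 1.5),
    Candidate("UTM attribution tracking system", "followup", {"marketing"}, 2.0),
]
weights = {"roadmap": 3.0, "followup": 2.0, "gap": 1.0}

for c in top_candidates(candidates, {"marketing"}, weights):
    print(c.title)
```

In this example the oversized roadmap item sinks to the bottom, which is a signal that it should be decomposed before it is scheduled.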
Here’s a real example. After the build engine completed the “Marketing Agent — Research & Architecture” task (which produced a 6,500-word spec covering 8 major sections), the task generator automatically created follow-up tasks:
- Implementation of the viral launch playbook automation
- Content template engine for platform-specific formatting
- UTM attribution tracking system
- Social media posting scheduler
Each of these was a concrete, executable task with clear deliverables — not a vague “implement marketing” bullet point.
What Most People Miss: The Cold Start Problem
Everyone who hears about autonomous build engines asks the same question: “How does it know what to build?”
The answer is: it doesn’t, at first. And that’s the cold start problem.
An autonomous build engine needs three things to function:
1. A product vision. Not a backlog, a vision. “We’re building an AI-powered stock advisor for retail investors in India” is enough. The engine can decompose that into components, research the domain, identify technical requirements, and generate an initial task graph.
2. Architectural decisions. The engine needs to know your tech stack, deployment target, and key constraints. “React frontend, Python backend, deploy on Railway, use Supabase for auth” constrains the solution space enough for the engine to generate buildable tasks.
3. A feedback loop. The first tasks the engine generates will be mediocre. The research spike will be too broad. The architecture doc will miss important constraints. The prototype will make questionable design decisions. You need a human reviewing outputs for the first 20-30 tasks, correcting course, and feeding those corrections back into the system.
After that initial calibration period, the engine has enough context to generate tasks that are genuinely useful. It’s learned your preferences — you want comprehensive tests, not minimal ones. You prefer explicit error handling over try/catch-all. You want docs that explain why, not just what.
This is the part that most autonomous coding tools skip. They assume you can go from “build me an app” to working software in one shot. That works for trivial apps. For anything with real requirements, the system needs a calibration period where a human shapes its understanding of quality.
Common Mistakes and Tradeoffs
After running this system for months, here’s what we’ve learned the hard way:
Mistake 1: Over-Parallelizing
Our first instinct was maximum parallelism — run as many tasks as possible on every available node. This created integration nightmares. Five agents building five components simultaneously, each making slightly different assumptions about interfaces and data models.
The fix: Limit parallel tasks to what the dependency graph supports, not what the hardware supports. If three tasks should logically be sequential, don’t parallelize them just because you have three free nodes. Let those nodes work on unrelated branches of the dependency graph.
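Python's standard `graphlib` module makes this rule easy to express. The sketch below groups tasks into dispatch rounds where each round contains only tasks whose predecessors are finished, capped at the number of free nodes. The function name `plan_batches` and the task names are hypothetical.

```python
from graphlib import TopologicalSorter


def plan_batches(run_after, free_nodes):
    """Group tasks into dispatch rounds of at most free_nodes tasks each."""
    if free_nodes < 1:
        raise ValueError("need at least one free node")
    sorter = TopologicalSorter(run_after)
    sorter.prepare()
    batches, pending = [], []
    while sorter.is_active():
        # Only tasks whose predecessors are all done become ready.
        pending.extend(sorted(sorter.get_ready()))
        batch, pending = pending[:free_nodes], pending[free_nodes:]
        batches.append(batch)
        sorter.done(*batch)
    return batches


# "schema" -> "routes" -> "tests" is a chain; the other two are unrelated.
run_after = {
    "routes": {"schema"},
    "tests": {"routes"},
    "blog-draft": set(),
    "landing-page": set(),
}

for round_number, batch in enumerate(plan_batches(run_after, free_nodes=3), 1):
    print(round_number, batch)
```

Even with three free nodes, the chain still runs one step per round; the spare capacity goes to the unrelated tasks instead.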
Mistake 2: Trusting Time Estimates
AI agents are terrible at estimating how long tasks will take. A “2-hour task” might take 20 minutes (if the solution is obvious) or 6 hours (if it hits an edge case that requires research).
The fix: Use historical duration data, not agent estimates. After 100+ tasks, you have a distribution of actual durations by task type. Use the P75 (75th percentile) as your estimate, not the agent’s guess. And always set timeout = 2× estimate to prevent runaway tasks.
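The P75 rule takes only a few lines with the standard library. This is a sketch: the function names and the sample durations are made up for illustration, and `statistics.quantiles` uses its default interpolation method.

```python
import statistics


def p75_estimate(durations_minutes):
    """75th percentile of past durations for one task type."""
    if len(durations_minutes) < 2:
        raise ValueError("need at least two historical durations")
    # quantiles(n=4) returns the three quartile cut points; the last is P75.
    return statistics.quantiles(durations_minutes, n=4)[2]


def timeout_for(durations_minutes):
    """Timeout is twice the P75 estimate."""
    return 2 * p75_estimate(durations_minutes)


# Hypothetical history for one task type, in minutes.
history = [20, 35, 40, 55, 60, 90, 120, 360]

print(p75_estimate(history))
print(timeout_for(history))
```

Unlike the mean, the P75 is barely moved by the single 360-minute outlier, so one pathological run does not inflate every future estimate.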
Mistake 3: No Human Review Gate
For the first few weeks, we ran the engine fully autonomously — tasks generated, executed, and committed without human review. The output quality was... variable. Some tasks produced excellent work. Others produced plausible-looking code that was fundamentally wrong.
The fix: Staged autonomy. Research and planning tasks run fully autonomously. Implementation tasks require human review before merge. As the engine’s track record improves for specific task types, you can selectively remove the review gate.
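A review gate along these lines could be encoded as a small policy function. The thresholds (20 reviewed tasks, 95% approval) and the names `TrackRecord` and `needs_human_review` are assumptions for the sketch; the right numbers depend on how costly a bad merge is.

```python
from dataclasses import dataclass


@dataclass
class TrackRecord:
    reviewed: int = 0  # tasks of this type a human has reviewed
    approved: int = 0  # of those, how many passed review


AUTONOMOUS_TYPES = {"research", "planning"}


def needs_human_review(task_type, records, min_reviewed=20, min_approval=0.95):
    """Decide whether a finished task must wait for human sign-off."""
    if task_type in AUTONOMOUS_TYPES:
        return False
    record = records.get(task_type, TrackRecord())
    if record.reviewed < min_reviewed:
        return True  # not enough history to trust this task type yet
    return record.approved / record.reviewed < min_approval


records = {
    "implementation": TrackRecord(reviewed=30, approved=25),
    "docs": TrackRecord(reviewed=40, approved=39),
}

for task_type in ("research", "implementation", "docs", "migration"):
    print(task_type, needs_human_review(task_type, records))
```

Task types with no history default to requiring review, so a new kind of work never slips through ungated.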
Mistake 4: Monolithic Task Specs
Early task specs were too big — “Build the entire authentication system” as a single task. These took forever, produced low-quality output, and were hard to verify.
The fix: Decompose aggressively. The ideal task takes 1-3 hours and produces a single, verifiable deliverable. “Design the auth schema,” “Implement password hashing,” “Build the login API endpoint,” “Write auth middleware,” “Add auth tests”: five small tasks beat one big one, every time.
The Numbers
Since deploying the build engine, here’s what changed:
| Metric | Before (manual) | After (autonomous) |
|---|---|---|
| Tasks completed/week | 8-12 | 45-70 |
| Active build hours/day | 6-8 | 22-23 |
| Time-to-first-commit for new features | 2-3 days | 2-4 hours |
| Queue starvation incidents | N/A | <2/week |
| Human review hours/week | 40+ | 8-10 |
The 22-23 hours of active build time leaves about 1-2 hours of daily downtime for node maintenance, updates, and the occasional crash recovery. The system isn’t literally 24/7, but it’s close enough that the distinction doesn’t matter.
More interesting than the raw throughput numbers: the type of work changed. Before, human time was spent on implementation. Now, human time is spent on review and strategic direction. The humans decide what to build and whether it’s good enough. The engine handles how to build it and when.
Should You Build One?
Probably not yet. But soon.
This architecture works for us because of specific conditions:
1. We’re building multiple products simultaneously. The engine’s value scales with the number of independent workstreams. If you’re building one product with one team, the coordination overhead of the engine outweighs its benefits.
2. We have well-defined quality criteria. The engine can verify its own output because we have explicit standards: test coverage thresholds, documentation requirements, code style rules. If your quality criteria are “it feels right,” the engine can’t verify that.
3. We’re comfortable with AI-generated code. Not everyone is. If your team insists on hand-crafting every line, an autonomous build engine is a cultural mismatch, regardless of its technical merits.
4. The economics favor hardware over people. Running three Mac minis 24/7 costs less per month than a single junior developer in a major tech hub. If your cost structure is different (cheap labor and expensive hardware), the math doesn’t work.
If these conditions apply to you, start small. One node. A JSON queue. A simple orchestrator that dispatches tasks from a manually populated backlog. Get comfortable with the failure modes. Then add the autonomous task generator. Then add more nodes.
The build engine that ships your product while you sleep isn’t a moonshot. It’s a natural extension of the tooling we already have. The hard part isn’t the architecture — it’s trusting it enough to go to bed.
Gaurav Sharma is the founder of Avyay (अव्यय). He builds distributed AI systems using consumer hardware and too much coffee. Follow the build at avyay.ai/blog.