ShieldOps: Autonomous Security Remediation with Devin AI + Datadog

PRs Shipped
Time to First PR: <8m
Follow-up Messages
Issues Auto-Created

The Security Debt Nobody Talks About

Every engineering team has a security debt problem. Not because they don’t care — because the tooling makes caring nearly impossible.

A scanner runs nightly. It finds 247 CVEs across your dependency tree. An engineer gets assigned. They try the bump, watch 14 tests fail, and close the ticket with “needs investigation.”

The industry numbers most organizations quietly accept:

Mean Time to Remediate (MTTR) for critical vulns: 60–90 days
Percentage of scanner findings that get fixed: <30%
Engineering hours per non-trivial CVE: 2–8 hours
Dependabot PRs that merge without manual intervention: ~40%

That last number matters: roughly 60% of Dependabot PRs need manual intervention before they can merge. Dependabot bumps versions. When the build breaks, it walks away, and the engineer is back to square one.

“Detection is solved. Remediation isn’t. The gap between ‘found’ and ‘fixed’ is where security incidents live.”

What We Built

ShieldOps is an autonomous security remediation platform — not a scanner, not a dashboard. It takes vulnerabilities from “detected” to “pull request ready for review” without human intervention.

Architecture: a trust control plane orchestrating three systems:

Devin AI — autonomous coding agent
Datadog — observability for the remediation pipeline
GitHub — source of truth for code, issues, PRs

Scan → Triage → Devin Fleet → Policy Boundary → GitHub PRs + Datadog
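As a rough illustration of the first hop in that flow (scan to triage), the sketch below flattens a pip-audit JSON report into one record per vulnerability. It assumes the report shape produced by recent pip-audit releases when run with JSON output (a top-level "dependencies" list whose entries carry "name", "version", and "vulns"). The `Finding` type, the `parse_pip_audit` helper, and the sample report are hypothetical and are not taken from the ShieldOps codebase.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One vulnerability on one dependency, normalized for triage."""
    package: str
    installed: str
    vuln_id: str
    fix_versions: tuple[str, ...]


def parse_pip_audit(report_json: str) -> list[Finding]:
    """Flatten a pip-audit JSON report into one Finding per vulnerability."""
    report = json.loads(report_json)
    findings = []
    for dep in report.get("dependencies", []):
        # Dependencies without vulnerabilities contribute nothing.
        for vuln in dep.get("vulns", []):
            findings.append(
                Finding(
                    package=dep["name"],
                    installed=dep["version"],
                    vuln_id=vuln["id"],
                    fix_versions=tuple(vuln.get("fix_versions", [])),
                )
            )
    return findings


# Hypothetical sample report, not real scanner output.
SAMPLE = json.dumps({
    "dependencies": [
        {"name": "examplepkg", "version": "1.0.0", "vulns": [
            {"id": "EXAMPLE-0001", "fix_versions": ["1.0.1"]},
        ]},
        {"name": "cleanpkg", "version": "2.0.0", "vulns": []},
    ]
})

findings = parse_pip_audit(SAMPLE)
print(findings)
```

Normalizing early means the later stages can treat findings from different scanners the same way, provided each scanner gets its own small adapter like this one.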

6-Stage Pipeline

Stage	What Happens
01 Scan	pip-audit, npm audit, trivy, semgrep
02 Triage	Severity × reachability × fix availability × complexity
03 Devin Fleet	Context-aware prompts, reads CHANGELOGs, fixes breaking call sites
04 Policy Boundary	Auto-merge · Human review · Blocked
05 Evidence Bundle	What changed, why, blast radius, confidence score
06 Datadog	Fleet health, trust split, cost, audit trail
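The triage row multiplies four factors. One way to read that, as a minimal sketch only: every weight, the 1 to 5 complexity scale, and the names `TriageInput` and `triage_score` are assumptions made for illustration, and the real scoring may differ.

```python
from dataclasses import dataclass

# Hypothetical weights; the source does not publish the actual values.
SEVERITY_WEIGHT = {"LOW": 1.0, "MEDIUM": 2.0, "HIGH": 3.0, "CRITICAL": 4.0}
UNREACHABLE_FACTOR = 0.25


@dataclass(frozen=True)
class TriageInput:
    name: str
    severity: str        # LOW, MEDIUM, HIGH, or CRITICAL
    reachable: bool      # is the vulnerable code path actually used?
    fix_available: bool  # does a patched version exist?
    complexity: int      # 1 (trivial bump) to 5 (architectural migration)


def triage_score(item: TriageInput) -> float:
    """Severity x reachability x fix availability x (inverse) complexity."""
    if not 1 <= item.complexity <= 5:
        raise ValueError("complexity must be between 1 and 5")
    reach = 1.0 if item.reachable else UNREACHABLE_FACTOR
    fix = 1.0 if item.fix_available else 0.0  # nothing to apply yet
    ease = 1.0 / item.complexity              # simpler fixes are queued first
    return SEVERITY_WEIGHT[item.severity] * reach * fix * ease


queue = [
    TriageInput("major-upgrade", "CRITICAL", True, True, 2),
    TriageInput("migration", "HIGH", True, True, 5),
    TriageInput("patch-bump", "LOW", True, True, 1),
]
ranked = sorted(queue, key=triage_score, reverse=True)
print([item.name for item in ranked])
```

Because the factors multiply, a finding with no available fix scores zero regardless of severity, and a complex HIGH finding can rank below a trivial LOW one. Whether that ordering is desirable is a policy choice, not something the source specifies.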

The Hero Story: Flask 2→3 in 500K Lines of Code

This is the moment that defines what autonomous remediation actually means.

Apache Superset. 500K lines of Python. Flask 2.3.3, which reached end-of-support. Issue #1 in our auto-created triage: CRITICAL.

What Dependabot Does

Opens a PR bumping Flask from 2.3.3 to 3.1.0. The build fails — breaking imports, changed APIs, deprecated patterns. The PR sits red forever. An engineer closes it with “too complex for automated fix.”

What Devin Did

Read the Flask 3.x CHANGELOG and migration guide
Found all version constraints across 5 files
Updated pyproject.toml dependency spec
Updated requirements/base.txt pin
Updated requirements/development.txt
Fixed integration test imports
Fixed security dataset test compatibility
Verified no breaking API call sites remained

# Before (pyproject.toml)
"flask>=2.2.5, <4.0.0"

# After  
"flask>=3.1.0, <4.0.0"

The result: PR #10, +11/-12 across 5 files. Clean. Mergeable. No human touched it.

View PR #10 on GitHub →
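The pyproject.toml change shown earlier amounts to raising the lower bound of a version constraint. A minimal sketch of that single step, using only the standard library, follows. The helper name `raise_lower_bound` is hypothetical, and this is not how Devin produced the edit; it only reproduces the before/after strings.

```python
import re


def raise_lower_bound(spec: str, new_minimum: str) -> str:
    """Replace the version that follows the first '>=' in a requirement string."""
    updated, count = re.subn(
        r">=\s*[0-9][0-9A-Za-z.]*",  # '>=' followed by a version-like token
        f">={new_minimum}",
        spec,
        count=1,
    )
    if count == 0:
        raise ValueError(f"no '>=' bound found in {spec!r}")
    return updated


print(raise_lower_bound("flask>=2.2.5, <4.0.0", "3.1.0"))
# prints: flask>=3.1.0, <4.0.0
```

The regex handles only simple `>=` bounds. A production tool would more likely parse the specifier with a PEP 440 aware library rather than pattern-match strings.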

“That’s the work only an autonomous coding agent can do — reading the error, understanding the CHANGELOG, fixing the call sites, iterating until green.”

The Other PRs: Not Just the Hero

ShieldOps didn’t just handle the hard one. Here’s the full picture:

PR	Title	What Devin Did	Changes
#8	Dockerfile Hardening	Pinned base images to SHA256 digests, purged dev packages, added HEALTHCHECK	+20/-4
#9	Paramiko CVE-2026-44405	Upgraded 3.5.1→5.0.0, handled breaking changes (GSSAPI removed, DH modulus), verified API compatibility	+8/-4
#10	Flask 2.3→3.x (Hero)	Major version upgrade, fixed imports across 5 files in 500K LOC codebase	+11/-12

7 Issues Auto-Created with Severity Labels

Severity	Count	Examples
CRITICAL	1	Flask EOL upgrade
HIGH	3	SQLAlchemy 1.4→2.0, flask-sqlalchemy, npm audit
MEDIUM	1	Dockerfile hardening
LOW	1	Paramiko CVE (CVSS 3.4)
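For findings that carry a CVSS score, the label can be derived from the CVSS v3.x qualitative rating scale, under which the Paramiko score of 3.4 falls in the Low band. The sketch below assumes ShieldOps uses those standard bands, which the source does not confirm. Labels such as CRITICAL for an end-of-life framework clearly come from other signals, so a score lookup like this would be only one input.

```python
def cvss_to_severity(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative rating band."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS scores range from 0.0 to 10.0, got {score}")
    if score == 0.0:
        return "NONE"
    if score < 4.0:
        return "LOW"       # 0.1 to 3.9
    if score < 7.0:
        return "MEDIUM"    # 4.0 to 6.9
    if score < 9.0:
        return "HIGH"      # 7.0 to 8.9
    return "CRITICAL"      # 9.0 to 10.0


print(cvss_to_severity(3.4))
# prints: LOW
```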

The Trust Boundary

The VP question: “Is this thing safe to run?”

The goal is not to remove humans but to make their job trivial. Three tiers:

🟢 Auto-Merge Ready: Tests pass, no breaking changes, high confidence, patch/minor upgrade

🟡 Human Review: Major upgrade, breaking changes fixed, or sensitive paths touched; ships with a 2-minute evidence bundle

🔴 Blocked: Tests fail or confidence too low; nothing merges and an alert fires
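The three tiers can be expressed as a small decision function. This is a sketch under stated assumptions: the `FixResult` fields, the 0 to 1 confidence scale, and both thresholds are hypothetical, since the source describes the tiers only qualitatively.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AUTO_MERGE = "auto-merge ready"
    HUMAN_REVIEW = "human review"
    BLOCKED = "blocked"


@dataclass(frozen=True)
class FixResult:
    tests_pass: bool
    confidence: float              # hypothetical 0.0 to 1.0 scale
    major_upgrade: bool
    breaking_changes: bool
    touches_sensitive_paths: bool


# Hypothetical thresholds; real values would be set by policy owners.
MIN_CONFIDENCE = 0.5
AUTO_MERGE_CONFIDENCE = 0.9


def classify(fix: FixResult) -> Tier:
    """Apply the policy boundary: blocked first, then review, then auto-merge."""
    if not fix.tests_pass or fix.confidence < MIN_CONFIDENCE:
        return Tier.BLOCKED
    needs_human = (
        fix.major_upgrade or fix.breaking_changes or fix.touches_sensitive_paths
    )
    if needs_human or fix.confidence < AUTO_MERGE_CONFIDENCE:
        return Tier.HUMAN_REVIEW
    return Tier.AUTO_MERGE


patch_bump = FixResult(True, 0.95, False, False, False)
major_bump = FixResult(True, 0.95, True, True, False)
red_build = FixResult(False, 0.95, False, False, False)
print(classify(patch_bump), classify(major_bump), classify(red_build))
```

Checking the blocking conditions first means a failing test suite can never be overridden by a high confidence score, which matches the "nothing merges" rule for the blocked tier.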

VP Dashboard

Sessions Launched
PRs Delivered
Hero PR Shipped: ✓
Issues Created
Breaking Change Handled
Follow-up Messages
Files Changed
Time to First PR: <8m

The Enterprise Use Case

Every enterprise with 50+ repos faces the same math.

Metric	Manual Process	ShieldOps
Cost per CVE fix	$600 (engineer time)	~$15 (Devin session)
MTTR for critical vulns	60–90 days	Hours
Remediation coverage	<30% of findings	80%+
Audit evidence	Manual, inconsistent	Automated, every fix
Scale	Linear with headcount	Concurrent fleet

What CISOs actually care about:

Remediation velocity (not scan counts)
Evidence for auditors (evidence bundles on every PR)
Predictable cost per fix
Fleet scaling without headcount
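For the auditor-facing item, an evidence bundle could be as simple as a serializable record with the four fields the pipeline table names: what changed, why, blast radius, and confidence score. The sketch below is illustrative only. The class name, the field types, and every value in the example are hypothetical placeholders, not data from a real ShieldOps PR.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvidenceBundle:
    what_changed: str
    why: str
    blast_radius: list[str]  # files or modules the change can affect
    confidence: float        # hypothetical 0.0 to 1.0 scale

    def to_json(self) -> str:
        """Serialize with stable key order so bundles diff cleanly in audits."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values for illustration only.
bundle = EvidenceBundle(
    what_changed="Raised the lower bound of an example dependency",
    why="Installed version has a known vulnerability",
    blast_radius=["pyproject.toml", "requirements/base.txt"],
    confidence=0.8,
)
print(bundle.to_json())
```

Attaching the JSON to each PR, for instance as a comment or build artifact, would give auditors one consistent format to query. How ShieldOps actually stores its bundles is not described in the source.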

What Most People Miss

The value isn’t fixing CVEs faster. It’s building a trust layer that lets autonomous agents operate safely in production codebases.

Security remediation is the perfect proving ground:

Bounded scope (one CVE, one fix)
Measurable success (tests pass or they don’t)
Natural trust tiers (auto-merge, human review, blocked)
Evidence-rich (every fix has a paper trail)

The pattern — scan → triage → autonomous execution → policy boundary → human oversight — applies to all autonomous engineering. Feature development. Refactoring. Migration. Security is just application #1.

“ShieldOps isn’t a faster vulnerability scanner. It’s a trust control plane for an autonomous engineering workforce.”

Tradeoffs and Honest Assessment

What Doesn’t Work Yet

Architectural migrations (SQLAlchemy 1.4→2.0 requires understanding query patterns across entire codebase)
Test suite fragility (if existing tests are bad, Devin can’t tell if its fix broke something real)
Session failure rate: 15–20% of sessions don’t converge (complexity exceeds agent capability)
Cost: Devin sessions aren’t free. At scale, ACU budgeting becomes a real concern.
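The failure rate and the session cost interact. As back-of-the-envelope arithmetic, if a failed session costs the same as a successful one and attempts are independent (both assumptions, not claims from the source), the expected spend per converged fix is the session cost divided by the success rate. The sketch uses the ~$15 per-session figure from the enterprise comparison and the 15–20% failure range above.

```python
def cost_per_converged_fix(session_cost: float, failure_rate: float) -> float:
    """Expected spend per successful fix when failed sessions are also billed."""
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return session_cost / (1.0 - failure_rate)


for rate in (0.15, 0.20):
    print(f"{rate:.0%} failure rate: ${cost_per_converged_fix(15.0, rate):.2f}")
# prints:
# 15% failure rate: $17.65
# 20% failure rate: $18.75
```

Under these assumptions the effective cost stays well below the $600 manual figure, but it grows quickly as the failure rate climbs, which is one reason budgeting per converged fix may be more useful than budgeting per session.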

Where Humans Are Still Essential

Setting policy boundaries
Reviewing major architectural changes
Expanding auto-merge rules as trust data accumulates

Where This Goes

CI integration: scan on every merge, not just scheduled
Multi-repo fleet: same policy boundary across every repo in the org
Expanding auto-merge: as confidence data accumulates, more fixes qualify for auto-merge
ACU budgeting: cost optimization at fleet scale
Beyond security: the same architecture for feature development, refactoring, migration

Gaurav Sharma is the founder of Avyay (अव्यय). ShieldOps is open source at github.com/gaurav21/shieldops. Read more about the platform at avyay.ai/products.

Autonomous Security Remediation: When AI Agents Fix What Dependabot Can’t