Case Study · ShieldOps · June 2026

Autonomous Security Remediation:
When AI Agents Fix What Dependabot Can’t

We pointed ShieldOps at Apache Superset (500K lines of code, 200+ Python dependencies), and Devin produced 3 PRs in under 8 minutes with zero follow-up messages. One of them was a Flask 2→3 major version upgrade, which the policy boundary correctly routed to human review.

  • PRs Shipped: 3
  • Time to First PR: <8m
  • Follow-up Messages: 0
  • Issues Auto-Created: 7

The Security Debt Nobody Talks About

Every engineering team has a security debt problem. Not because they don’t care — because the tooling makes caring nearly impossible.

A scanner runs nightly. It finds 247 CVEs across your dependency tree. An engineer gets assigned. They try the bump, watch 14 tests fail, and close the ticket with “needs investigation.”

The industry numbers most organizations quietly accept:

  • Mean Time to Remediate (MTTR) for critical vulns: 60–90 days
  • Percentage of scanner findings that get fixed: <30%
  • Engineering hours per non-trivial CVE: 2–8 hours
  • Dependabot PRs that merge without manual intervention: ~40%

That last number is the telling one: roughly 60% of Dependabot PRs need manual intervention before they can merge. Dependabot bumps versions. When the build breaks, it walks away, and the engineer is back to square one.

“Detection is solved. Remediation isn’t. The gap between ‘found’ and ‘fixed’ is where security incidents live.”

What We Built

ShieldOps is an autonomous security remediation platform — not a scanner, not a dashboard. It takes vulnerabilities from “detected” to “pull request ready for review” without human intervention.

Architecture: a trust control plane orchestrating three systems:

  1. Devin AI — autonomous coding agent
  2. Datadog — observability for the remediation pipeline
  3. GitHub — source of truth for code, issues, PRs

Scan → Triage → Devin Fleet → Policy Boundary → GitHub PRs + Datadog

6-Stage Pipeline

Stage | What Happens
01 Scan | pip-audit, npm audit, trivy, semgrep
02 Triage | Severity × reachability × fix availability × complexity
03 Devin Fleet | Context-aware prompts, reads CHANGELOGs, fixes breaking call sites
04 Policy Boundary | Auto-merge · Human review · Blocked
05 Evidence Bundle | What changed, why, blast radius, confidence score
06 Datadog | Fleet health, trust split, cost, audit trail
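
The triage stage multiplies four factors. One way to read that is sketched below. Every name, weight, and scale here is an assumption made for illustration; the post does not publish the actual scoring.

```python
from dataclasses import dataclass

# Hypothetical severity weights; the real ones are not stated in the post.
SEVERITY_WEIGHT = {"LOW": 1.0, "MEDIUM": 2.0, "HIGH": 3.0, "CRITICAL": 4.0}


@dataclass(frozen=True)
class Finding:
    package: str
    severity: str        # one of SEVERITY_WEIGHT's keys
    reachable: bool      # is the vulnerable code path actually used?
    fix_available: bool  # does a patched release exist?
    complexity: float    # 0.0 (trivial bump) .. 1.0 (architectural migration)


def triage_score(f: Finding) -> float:
    """Multiply the four factors; a finding with no available fix scores 0."""
    reachability = 1.0 if f.reachable else 0.25  # assumed discount, not zero
    fix = 1.0 if f.fix_available else 0.0        # nothing to remediate yet
    ease = 1.0 - 0.5 * f.complexity              # harder fixes rank somewhat lower
    return SEVERITY_WEIGHT[f.severity] * reachability * fix * ease


def rank(findings: list[Finding]) -> list[Finding]:
    """Highest score first: the order in which sessions would be launched."""
    return sorted(findings, key=triage_score, reverse=True)
```

A product of factors (rather than a sum) means a single disqualifying factor, such as no fix being available, removes a finding from the queue regardless of its severity.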

The Hero Story: Flask 2→3 in 500K Lines of Code

This is the moment that defines what autonomous remediation actually means.

Apache Superset. 500K lines of Python. Flask 2.3.3, which reached end-of-support. Issue #1 in our auto-created triage: CRITICAL.

What Dependabot Does

Opens a PR bumping Flask from 2.3.3 to 3.1.0. The build fails — breaking imports, changed APIs, deprecated patterns. The PR sits red forever. An engineer closes it with “too complex for automated fix.”

What Devin Did

  1. Read the Flask 3.x CHANGELOG and migration guide
  2. Found all version constraints across 5 files
  3. Updated pyproject.toml dependency spec
  4. Updated requirements/base.txt pin
  5. Updated requirements/development.txt
  6. Fixed integration test imports
  7. Fixed security dataset test compatibility
  8. Verified no breaking API call sites remained

# Before (pyproject.toml)
"flask>=2.2.5, <4.0.0"

# After (pyproject.toml)
"flask>=3.1.0, <4.0.0"

The result: PR #10, +11/-12 across 5 files. Clean. Mergeable. No human touched it.
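
Steps 2 through 5 amount to finding every place the package is constrained and raising the floor. Below is a rough sketch of that idea using only the standard library. The helper name and the regex are assumptions; the regex handles just the plain `name>=X` form from the pyproject.toml spec, not `==` pins or extras, and a real tool would more likely parse specifiers with a dedicated library.

```python
import re


def raise_lower_bound(text: str, package: str, new_floor: str) -> tuple[str, int]:
    """Rewrite `package>=OLD` to `package>=new_floor` wherever it appears.

    Returns the new text and the number of constraints changed.
    Hypothetical helper: only the plain `>=` form is handled.
    """
    pattern = re.compile(
        # The lookbehind keeps "flask" from matching inside "my-flask";
        # requiring ">=" right after the name skips "flask-sqlalchemy".
        rf"(?i)(?<![\w.-])({re.escape(package)}\s*>=\s*)[0-9][0-9A-Za-z.]*"
    )
    return pattern.subn(lambda m: m.group(1) + new_floor, text)


spec = '"flask>=2.2.5, <4.0.0"\n"flask-sqlalchemy>=2.5, <3"\n'
updated, changed = raise_lower_bound(spec, "flask", "3.1.0")
print(changed)   # number of constraints rewritten
print(updated)
```

Returning the change count makes it easy to flag files where no constraint was found, which is where a pin in some other form may be hiding.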


“That’s the work only an autonomous coding agent can do — reading the error, understanding the CHANGELOG, fixing the call sites, iterating until green.”

The Other PRs: Not Just the Hero

ShieldOps didn’t just handle the hard one. Here’s the full picture:

PR | Title | What Devin Did | Changes
#8 | Dockerfile Hardening | Pinned base images to SHA256 digests, purged dev packages, added HEALTHCHECK | +20/-4
#9 | Paramiko CVE-2026-44405 | Upgraded 3.5.1→5.0.0, handled breaking changes (GSSAPI removed, DH modulus), verified API compatibility | +8/-4
#10 | Flask 2.3→3.x (Hero) | Major version upgrade, fixed imports across 5 files in 500K LOC codebase | +11/-12

7 Issues Auto-Created with Severity Labels

Severity | Count | Examples
CRITICAL | 1 | Flask EOL upgrade
HIGH | 3 | SQLAlchemy 1.4→2.0, flask-sqlalchemy, npm audit
MEDIUM | 1 | Dockerfile hardening
LOW | 1 | Paramiko CVE (CVSS 3.4)
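
The CVSS 3.4 score on the Paramiko finding falls in LOW under the standard CVSS v3 qualitative rating scale. A small sketch of that mapping follows. The function name is an assumption, and findings without a CVSS score (an end-of-life framework, a Dockerfile hardening task) would need labels from some other rule that the post does not describe.

```python
def cvss_to_label(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS base scores range from 0.0 to 10.0, got {score}")
    if score == 0.0:
        return "NONE"
    if score < 4.0:
        return "LOW"       # 0.1 - 3.9
    if score < 7.0:
        return "MEDIUM"    # 4.0 - 6.9
    if score < 9.0:
        return "HIGH"      # 7.0 - 8.9
    return "CRITICAL"      # 9.0 - 10.0
```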

The Trust Boundary

The VP question: “Is this thing safe to run?”

The goal is not to remove humans but to make their job trivial. Every fix lands in one of three tiers:

  • 🟢 Auto-Merge Ready: tests pass, no breaking changes, high confidence, patch/minor upgrade
  • 🟡 Human Review: major upgrade, breaking changes fixed, sensitive paths touched; ships with a 2-minute evidence bundle
  • 🔴 Blocked: tests fail or confidence too low; nothing merges, alert fires
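
The three tiers can be read as a small routing function. This is a minimal sketch under stated assumptions: the field names and both confidence thresholds are hypothetical, since the post does not give the real cutoffs.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AUTO_MERGE = "auto-merge ready"
    HUMAN_REVIEW = "human review"
    BLOCKED = "blocked"


@dataclass(frozen=True)
class FixResult:
    tests_pass: bool
    confidence: float             # 0.0 .. 1.0
    major_upgrade: bool           # e.g. Flask 2 -> 3
    breaking_changes_fixed: bool  # the agent had to edit call sites
    touches_sensitive_paths: bool


# Hypothetical thresholds, chosen only for illustration.
MIN_CONFIDENCE = 0.5
AUTO_MERGE_CONFIDENCE = 0.9


def route(fix: FixResult) -> Tier:
    # Red tier first: failing tests or low confidence override everything else.
    if not fix.tests_pass or fix.confidence < MIN_CONFIDENCE:
        return Tier.BLOCKED
    # Yellow tier: anything risky goes to a person, even when tests are green.
    if fix.major_upgrade or fix.breaking_changes_fixed or fix.touches_sensitive_paths:
        return Tier.HUMAN_REVIEW
    # Green tier needs high confidence on top of a clean patch/minor bump.
    if fix.confidence >= AUTO_MERGE_CONFIDENCE:
        return Tier.AUTO_MERGE
    return Tier.HUMAN_REVIEW
```

Checking the blocked condition first means a green test run can never promote a low-confidence fix, and the fall-through default is human review rather than auto-merge.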

VP Dashboard

  • Sessions Launched: 3
  • PRs Delivered: 3
  • Hero PR Shipped: ✓
  • Issues Created: 7
  • Breaking Change Handled: 1
  • Follow-up Messages: 0
  • Files Changed: 9
  • Time to First PR: <8m

The Enterprise Use Case

Every enterprise with 50+ repos faces the same math.

Metric | Manual Process | ShieldOps
Cost per CVE fix | $600 (engineer time) | ~$15 (Devin session)
MTTR for critical vulns | 60–90 days | Hours
Remediation coverage | <30% of findings | 80%+
Audit evidence | Manual, inconsistent | Automated, every fix
Scale | Linear with headcount | Concurrent fleet
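
To make the cost row concrete, here is the arithmetic applied to the 247-CVE scan from the opening example. It is a back-of-the-envelope sketch: it assumes every fix costs the flat per-fix figure from the table, and it ignores sessions that fail to converge and the reviewer time spent on human-review PRs.

```python
MANUAL_COST_PER_FIX = 600  # USD, engineer time (from the table)
AGENT_COST_PER_FIX = 15    # USD, approximate Devin session cost (from the table)


def backlog_cost(findings: int, cost_per_fix: float) -> float:
    """Total cost of remediating `findings` items at a flat per-fix cost."""
    return findings * cost_per_fix


findings = 247  # the scanner example from earlier in the post
manual = backlog_cost(findings, MANUAL_COST_PER_FIX)
agent = backlog_cost(findings, AGENT_COST_PER_FIX)
print(f"manual: ${manual:,.0f}  agent: ${agent:,.0f}  ratio: {manual / agent:.0f}x")
# prints "manual: $148,200  agent: $3,705  ratio: 40x"
```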

What CISOs actually care about:

  • Remediation velocity (not scan counts)
  • Evidence for auditors (evidence bundles on every PR)
  • Predictable cost per fix
  • Fleet scaling without headcount
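
An evidence bundle of the kind auditors would consume could be as simple as a serializable record attached to each PR. The field names below are assumptions drawn from the pipeline's description of the bundle (what changed, why, blast radius, confidence score), not the actual ShieldOps schema. The sample values come from PR #10, except the confidence, which is a placeholder.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvidenceBundle:
    pr_number: int
    what_changed: str
    why: str
    blast_radius: list[str] = field(default_factory=list)  # files touched
    confidence: float = 0.0                                # 0.0 .. 1.0

    def to_json(self) -> str:
        """Sorted, indented JSON so bundles can be diffed across runs."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


bundle = EvidenceBundle(
    pr_number=10,
    what_changed="flask>=2.2.5 raised to flask>=3.1.0",
    why="Flask 2.3.3 reached end-of-support",
    # Only the files the post names; the PR touched 5 in total.
    blast_radius=[
        "pyproject.toml",
        "requirements/base.txt",
        "requirements/development.txt",
    ],
    confidence=0.9,  # placeholder; the post does not report the real score
)
print(bundle.to_json())
```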

What Most People Miss

The value isn’t fixing CVEs faster. It’s building a trust layer that lets autonomous agents operate safely in production codebases.

Security remediation is the perfect proving ground:

  • Bounded scope (one CVE, one fix)
  • Measurable success (tests pass or they don’t)
  • Natural trust tiers (auto-merge, human review, blocked)
  • Evidence-rich (every fix has a paper trail)

The pattern — scan → triage → autonomous execution → policy boundary → human oversight — applies to all autonomous engineering. Feature development. Refactoring. Migration. Security is just application #1.

“ShieldOps isn’t a faster vulnerability scanner. It’s a trust control plane for an autonomous engineering workforce.”

Tradeoffs and Honest Assessment

What Doesn’t Work Yet

  • Architectural migrations (SQLAlchemy 1.4→2.0 requires understanding query patterns across entire codebase)
  • Test suite fragility (if existing tests are bad, Devin can’t tell if its fix broke something real)
  • Session failure rate: 15–20% of sessions don’t converge (complexity exceeds agent capability)
  • Cost: Devin sessions aren’t free. At scale, ACU budgeting becomes a real concern.

Where Humans Are Still Essential

  • Setting policy boundaries
  • Reviewing major architectural changes
  • Expanding auto-merge rules as trust data accumulates

Where This Goes

  • CI integration: scan on every merge, not just scheduled
  • Multi-repo fleet: same policy boundary across every repo in the org
  • Expanding auto-merge: as confidence data accumulates, more fixes qualify for auto-merge
  • ACU budgeting: cost optimization at fleet scale
  • Beyond security: the same architecture for feature development, refactoring, migration

Gaurav Sharma is the founder of Avyay (अव्यय). ShieldOps is open source at github.com/gaurav21/shieldops. Read more about the platform at avyay.ai/products.
