Can AI agents modernize legacy code?

Yes, but not by pressing a button. AI coding agents like Claude Code, OpenAI Codex, and Aider can map dependencies, generate test coverage, refactor modules, and update syntax — but they need human guidance on architectural decisions, business logic validation, and integration strategy. McKinsey's LegacyX program reports 40-50% timeline acceleration when AI agents are used for legacy modernization. The key is structured orchestration: break the codebase into modules, assign each to an agent running in an isolated git worktree, and use a shared task board to coordinate the work.

How long does AI-assisted legacy code modernization take?

For a 100K-500K line codebase, a structured four-week sprint can accomplish what traditionally takes 3-6 months. Week 1: codebase mapping and dependency analysis. Week 2: automated test generation (agents can produce 60-80% coverage for code that previously had near-zero tests). Week 3: parallel refactoring with 4-8 agents working on separate modules simultaneously. Week 4: integration, validation, and deployment. Larger codebases (500K-2M+ lines) typically need 2-5 of these four-week cycles, each targeting a different subsystem.

What types of legacy code modernization can AI agents handle?

AI agents are most effective at: (1) Language version upgrades (Python 2→3, Java 8→21, PHP 5→8, Ruby 2→3), (2) Framework migrations (jQuery→React, AngularJS→Angular, Rails 5→7, Django 2→5), (3) Dependency updates and vulnerability remediation, (4) Dead code removal and unused import cleanup, (5) Test generation for untested code, (6) API modernization (REST→GraphQL, SOAP→REST), (7) Code style normalization and linting fixes. They are less effective at: architectural redesigns (monolith→microservices), business logic rewrites, and database schema migrations — those need human architectural judgment.

How much does AI-assisted legacy modernization cost?

For a solo developer or small team using the DIY approach in this guide: $400-1,600 over four weeks. That breaks down to $100-200/month per developer for Claude Code Max or similar tool, plus $50-200/month in additional API tokens for parallel agent sessions. Compare this to hiring a modernization consultancy, which typically costs $150,000-500,000+ for a comparable codebase. The tradeoff: the DIY approach requires developers who understand the legacy system and can validate agent output. The consultancy approach is more hands-off but 100-300x more expensive.

What is the Strangler Fig pattern for legacy modernization?

The Strangler Fig pattern, described by Martin Fowler, is a strategy where you gradually replace a legacy system by building new functionality around it — like a strangler fig tree growing around its host. Instead of rewriting the entire codebase at once (big bang migration), you replace one module at a time while the legacy system continues running. AI agents accelerate this pattern dramatically: each agent can refactor one module in an isolated git worktree while the legacy system stays operational. When the module's tests pass, you merge the refactored version and move to the next module.

Should I use Claude Code, Codex, or Aider for legacy modernization?

Each tool has strengths for different phases. Claude Code (with Opus 4.6) excels at architectural analysis, complex multi-file refactors, and tasks requiring deep reasoning about unfamiliar codebases — making it ideal for Weeks 1 and 3. OpenAI Codex CLI is strong for parallelized, scoped tasks like batch test generation (Week 2). Aider works well for interactive, file-by-file refactoring when you want tight human control. For maximum throughput, use amux to run all three in parallel: Claude Code sessions for complex refactors, Codex for batch operations, and Aider for the modules that need careful human guidance.

How do I prevent AI agents from breaking legacy code during refactoring?

Four safety layers: (1) Git worktree isolation: each agent works in its own worktree, so agents cannot overwrite each other's uncommitted changes (conflicts can still surface at merge time, which is what layers 3 and 4 are for). (2) Characterization tests: before refactoring, generate tests that capture the existing behavior (even if the behavior has bugs, the tests document it). (3) CI gates: every agent PR must pass the full test suite before merge. (4) Incremental merges: merge one module at a time, run integration tests after each merge, and only proceed if everything passes. Together, these layers make it much harder for an agent to silently break the production system, provided every agent PR also gets human review.

May 2026

AI Agents for Legacy Code Modernization: The Developer’s Week-by-Week DIY Guide

Every guide on AI legacy modernization is written by consultancies selling six-figure engagements. This one is written for developers who want to do it themselves. A four-week playbook: map a legacy codebase with AI agents, generate test coverage from near-zero, refactor module by module using the Strangler Fig pattern, and ship — with parallel agent workflows, safety guardrails, and realistic cost estimates for codebases from 100K to 2M+ lines.

40–50%: timeline acceleration with AI-assisted modernization (McKinsey LegacyX, 2026)

$150K+: typical consultancy cost for legacy modernization (enterprise modernization engagements)

$400–1.6K: DIY cost with AI agents over four weeks (based on Claude Code Max pricing)

Why AI Agents Change the Legacy Modernization Equation

Legacy modernization has always been a math problem with terrible numbers. Rewriting a 300K-line codebase takes 12–18 months with a team of 4–6 developers. Hiring a consultancy costs $150,000–500,000+. The project has a 50–70% failure rate because requirements drift during the rewrite. So teams live with the legacy code, adding duct tape until the next person inherits the mess.

AI coding agents break this equation in three ways:

1. Agents read legacy code without complaining

A developer staring at 200K lines of undocumented PHP 5 will spend weeks building a mental model. Claude Code with Opus 4.6 can traverse the codebase file by file, map dependencies, and produce an architectural inventory in hours. It does not need motivation, onboarding, or context-switching time. It will not quit.

2. Agents generate tests for code that was never tested

The biggest blocker in legacy modernization is not the refactoring — it is the lack of tests. You cannot safely refactor code that has no tests. But writing tests for legacy code is mind-numbing, tedious work that developers avoid. AI agents will happily generate hundreds of characterization tests that capture existing behavior, giving you the safety net needed before changing anything.

3. Parallel agents turn a serial project into a parallel one

A human team refactors one module at a time. An amux fleet of 4–8 agents refactors 4–8 modules simultaneously, each in its own git worktree, coordinated through a shared task board. What took 3 months sequentially takes 3 weeks in parallel.

“Agentic AI didn’t just accelerate legacy modernization. It made it economically viable for the first time for companies that aren’t Fortune 500.”

— McKinsey LegacyX program, 2026

What AI Agents Can and Cannot Modernize

AI agents are not magic. They excel at pattern-based transformations and struggle with architectural judgment. Here is the honest breakdown:

AI agents excel at

Language version upgrades — Python 2→3, Java 8→21, PHP 5→8, Ruby 2→3
Framework migrations — jQuery→React, AngularJS→Angular, Rails 5→7, Django 2→5
Dependency updates — resolving outdated packages, fixing breaking changes, removing vulnerabilities
Dead code removal — identifying and removing unused functions, imports, routes, and config
Test generation — writing unit, integration, and characterization tests for untested code
API modernization — SOAP→REST, REST→GraphQL, callback→async/await
Code style normalization — enforcing consistent formatting, naming conventions, and linting rules
Documentation generation — producing inline docs, API references, and architecture diagrams from undocumented code

AI agents struggle with

Architectural redesigns — monolith→microservices requires human judgment about service boundaries, data ownership, and team topology
Business logic rewrites — agents can migrate syntax but cannot validate that a pricing calculation, compliance rule, or workflow is still correct without explicit specs
Database schema migrations — changing data models affects every query, report, and integration; agents need human-defined migration plans
Cross-system integration changes — when the legacy system talks to 15 other services, changing interfaces requires coordination beyond the codebase

The rule of thumb: if the transformation is a well-defined pattern with clear before/after states (Python 2 print statements → Python 3 print() functions), agents handle it reliably. If the transformation requires understanding why the business logic exists and whether it should change, you need a human making the call and an agent executing it.
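As a toy illustration of what a "well-defined pattern with clear before/after states" looks like, the sketch below rewrites only the simplest form of the Python 2 print statement. The function name `upgrade_print` and the regular expression are illustrative assumptions, not part of any tool named in this guide; a real migration would rely on an agent or a syntax-aware tool rather than a regex.

```python
import re

# Matches a simple Python 2 print statement: optional indent, "print", whitespace, arguments.
# Deliberately skips "print >>stream, x"; that form needs a real tool or a human.
_PRINT_STMT = re.compile(r"^(\s*)print\s+(?!>>)(.+?)\s*$")


def upgrade_print(line: str) -> str:
    """Rewrite `print x` as `print(x)`; return every other line unchanged."""
    match = _PRINT_STMT.match(line)
    if match is None:
        return line
    indent, args = match.groups()
    if args.startswith("(") and args.endswith(")"):
        return line  # already looks like a function call
    if args.endswith(","):
        return line  # trailing comma suppresses the newline in Python 2; not handled here
    return f"{indent}print({args})"


if __name__ == "__main__":
    for old in ['print "hello"', "    print x, y", "print >>sys.stderr, msg"]:
        print(repr(old), "->", repr(upgrade_print(old)))
```

The cases the function refuses to touch are the interesting part: anything outside the narrow pattern is left for review, which mirrors the split between agent-driven and human-driven changes described above.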

Tool Selection: Which Agent for Which Phase

Phase	Best tool	Why	Alternative
Codebase mapping	Claude Code	Opus 4.6 excels at reading and reasoning about unfamiliar codebases. 200K context window handles large files.	Cursor (visual navigation)
Dependency analysis	Claude Code + ast-grep	Combine AI reasoning with AST-based search for structural dependency mapping	Dependabot (for package deps)
Test generation	Claude Code or Codex	Both handle batch test generation well. Run in parallel for maximum throughput.	Aider (interactive test refinement)
Module refactoring	Claude Code	Deep reasoning needed. Multi-file changes. Worktree isolation built in.	Cursor v3 (visual diff review)
Security scanning	Semgrep + Snyk	Static analysis tools catch what agents miss. Run as CI gates on every agent PR.	SonarQube
Orchestration	amux	Shared task board, inter-session messaging, mobile monitoring, crash recovery	DIY tmux + scripts

Week 1: Map the Codebase

Week 1 — Days 1–2

Architectural Inventory

Before changing anything, you need to know what you have. Launch a Claude Code session and ask it to produce an architectural inventory of the codebase. This is the single most valuable task for an AI agent in legacy work — what takes a developer weeks of code reading takes an agent hours.

The inventory should include:

Module map — every major directory, its purpose, and its entry points
Dependency graph — which modules depend on which, including circular dependencies
Technology inventory — language versions, frameworks, libraries, and their ages
Test coverage — which modules have tests and which have zero coverage
Dead code candidates — functions, routes, and files that appear unreachable

## Example prompt for Claude Code

Read this entire codebase and produce an architectural inventory. For each
top-level directory/module:
1. Its purpose (one sentence)
2. Key files and entry points
3. Dependencies (other modules it imports from)
4. External dependencies (third-party packages)
5. Test coverage (has tests / no tests / partial)
6. Language version and framework version

Then identify: circular dependencies, dead code candidates (unreachable
functions/routes), and the 5 riskiest modules (most complex, least tested,
most dependencies).

Sessions: 1 Claude Code session. Estimated time: 2–4 hours for a 200K-line codebase. Cost: ~$5–15 in tokens.
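A deterministic script is a useful cross-check on the agent's inventory. The following is a minimal sketch, assuming a Python 3 codebase in which each top-level directory is one module; the helper names (`module_imports`, `build_graph`, `find_cycle`) are hypothetical, and files that fail to parse (for example Python 2 syntax) are simply skipped.

```python
import ast
from pathlib import Path
from typing import Dict, List, Optional, Set


def module_imports(source: str) -> Set[str]:
    """Return the top-level package names imported by a piece of Python source."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def build_graph(root: Path) -> Dict[str, Set[str]]:
    """Map each top-level directory under root to the sibling directories it imports."""
    modules = {p.name for p in root.iterdir() if p.is_dir()}
    graph: Dict[str, Set[str]] = {name: set() for name in modules}
    for name in modules:
        for path in (root / name).rglob("*.py"):
            try:
                imported = module_imports(path.read_text(encoding="utf-8", errors="ignore"))
            except SyntaxError:
                continue  # unparseable file; a fuller inventory would record it
            graph[name] |= (imported & modules) - {name}
    return graph


def find_cycle(graph: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of module names, or None if there is none."""
    path: List[str] = []      # modules on the current depth-first path
    finished: Set[str] = set()

    def visit(node: str) -> Optional[List[str]]:
        if node in finished:
            return None
        if node in path:
            return path[path.index(node):] + [node]
        path.append(node)
        for dep in sorted(graph.get(node, ())):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        finished.add(node)
        return None

    for start in sorted(graph):
        cycle = visit(start)
        if cycle:
            return cycle
    return None
```

Comparing this graph with the agent's written inventory is a quick way to spot dependencies the agent missed or invented.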

Week 1 — Days 3–4

Dependency Analysis and Dead Code Removal

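Use ast-grep and Claude Code together to map internal dependencies at the function/class level. Then use an agent to remove confirmed dead code. This is the safest first change, because deleting code that never executes does not change runtime behavior. The catch is the confirmation step: static analysis can miss code reached through reflection, dynamic dispatch, string-based routing, or scheduled jobs, so check logs or usage data before deleting anything that only looks unreachable.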

Run 2–3 parallel agents via amux, each targeting a different module’s dead code:

## Agent 1: Remove dead code in /src/legacy-auth/
## Agent 2: Remove dead code in /src/legacy-billing/
## Agent 3: Remove dead code in /src/legacy-reports/

Each agent: analyze imports, find unreachable functions, remove them,
run existing tests to confirm nothing breaks. Commit each removal
separately with a clear message.

Sessions: 2–3 parallel Claude Code sessions. Estimated time: 4–8 hours. Cost: ~$10–30.
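One way to produce a first list of dead code candidates before handing modules to agents is a conservative static pass. This is a sketch under stated assumptions: Python 3 sources, a hypothetical helper named `dead_function_candidates`, and a deliberately cautious rule that treats any string literal equal to a function name as a possible dynamic reference. Its output is a list of candidates to confirm, not a deletion list.

```python
import ast
from typing import Dict, Set


def dead_function_candidates(sources: Dict[str, str]) -> Set[str]:
    """Return names of functions defined in `sources` that are never referenced.

    `sources` maps a file name to its source text. A name counts as referenced
    if it appears as an identifier, an attribute, an imported name, or a whole
    string literal (to stay conservative about getattr-style lookups).
    """
    defined: Set[str] = set()
    referenced: Set[str] = set()
    for source in sources.values():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                defined.add(node.name)
            elif isinstance(node, ast.Name):
                referenced.add(node.id)
            elif isinstance(node, ast.Attribute):
                referenced.add(node.attr)
            elif isinstance(node, ast.alias):
                referenced.add(node.name.split(".")[-1])
            elif isinstance(node, ast.Constant) and isinstance(node.value, str):
                referenced.add(node.value)
    return defined - referenced


if __name__ == "__main__":
    example = {
        "a.py": "def used():\n    return 1\n\ndef unused():\n    return 2\n",
        "b.py": "from a import used\nprint(used())\n",
    }
    print(dead_function_candidates(example))
```

Name-based matching cannot tell two functions with the same name apart, so it under-reports rather than over-reports, which is the safer direction for a deletion task.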

Week 1 — Day 5

Modernization Plan

With the architectural inventory and clean dependency graph in hand, produce the modernization plan. This is the one step that must be human-driven — the agent provides data, you make the decisions:

Module priority order — which modules to modernize first (start with the least-coupled, best-tested modules)
Migration targets — what each module should look like after modernization (new framework, new API style, new patterns)
Risk assessment — which modules need careful human review vs. which can be agent-driven
Parallelism plan — which modules can be refactored simultaneously (no shared dependencies)

Sessions: 1 Claude Code session (for generating the plan draft). Human time: 2–4 hours reviewing and adjusting.
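The "least-coupled, best-tested first" rule can be turned into a draft ordering for the human to adjust. The sketch below is illustrative only: `ModuleStats`, its fields, and the tie-breaking rule are assumptions, and the numbers would come from the Week 1 inventory.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ModuleStats:
    name: str
    dependents: int     # how many other modules import this one
    dependencies: int   # how many other modules this one imports
    coverage: float     # fraction of lines covered by tests, 0.0 to 1.0


def priority_order(modules: Iterable[ModuleStats]) -> List[str]:
    """Order modules by lowest coupling first, then highest coverage, then name."""
    ranked = sorted(
        modules,
        key=lambda m: (m.dependents + m.dependencies, -m.coverage, m.name),
    )
    return [m.name for m in ranked]


if __name__ == "__main__":
    inventory = [
        ModuleStats("billing", dependents=5, dependencies=4, coverage=0.2),
        ModuleStats("config", dependents=1, dependencies=0, coverage=0.9),
        ModuleStats("reports", dependents=1, dependencies=0, coverage=0.4),
    ]
    print(priority_order(inventory))
```

The output is a starting point for the planning session, not a decision; risk and business context still belong to the reviewer.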

Week 2: Generate Test Coverage

This is the most important week. You cannot safely refactor code that has no tests. Week 2 is not glamorous, but it is the foundation for everything that follows. Every hour invested in test generation saves 3–5 hours of debugging during refactoring.

Week 2 — Days 1–3

Characterization Tests

Characterization tests capture what the code actually does, not what it should do. Even if the code has bugs, the tests document the current behavior. This is critical: during refactoring, you need to know whether a behavior change was intentional or accidental.

Run 4–6 agents in parallel, each generating characterization tests for a different module:

## Prompt template for each agent

Read /src/MODULE_NAME/ and generate characterization tests using
FRAMEWORK (pytest/Jest/PHPUnit).

For every public function and API endpoint:
1. Write a test that captures the current input/output behavior
2. Include edge cases you can identify from the code
3. Test error paths and exception handling
4. Do NOT fix bugs — document the current behavior, even if it looks wrong
5. Add a comment "# Characterization test — captures legacy behavior"
   to distinguish from intentional tests

Run the tests and ensure they all pass against the current code.

Use pytest for Python, Jest for JavaScript/TypeScript, PHPUnit for PHP, JUnit for Java. Match the testing framework to the ecosystem.
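For readers who have not written one before, here is what a characterization test might look like in pytest style. The function `legacy_discount` is a hypothetical stand-in for real legacy code; the point is that the tests pin down what it does today, including behavior that looks like a bug.

```python
def legacy_discount(total_cents: int, customer_tier: str) -> int:
    """Hypothetical legacy function: applies a percentage discount by tier."""
    rates = {"gold": 15, "silver": 10}
    return total_cents - total_cents * rates.get(customer_tier, 0) // 100


# Characterization test - captures legacy behavior
def test_gold_discount_rounds_the_discount_down():
    # 15% of 999 is 149.85; the legacy code drops the fraction.
    assert legacy_discount(999, "gold") == 850


# Characterization test - captures legacy behavior
def test_tier_lookup_is_case_sensitive():
    # "Gold" gets no discount. This looks like a bug, but it is current behavior.
    assert legacy_discount(1000, "Gold") == 1000


# Characterization test - captures legacy behavior
def test_negative_totals_floor_toward_minus_infinity():
    # Floor division on a negative product yields a discount of -150, not -149.
    assert legacy_discount(-999, "gold") == -849
```

If a refactor later changes any of these values, the failure forces an explicit decision: regression, or intentional fix.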

Sessions: 4–6 parallel agents via amux. Estimated time: 8–16 hours. Cost: ~$30–80. Expected result: 60–80% code coverage for modules that previously had near-zero coverage.

Week 2 — Days 4–5

Integration Tests and Human Review

Characterization tests cover individual modules. Integration tests cover the interactions between modules — this is where legacy systems hide their worst bugs. Generate integration tests for the critical paths identified in Week 1’s architectural inventory.

Then spend Day 5 reviewing the generated tests. Look for:

False confidence — tests that always pass because they test the wrong thing
Missing edge cases — boundary conditions that the agent could not infer from code alone
Business logic gaps — tests that capture the how but miss the why (add intent comments yourself)

Sessions: 2–3 parallel agents for integration tests. Human review: 4–6 hours on Day 5.

Do not skip test review. AI-generated tests can give false confidence. A SonarSource study found that 48% of developers do not verify AI-generated code before committing. For legacy modernization, unreviewed tests are worse than no tests — they create a false safety net that hides regressions. Spend the time on Day 5. It is the highest-leverage review of the entire project.

Week 3: Refactor Module by Module

This is the Strangler Fig pattern in action. Each agent refactors one module in its own git worktree while the legacy system continues running. When the module’s tests pass, you merge the refactored version. The legacy system gradually transforms into the modern one, one module at a time.
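In many Strangler Fig setups a thin routing layer decides, per module, whether a call goes to the legacy or the modernized implementation. The sketch below shows that idea in miniature; `StranglerRouter` and its method names are hypothetical, and a real system would usually route at the HTTP, proxy, or feature-flag level instead.

```python
from typing import Callable, Dict

Handler = Callable[..., object]


class StranglerRouter:
    """Send each call to the modernized handler if one is registered, else to legacy."""

    def __init__(self, legacy: Dict[str, Handler]) -> None:
        self._legacy = dict(legacy)
        self._modern: Dict[str, Handler] = {}

    def migrate(self, module: str, handler: Handler) -> None:
        """Switch a module to its modernized handler (after its tests pass)."""
        if module not in self._legacy:
            raise KeyError(f"unknown module: {module}")
        self._modern[module] = handler

    def rollback(self, module: str) -> None:
        """Send a module's traffic back to the legacy handler."""
        self._modern.pop(module, None)

    def handle(self, module: str, *args, **kwargs):
        handler = self._modern.get(module, self._legacy[module])
        return handler(*args, **kwargs)


if __name__ == "__main__":
    router = StranglerRouter({"auth": lambda user: f"legacy auth for {user}"})
    print(router.handle("auth", "ana"))
    router.migrate("auth", lambda user: f"modern auth for {user}")
    print(router.handle("auth", "ana"))
```

Keeping `rollback` as cheap as `migrate` is the design point: each module can be switched back independently if production behavior diverges.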

Week 3 — Days 1–4

Parallel Refactoring Sprint

Launch 4–8 agents, each in its own git worktree, each refactoring a different module according to the modernization plan from Week 1:

## Example: 6 agents refactoring 6 modules

Agent 1 (worktree: wt-auth)    → Refactor /src/auth/ from PHP 5 → PHP 8
Agent 2 (worktree: wt-billing) → Refactor /src/billing/ from jQuery → React
Agent 3 (worktree: wt-reports) → Refactor /src/reports/ from callbacks → async/await
Agent 4 (worktree: wt-api)     → Refactor /src/api/ from REST v1 → REST v3
Agent 5 (worktree: wt-models)  → Refactor /src/models/ from raw SQL → ORM
Agent 6 (worktree: wt-config)  → Refactor /src/config/ from env files → typed config

Each agent:
1. Create a feature branch in its worktree
2. Refactor the module according to the spec
3. Run the module's characterization tests — they MUST still pass
4. Run the integration tests that touch this module
5. Commit with a clear message describing the migration

Monitor all sessions from the amux dashboard or the mobile PWA. When an agent finishes, review its diff and either approve or redirect.
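If you are creating the worktrees by hand rather than through an orchestrator, the setup can be scripted. This is a minimal sketch: the `wt-<name>` directory naming follows the example above, while the `modernize/<name>` branch prefix, the `main` base branch, and the helper names are assumptions. It uses the standard `git worktree add -b <branch> <path> <commit-ish>` form.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List


def worktree_command(repo: Path, name: str, base: str = "main") -> List[str]:
    """Build the git command that creates worktree wt-<name> on a new branch."""
    worktree_path = repo.parent / f"wt-{name}"
    return [
        "git", "-C", str(repo),
        "worktree", "add",
        "-b", f"modernize/{name}",   # hypothetical branch naming scheme
        str(worktree_path),
        base,
    ]


def create_worktrees(
    repo: Path,
    names: Iterable[str],
    base: str = "main",
    run: Callable[..., object] = subprocess.run,
) -> None:
    """Create one worktree per module; stops at the first git failure."""
    for name in names:
        run(worktree_command(repo, name, base), check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for module in ["auth", "billing", "reports"]:
        print(" ".join(worktree_command(Path("/path/to/legacy-app"), module)))
```

Passing `run` as a parameter keeps the function easy to dry-run or test without touching a real repository.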

Sessions: 4–8 parallel agents. Estimated time: 16–32 hours of agent runtime (2–4 days with overnight runs). Human time: 2–3 hours/day reviewing diffs. Cost: ~$80–200.

Week 3 — Day 5

Conflict Resolution and Second Pass

Some modules will have shared interfaces. When Agent 2 changes a function signature that Agent 4 also calls, you get a conflict. This is expected and manageable:

Merge the module that the most other modules depend on first (usually the models/data layer)
Rebase other agent branches onto the updated main
Send each rebased agent a message to fix the conflicts: “The function signature for get_user() changed. Update your module to match the new interface and re-run tests.”
Run the full integration test suite after each merge

Sessions: 2–4 agents for conflict resolution. Human time: 4–6 hours for merge order decisions and integration testing.

Week 4: Integrate, Validate, and Ship

Week 4 — Days 1–2

Full Integration Testing

With all refactored modules merged, run the complete test suite. Fix failures using focused agent sessions — each failure goes to an agent with the specific test, the error message, and the relevant module code.

Run Semgrep and Snyk security scans against the refactored codebase. AI agents can introduce subtle security regressions during refactoring — static analysis catches what tests miss.

Week 4 — Days 3–4

Documentation and Migration Guide

Generate documentation for the modernized codebase. This is where AI agents shine — they are tireless documenters. Run 2–3 agents:

Agent 1: Generate API documentation for all public interfaces
Agent 2: Write a migration guide documenting every breaking change
Agent 3: Update README, CHANGELOG, and deployment instructions

Week 4 — Day 5

Ship

Deploy the modernized codebase. For most teams, this means a blue-green deployment or feature-flagged rollout, so the legacy system stays running until the modernized version is validated in production. Before cutting over, confirm the following:

All characterization tests pass
All integration tests pass
Security scans clean (or known issues documented)
Migration guide reviewed and approved
Rollback plan documented and tested
Monitoring/alerting configured for the new code paths

Parallel Agent Orchestration

The difference between “using AI for legacy modernization” and “finishing a legacy modernization in four weeks” is parallelism. Here is how to orchestrate it:

Git worktree isolation

Every agent must work in its own git worktree. This is non-negotiable — without it, agents will overwrite each other’s changes. Claude Code has built-in worktree support. With amux, each session automatically gets its own worktree.

Shared task board

Use amux’s kanban board to track which module each agent is working on. Statuses: todo (not started), doing (agent working), review (waiting for human review), done (merged). This gives you a single view of the entire modernization sprint — from your desktop or from the mobile PWA on your phone.
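The four statuses form a small state machine. The sketch below models it generically; it is not amux's API, and the `Board` class and the rule that a card in review can go back to doing (when you redirect an agent) are assumptions for illustration.

```python
from enum import Enum
from typing import Dict, Iterable, List


class Status(Enum):
    TODO = "todo"      # not started
    DOING = "doing"    # agent working
    REVIEW = "review"  # waiting for human review
    DONE = "done"      # merged


ALLOWED = {
    Status.TODO: {Status.DOING},
    Status.DOING: {Status.REVIEW},
    Status.REVIEW: {Status.DONE, Status.DOING},  # approve, or send back to the agent
    Status.DONE: set(),
}


class Board:
    """Track one card per module and reject transitions that skip a step."""

    def __init__(self, modules: Iterable[str]) -> None:
        self.cards: Dict[str, Status] = {m: Status.TODO for m in modules}

    def move(self, module: str, new_status: Status) -> None:
        current = self.cards[module]
        if new_status not in ALLOWED[current]:
            raise ValueError(f"{module}: cannot move {current.value} -> {new_status.value}")
        self.cards[module] = new_status

    def in_status(self, status: Status) -> List[str]:
        return sorted(m for m, s in self.cards.items() if s is status)
```

Rejecting a jump from doing straight to done is what makes the review column meaningful: nothing merges without passing through a human.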

Merge order protocol

Not all modules can be merged in any order. The rule: merge bottom-up through the dependency graph. Data models and shared utilities first, then the modules that depend on them, then top-level routes and controllers last. This minimizes rebase conflicts and ensures each merge has a stable foundation.
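A bottom-up merge order can be computed from the dependency graph with Python's standard `graphlib` module. In this sketch the function name `merge_waves` and the example modules are hypothetical; the input maps each module to the modules it depends on, and each returned "wave" contains modules whose dependencies have all been merged already.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def merge_waves(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group modules into merge waves, dependencies first.

    Raises graphlib.CycleError if the graph contains a circular dependency.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are merged
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    graph = {
        "models": set(),
        "utils": set(),
        "auth": {"models"},
        "billing": {"models", "utils"},
        "routes": {"auth", "billing"},
    }
    for number, wave in enumerate(merge_waves(graph), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Modules in the same wave do not depend on each other, so they are also candidates for parallel refactoring; a `CycleError` is a signal to break the circular dependency before assigning work.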

The orchestration overhead is small. For a 6-agent parallel refactoring sprint, expect 2–3 hours per day of human overhead: reviewing agent diffs, resolving merge conflicts, and validating test results. The agents do 90% of the work; you make the judgment calls.

Cost Breakdown

DIY with AI agents

4-week modernization sprint

$400 – $1,600

Claude Code Max: $100–200/mo per developer
Additional API tokens for parallel agents: $50–200/mo
Static analysis tools (Semgrep, Snyk): $0–100/mo (free tiers available)
Human time: 40–60 hours over 4 weeks (review, decisions, validation)

Modernization consultancy

Comparable engagement

$150K – $500K+

Team of 4–6 consultants: $200–350/hour
Duration: 3–6 months
Project management overhead: 20–30% of budget
Risk of scope creep and timeline overrun: high

Cost estimates assume a 200K–500K line codebase. For larger codebases (1M+), multiply the DIY cost by the number of 4-week cycles needed (typically 3–5 cycles, per the scaling table below). Consultancy costs scale roughly linearly with codebase size.

Safety Guardrails

The number one risk in AI-assisted legacy modernization is breaking production. These four safety layers prevent it:

Layer 1: Characterization tests (from Week 2)

Every refactored module must pass its characterization tests. If a test fails after refactoring, either the refactoring introduced a regression (fix it) or the test was capturing a bug that the refactoring correctly eliminated (update the test with a comment explaining why).

Layer 2: Git worktree isolation

Agents cannot touch each other’s work. Each agent operates in its own worktree on its own branch. Changes only reach main through reviewed, tested pull requests.

Layer 3: CI gates

Every agent PR triggers the full test suite, Semgrep security scan, and ESLint/SonarQube code quality checks. The PR cannot merge if any gate fails. This is the same CI pipeline you use for human PRs — agents do not get special treatment.

Layer 4: Incremental merges

Merge one module at a time. Run integration tests after each merge. Only proceed to the next module if everything passes. If a merge breaks integration, revert it, fix the agent’s output, and re-merge. This sounds slow, but it is far faster than debugging a big-bang merge of 6 modules at once.
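The merge-test-revert loop can be expressed as a short driver. This is a sketch, not a full release script: `merge_incrementally` is a hypothetical name, and the `merge`, `integration_tests_pass`, and `revert` callables stand in for whatever git and CI commands your team uses.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def merge_incrementally(
    modules: Iterable[str],
    merge: Callable[[str], object],
    integration_tests_pass: Callable[[], bool],
    revert: Callable[[str], object],
) -> Tuple[List[str], Optional[str]]:
    """Merge modules one at a time, stopping at the first integration failure.

    Returns (modules merged successfully, module that failed or None).
    """
    merged: List[str] = []
    for module in modules:
        merge(module)
        if not integration_tests_pass():
            revert(module)          # restore the last known-good state
            return merged, module   # stop: fix this module before continuing
        merged.append(module)
    return merged, None


if __name__ == "__main__":
    state: List[str] = []
    result = merge_incrementally(
        ["models", "billing", "routes"],
        merge=state.append,
        integration_tests_pass=lambda: "billing" not in state,  # simulate a failure
        revert=state.remove,
    )
    print(result)
```

Stopping at the first failure keeps the search space to a single module's diff, which is the practical advantage over a big-bang merge.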

Claude Code hooks for legacy refactoring

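Add hooks that enforce safety during refactoring sessions. The example below assumes a Python codebase and that jq is installed. It uses PostToolUse rather than PreToolUse because a pre-write hook runs before the new contents reach disk and so cannot validate them. The hook reads the edited file's path from the JSON that Claude Code passes on stdin and exits with code 2 on a syntax error, which reports the error back to the agent. Hook schemas change between releases, so verify this against the current Claude Code hooks reference:

// .claude/settings.json, PostToolUse hook (delete this line; JSON does not allow comments)
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Write|Edit",
      "hooks": [{
        "type": "command",
        "command": "f=$(jq -r '.tool_input.file_path'); case \"$f\" in *.py) python3 -c 'import ast,sys; ast.parse(open(sys.argv[1]).read())' \"$f\" || exit 2 ;; esac"
      }]
    }]
  }
}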

See the safety checklist for more enforcement patterns.

Scaling Beyond 500K Lines

For codebases above 500K lines, a single four-week sprint is not enough. Scale with repeated cycles:

Codebase size	Cycles	Total timeline	Parallel agents	Estimated cost (DIY)
100K–200K lines	1 cycle	4 weeks	4–6	$400–800
200K–500K lines	1–2 cycles	4–8 weeks	6–8	$800–1,600
500K–1M lines	2–3 cycles	8–12 weeks	8–10	$1,600–3,200
1M–2M+ lines	3–5 cycles	12–20 weeks	10+	$3,200–8,000

Each cycle targets a different subsystem. Use the first cycle’s architectural inventory to prioritize: start with the subsystem that has the most technical debt, the most pain, or the most upcoming feature work that would benefit from a modern foundation.

For codebases above 1M lines, consider Red Hat’s agent mesh approach: specialized agents for different roles (code analysis, test generation, refactoring, security review) rather than generalist agents that do everything. amux’s shared board coordinates these specialist agents through task assignment and status tracking.

Anti-Patterns That Kill Legacy Modernization Projects

Big bang rewrite

Rewriting the entire codebase at once, then switching over. This fails 50–70% of the time because requirements drift during the rewrite, the new system has its own bugs, and there is no incremental validation. AI agents make this temptation worse (“it’s so fast, let’s just rewrite everything!”). Use the Strangler Fig pattern instead. Module by module. One merge at a time.

Refactoring without tests

Skipping Week 2 because “we’ll just run it and see if it works.” Legacy code has implicit behavior that is not obvious from reading the code. Without characterization tests, you will not know whether a refactoring changed behavior until a customer reports a bug. Test generation is the foundation. Do not skip it.

Trusting agent output without review

AI agents can produce plausible-looking code that passes tests but still introduces subtle regressions. Cognitive debt research shows that AI-co-authored code has 1.7x more issues and 2.74x more security vulnerabilities. Every agent PR gets human review. The review can be faster than reviewing human code (agents produce consistent style), but it cannot be skipped.

Modernizing everything at once

Trying to upgrade the language version AND migrate the framework AND restructure the architecture in one pass. Each of these is a separate project. Stack them sequentially: language upgrade first (smallest blast radius), framework migration second, architecture changes last. AI agents are fast, but compounding three types of change in one refactoring makes failures undiagnosable.

No merge order discipline

Merging refactored modules in random order creates cascading conflicts. Always merge bottom-up through the dependency graph: shared utilities and data models first, then the modules that depend on them, then top-level entry points last. amux’s kanban board makes this visible: columns for todo, doing, review, and done map directly to the merge pipeline.

FAQ

What languages work best for AI-assisted legacy modernization?

Python, JavaScript/TypeScript, and Java have the best results because AI models have seen the most training data for these languages. PHP, Ruby, and C# also work well. COBOL, Fortran, and other rare languages produce mixed results — models can read the code but struggle with idiomatic transformations. For rare languages, use the agent for analysis and test generation, but handle the refactoring more interactively with Aider where you guide each change.

Can I use this approach for a monolith-to-microservices migration?

Partially. AI agents can identify service boundaries, extract modules, and set up new service scaffolding. But the architectural decisions (how to split the data model, how to handle cross-service transactions, what communication pattern to use) require human judgment. Use this four-week approach for each service extraction: map the module, generate its tests, refactor it into a standalone service, and validate. The overall microservices strategy is a multi-month project with this guide applied in cycles.

How do I handle a codebase with zero existing tests?

This is the most common scenario for legacy code, and it is exactly what Week 2 addresses. Start by installing a test framework (pytest, Jest, etc.) and running the first characterization test agent on the simplest module. Expect to iterate: the first round of generated tests will have issues, but each module gets easier as the agent builds familiarity with the codebase’s patterns. Aim for 60–80% coverage on critical paths, not 100% on everything.

What if the legacy codebase is too large for an AI context window?

Claude Code does not need the whole codebase in context at once. It reads files on demand and compacts older conversation context as a session grows. For the Week 1 architectural inventory, point the agent at the top-level directory and let it traverse the tree, building its understanding incrementally. For very large codebases (1M+ lines), split the inventory task across multiple agents, each covering a different subsystem. See our context engineering guide for strategies on managing agent context at scale.
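When splitting the inventory across agents, a simple way to balance the load is to assign the largest subsystems first, always to the agent with the least work so far. The sketch below assumes you already have approximate line counts per subsystem; `split_inventory` and the example figures are hypothetical.

```python
import heapq
from typing import Dict, List


def split_inventory(line_counts: Dict[str, int], agents: int) -> List[List[str]]:
    """Greedily assign subsystems to agents, largest first, least-loaded agent next."""
    if agents < 1:
        raise ValueError("need at least one agent")
    heap = [(0, index) for index in range(agents)]  # (lines assigned, agent index)
    buckets: List[List[str]] = [[] for _ in range(agents)]
    by_size = sorted(line_counts.items(), key=lambda item: (-item[1], item[0]))
    for name, lines in by_size:
        load, index = heapq.heappop(heap)
        buckets[index].append(name)
        heapq.heappush(heap, (load + lines, index))
    return buckets


if __name__ == "__main__":
    counts = {"billing": 400_000, "reports": 300_000, "api": 250_000,
              "auth": 150_000, "admin": 100_000}
    for number, bucket in enumerate(split_inventory(counts, agents=2), start=1):
        print(f"agent {number}: {', '.join(bucket)}")
```

Line count is only a proxy for effort; tightly coupled subsystems are usually better kept with the same agent even if that unbalances the split.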

Is this approach suitable for regulated industries (healthcare, finance)?

Yes, with additional guardrails. For regulated codebases: (1) If data cannot leave your network, run agents against a self-hosted model (for example, Aider pointed at a locally served model); if your compliance rules allow a private cloud endpoint inside your own account, a managed service such as AWS Bedrock is another option, (2) Add compliance-specific characterization tests in Week 2 (regulatory calculations, audit trail behavior, data handling paths), (3) Require two-person review for agent PRs that touch compliance-critical modules, and (4) Keep a detailed audit log of every agent-generated change (amux session logs provide this automatically). See our security hardening guide for more.

How does this compare to Morgan Stanley’s approach?

Morgan Stanley built a custom AI tool that converted 9 million lines of code, saving an estimated 280,000 developer hours. Their approach was enterprise-scale: custom-trained models, dedicated infrastructure, months of development on the tooling itself. The approach in this guide uses off-the-shelf AI coding agents and is designed for teams of 1–5 developers working on 100K–2M line codebases. You trade Morgan Stanley’s scale for speed and cost: they spent millions building bespoke tooling; you spend $400–1,600 using existing tools in a structured workflow.