May 2026

AI Agents for Legacy Code Modernization: The Developer’s Week-by-Week DIY Guide

Most guides on AI legacy modernization are written by consultancies selling six-figure engagements. This one is written for developers who want to do it themselves. A four-week playbook: map a legacy codebase with AI agents, generate test coverage from near-zero, refactor module by module using the Strangler Fig pattern, and ship, with parallel agent workflows, safety guardrails, and realistic cost estimates for codebases from 100K to 2M+ lines.

40–50%: timeline acceleration with AI-assisted modernization
$150K+: typical consultancy cost for legacy modernization (enterprise modernization engagements)
$400–1,600: DIY cost with AI agents over four weeks (based on Claude Code Max pricing)

Why AI Agents Change the Legacy Modernization Equation

Legacy modernization has always been a math problem with terrible numbers. Rewriting a 300K-line codebase takes 12–18 months with a team of 4–6 developers. Hiring a consultancy costs $150,000–500,000+. The project has a 50–70% failure rate because requirements drift during the rewrite. So teams live with the legacy code, adding duct tape until the next person inherits the mess.

AI coding agents break this equation in three ways:

1. Agents read legacy code without complaining

A developer staring at 200K lines of undocumented PHP 5 will spend weeks building a mental model. Claude Code with Opus 4.6 can traverse the codebase, reading files on demand, map dependencies, and produce an architectural inventory in hours. It does not need motivation, onboarding, or context-switching time. It will not quit.

2. Agents generate tests for code that was never tested

The biggest blocker in legacy modernization is not the refactoring — it is the lack of tests. You cannot safely refactor code that has no tests. But writing tests for legacy code is mind-numbing, tedious work that developers avoid. AI agents will happily generate hundreds of characterization tests that capture existing behavior, giving you the safety net needed before changing anything.

3. Parallel agents turn a serial project into a parallel one

A human team refactors one module at a time. An amux fleet of 4–8 agents refactors 4–8 modules simultaneously, each in its own git worktree, coordinated through a shared task board. What took 3 months sequentially takes 3 weeks in parallel.

“Agentic AI didn’t just accelerate legacy modernization. It made it economically viable for the first time for companies that aren’t Fortune 500.”

What AI Agents Can and Cannot Modernize

AI agents are not magic. They excel at pattern-based transformations and struggle with architectural judgment. Here is the honest breakdown:

AI agents excel at

AI agents struggle with

The rule of thumb: if the transformation is a well-defined pattern with clear before/after states (Python 2 print statements → Python 3 print() functions), agents handle it reliably. If the transformation requires understanding why the business logic exists and whether it should change, you need a human making the call and an agent executing it.

Tool Selection: Which Agent for Which Phase

| Phase | Best tool | Why | Alternative |
|---|---|---|---|
| Codebase mapping | Claude Code | Opus 4.6 excels at reading and reasoning about unfamiliar codebases. 200K context window handles large files. | Cursor (visual navigation) |
| Dependency analysis | Claude Code + ast-grep | Combine AI reasoning with AST-based search for structural dependency mapping | Dependabot (for package deps) |
| Test generation | Claude Code or Codex | Both handle batch test generation well. Run in parallel for maximum throughput. | Aider (interactive test refinement) |
| Module refactoring | Claude Code | Deep reasoning needed. Multi-file changes. Worktree isolation built in. | Cursor v3 (visual diff review) |
| Security scanning | Semgrep + Snyk | Static analysis tools catch what agents miss. Run as CI gates on every agent PR. | SonarQube |
| Orchestration | amux | Shared task board, inter-session messaging, mobile monitoring, crash recovery | DIY tmux + scripts |

Week 1: Map the Codebase

Week 1 — Days 1–2

Architectural Inventory

Before changing anything, you need to know what you have. Launch a Claude Code session and ask it to produce an architectural inventory of the codebase. This is the single most valuable task for an AI agent in legacy work — what takes a developer weeks of code reading takes an agent hours.

The inventory should include:

## Example prompt for Claude Code

Read this entire codebase and produce an architectural inventory. For each
top-level directory/module:
1. Its purpose (one sentence)
2. Key files and entry points
3. Dependencies (other modules it imports from)
4. External dependencies (third-party packages)
5. Test coverage (has tests / no tests / partial)
6. Language version and framework version

Then identify: circular dependencies, dead code candidates (unreachable
functions/routes), and the 5 riskiest modules (most complex, least tested,
most dependencies).

Sessions: 1 Claude Code session. Estimated time: 2–4 hours for a 200K-line codebase. Cost: ~$5–15 in tokens.

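The circular-dependency check requested in the prompt is easy to verify independently once the agent has listed each module's imports. The following is a minimal sketch, assuming the inventory has been reduced to a mapping from module name to the modules it imports; the module names are hypothetical.

```python
from typing import Dict, List, Optional

def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of module names, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {module: WHITE for module in deps}
    path: List[str] = []

    def visit(module: str) -> Optional[List[str]]:
        color[module] = GREY
        path.append(module)
        for dep in deps.get(module, []):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so this edge closes a cycle
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[module] = BLACK
        return None

    for module in list(deps):
        if color[module] == WHITE:
            found = visit(module)
            if found:
                return found
    return None

# Hypothetical module graph: billing -> auth -> models -> billing
example = {
    "billing": ["auth"],
    "auth": ["models"],
    "models": ["billing"],
    "reports": ["models"],
}
print(find_cycle(example))
```

A depth-first search reports only the first cycle it meets, which is usually enough to start a conversation about which edge to cut.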
Week 1 — Days 3–4

Dependency Analysis and Dead Code Removal

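Use ast-grep and Claude Code together to map internal dependencies at the function/class level. Then use an agent to remove confirmed dead code. This is the safest first change, because deleting code that nothing executes cannot change runtime behavior. The risk lies in the confirmation: static analysis can miss code reached through reflection, string-based dispatch, or framework conventions, so check each candidate against production logs or a plain text search before deleting it.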
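To make the idea of a dead code candidate concrete, here is a minimal sketch for Python sources using the standard `ast` module. It is not how ast-grep or Claude Code work internally; it only illustrates the name-based reasoning, and the file names and functions are hypothetical.

```python
import ast
from typing import Dict, Set

def dead_function_candidates(sources: Dict[str, str]) -> Set[str]:
    """Return names of functions that are defined but never referenced by name."""
    defined: Set[str] = set()
    referenced: Set[str] = set()
    for source in sources.values():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                defined.add(node.name)
            elif isinstance(node, ast.Name):
                referenced.add(node.id)        # plain calls and references
            elif isinstance(node, ast.Attribute):
                referenced.add(node.attr)      # obj.method or module.func
            elif isinstance(node, ast.alias):
                referenced.add(node.name.split(".")[-1])  # imported names
    return defined - referenced

# Hypothetical two-file codebase
sources = {
    "billing.py": (
        "def total(items):\n"
        "    return sum(items)\n"
        "\n"
        "def legacy_total(items):\n"
        "    return total(items) * 1.0\n"
    ),
    "app.py": "from billing import total\nprint(total([1, 2]))\n",
}
print(dead_function_candidates(sources))
```

The result is a list of candidates, not a verdict: a function invoked through `getattr`, a route table of strings, or a plugin loader is invisible to this kind of analysis.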

Run 2–3 parallel agents via amux, each targeting a different module’s dead code:

## Agent 1: Remove dead code in /src/legacy-auth/
## Agent 2: Remove dead code in /src/legacy-billing/
## Agent 3: Remove dead code in /src/legacy-reports/

Each agent: analyze imports, find unreachable functions, remove them,
run existing tests to confirm nothing breaks. Commit each removal
separately with a clear message.

Sessions: 2–3 parallel Claude Code sessions. Estimated time: 4–8 hours. Cost: ~$10–30.

Week 1 — Day 5

Modernization Plan

With the architectural inventory and clean dependency graph in hand, produce the modernization plan. This is the one step that must be human-driven — the agent provides data, you make the decisions:

Sessions: 1 Claude Code session (for generating the plan draft). Human time: 2–4 hours reviewing and adjusting.

Week 2: Generate Test Coverage

This is the most important week. You cannot safely refactor code that has no tests. Week 2 is not glamorous, but it is the foundation for everything that follows. Every hour invested in test generation saves 3–5 hours of debugging during refactoring.
Week 2 — Days 1–3

Characterization Tests

Characterization tests capture what the code actually does, not what it should do. Even if the code has bugs, the tests document the current behavior. This is critical: during refactoring, you need to know whether a behavior change was intentional or accidental.

Run 4–6 agents in parallel, each generating characterization tests for a different module:

## Prompt template for each agent

Read /src/MODULE_NAME/ and generate characterization tests using
FRAMEWORK (pytest/Jest/PHPUnit).

For every public function and API endpoint:
1. Write a test that captures the current input/output behavior
2. Include edge cases you can identify from the code
3. Test error paths and exception handling
4. Do NOT fix bugs — document the current behavior, even if it looks wrong
5. Add a comment "# Characterization test — captures legacy behavior"
   to distinguish from intentional tests

Run the tests and ensure they all pass against the current code.

Use pytest for Python, Jest for JavaScript/TypeScript, PHPUnit for PHP, JUnit for Java. Match the testing framework to the ecosystem.
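For reference, this is roughly what a good characterization test looks like. The sketch below uses a hypothetical legacy function and pytest-style test functions (plain `assert`, `test_` prefix); note that the third test pins down behavior that looks like a bug rather than fixing it.

```python
# Hypothetical legacy function, standing in for code under /src/MODULE_NAME/.
def legacy_discount(price, customer_type):
    if customer_type == "gold":
        return price * 0.8
    if customer_type == "silver":
        return price * 0.9
    if price < 0:
        return 0
    return price

# Characterization test: captures legacy behavior
def test_gold_discount():
    assert legacy_discount(100, "gold") == 80.0

# Characterization test: captures legacy behavior
def test_unknown_customer_type_pays_full_price():
    assert legacy_discount(100, "platinum") == 100

# Characterization test: captures legacy behavior
def test_negative_price_for_gold_is_not_clamped():
    # Looks wrong (negative prices are only clamped for non-discounted
    # customers), but the test documents what the code does today.
    assert legacy_discount(-50, "gold") == -40.0
```

If the refactored module later clamps negative gold prices to zero, the third test fails, and that failure forces an explicit decision about whether the change is intended.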

Sessions: 4–6 parallel agents via amux. Estimated time: 8–16 hours. Cost: ~$30–80. Expected result: 60–80% code coverage for modules that previously had near-zero coverage.

Week 2 — Days 4–5

Integration Tests and Human Review

Characterization tests cover individual modules. Integration tests cover the interactions between modules — this is where legacy systems hide their worst bugs. Generate integration tests for the critical paths identified in Week 1’s architectural inventory.

Then spend Day 5 reviewing the generated tests. Look for:

Sessions: 2–3 parallel agents for integration tests. Human review: 4–6 hours on Day 5.
Do not skip test review. AI-generated tests can give false confidence. A SonarSource study found that 48% of developers do not verify AI-generated code before committing. For legacy modernization, unreviewed tests are worse than no tests — they create a false safety net that hides regressions. Spend the time on Day 5. It is the highest-leverage review of the entire project.

Week 3: Refactor Module by Module

This is the Strangler Fig pattern in action. Each agent refactors one module in its own git worktree while the legacy system continues running. When the module’s tests pass, you merge the refactored version. The legacy system gradually transforms into the modern one, one module at a time.

Week 3 — Days 1–4

Parallel Refactoring Sprint

Launch 4–8 agents, each in its own git worktree, each refactoring a different module according to the modernization plan from Week 1:

## Example: 6 agents refactoring 6 modules

Agent 1 (worktree: wt-auth)    → Refactor /src/auth/ from PHP 5 → PHP 8
Agent 2 (worktree: wt-billing) → Refactor /src/billing/ from jQuery → React
Agent 3 (worktree: wt-reports) → Refactor /src/reports/ from callbacks → async/await
Agent 4 (worktree: wt-api)     → Refactor /src/api/ from REST v1 → REST v3
Agent 5 (worktree: wt-models)  → Refactor /src/models/ from raw SQL → ORM
Agent 6 (worktree: wt-config)  → Refactor /src/config/ from env files → typed config

Each agent:
1. Create a feature branch in its worktree
2. Refactor the module according to the spec
3. Run the module's characterization tests — they MUST still pass
4. Run the integration tests that touch this module
5. Commit with a clear message describing the migration

Monitor all sessions from the amux dashboard or the mobile PWA. When an agent finishes, review its diff and either approve or redirect.

Sessions: 4–8 parallel agents. Estimated time: 16–32 hours of agent runtime (2–4 days with overnight runs). Human time: 2–3 hours/day reviewing diffs. Cost: ~$80–200.

Week 3 — Day 5

Conflict Resolution and Second Pass

Some modules will have shared interfaces. When Agent 2 changes a function signature that Agent 4 also calls, you get a conflict. This is expected and manageable:

  1. Merge the module that the most other modules depend on first (usually the models/data layer)
  2. Rebase other agent branches onto the updated main
  3. Send each rebased agent a message to fix the conflicts: “The function signature for get_user() changed. Update your module to match the new interface and re-run tests.”
  4. Run the full integration test suite after each merge

Sessions: 2–4 agents for conflict resolution. Human time: 4–6 hours for merge order decisions and integration testing.

Week 4: Integrate, Validate, and Ship

Week 4 — Days 1–2

Full Integration Testing

With all refactored modules merged, run the complete test suite. Fix failures using focused agent sessions — each failure goes to an agent with the specific test, the error message, and the relevant module code.

Run Semgrep and Snyk security scans against the refactored codebase. AI agents can introduce subtle security regressions during refactoring — static analysis catches what tests miss.

Week 4 — Days 3–4

Documentation and Migration Guide

Generate documentation for the modernized codebase. This is where AI agents shine — they are tireless documenters. Run 2–3 agents:

Week 4 — Day 5

Ship

Deploy the modernized codebase. For most teams, this means a blue-green deployment or feature-flagged rollout — the legacy system stays running until the modernized version is validated in production.

Parallel Agent Orchestration

The difference between “using AI for legacy modernization” and “finishing a legacy modernization in four weeks” is parallelism. Here is how to orchestrate it:

Git worktree isolation

Every agent must work in its own git worktree. This is non-negotiable — without it, agents will overwrite each other’s changes. Claude Code has built-in worktree support. With amux, each session automatically gets its own worktree.
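If you are setting worktrees up by hand rather than through a tool, the layout can be scripted. This is a minimal sketch that only builds the commands, assuming git's `git worktree add -b <branch> <path> <commit-ish>` form; the `refactor/` branch prefix and `../wt-` directory names are hypothetical conventions.

```python
from typing import List

def worktree_commands(modules: List[str], base: str = "main") -> List[List[str]]:
    """Build one `git worktree add` command per module (nothing is executed)."""
    commands = []
    for module in modules:
        commands.append([
            "git", "worktree", "add",
            "-b", f"refactor/{module}",  # new branch for this agent
            f"../wt-{module}",           # worktree directory beside the repo
            base,                        # commit-ish to branch from
        ])
    return commands

for command in worktree_commands(["auth", "billing"]):
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` from the repository root once the names look right.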

Shared task board

Use amux’s kanban board to track which module each agent is working on. Statuses: todo (not started), doing (agent working), review (waiting for human review), done (merged). This gives you a single view of the entire modernization sprint — from your desktop or from the mobile PWA on your phone.

Merge order protocol

Modules cannot be merged in an arbitrary order. The rule: merge bottom-up through the dependency graph. Data models and shared utilities first, then the modules that depend on them, then top-level routes and controllers last. This minimizes rebase conflicts and ensures each merge has a stable foundation.
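A bottom-up merge order is a topological sort of the dependency graph. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the module graph is hypothetical and would come from the Week 1 inventory.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each module maps to the modules it depends on.
depends_on = {
    "models": set(),
    "utils": set(),
    "auth": {"models", "utils"},
    "billing": {"models", "auth"},
    "api": {"auth", "billing"},
}

# static_order() yields every module after all of its dependencies.
merge_order = list(TopologicalSorter(depends_on).static_order())
print(merge_order)
```

Modules with no ordering constraint between them (here `models` and `utils`) may appear in either order. If the graph contains a cycle, `static_order()` raises `graphlib.CycleError`, which is a signal to break the cycle before planning merges.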

The orchestration overhead is small. For a 6-agent parallel refactoring sprint, expect 2–3 hours per day of human overhead: reviewing agent diffs, resolving merge conflicts, and validating test results. The agents do 90% of the work; you make the judgment calls.

Cost Breakdown

DIY with AI agents

4-week modernization sprint

$400 – $1,600
  • Claude Code Max: $100–200/mo per developer
  • Additional API tokens for parallel agents: $50–200/mo
  • Static analysis tools (Semgrep, Snyk): $0–100/mo (free tiers available)
  • Human time: 40–60 hours over 4 weeks (review, decisions, validation)
Modernization consultancy

Comparable engagement

$150K – $500K+
  • Team of 4–6 consultants: $200–350/hour
  • Duration: 3–6 months
  • Project management overhead: 20–30% of budget
  • Risk of scope creep and timeline overrun: high

Cost estimates assume a 200K–500K line codebase. For larger codebases (1M+), multiply the DIY cost by the number of 4-week cycles needed (typically 3–5 cycles, per the table under Scaling Beyond 500K Lines). Consultancy costs scale roughly linearly with codebase size.
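As a sanity check, the token costs quoted in the Week 1–3 "Sessions" notes can be totalled directly. This sketch covers only the phases where this guide quotes a token figure; subscriptions, static analysis tools, and phases without a quoted figure are not included.

```python
# (low, high) token cost in USD, as quoted in the Week 1-3 "Sessions" notes.
phase_costs = {
    "architectural inventory": (5, 15),
    "dead code removal": (10, 30),
    "characterization tests": (30, 80),
    "parallel refactoring": (80, 200),
}

low = sum(cost[0] for cost in phase_costs.values())
high = sum(cost[1] for cost in phase_costs.values())
print(f"Quoted token spend: ${low}-${high}")  # prints "Quoted token spend: $125-$325"
```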

Safety Guardrails

The number one risk in AI-assisted legacy modernization is breaking production. These four safety layers prevent it:

Layer 1: Characterization tests (from Week 2)

Every refactored module must pass its characterization tests. If a test fails after refactoring, either the refactoring introduced a regression (fix it) or the test was capturing a bug that the refactoring correctly eliminated (update the test with a comment explaining why).
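When both versions of a function can be imported side by side, the same idea can be applied as a differential check: run the legacy and refactored implementations on identical inputs and list every disagreement. A minimal sketch, with hypothetical functions; exceptions are compared by type because raising is part of the behavior.

```python
from typing import Any, Callable, Iterable, List, Tuple

def behavior_diffs(
    legacy: Callable[..., Any],
    refactored: Callable[..., Any],
    cases: Iterable[tuple],
) -> List[Tuple[tuple, Any, Any]]:
    """Run both implementations on each input and return the cases that differ."""
    def outcome(fn: Callable[..., Any], args: tuple) -> Tuple[str, Any]:
        try:
            return ("ok", fn(*args))
        except Exception as exc:  # an exception is behavior too
            return ("raised", type(exc).__name__)

    diffs = []
    for args in cases:
        old, new = outcome(legacy, args), outcome(refactored, args)
        if old != new:
            diffs.append((args, old, new))
    return diffs

# Hypothetical pair: the refactor quietly "fixes" division by zero.
def legacy_ratio(a, b):
    return a / b

def refactored_ratio(a, b):
    return a / b if b else 0.0

print(behavior_diffs(legacy_ratio, refactored_ratio, [(6, 3), (1, 0)]))
```

Each reported difference falls into one of the two buckets above: a regression to fix, or a corrected bug to record in the updated test.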

Layer 2: Git worktree isolation

Agents cannot touch each other’s work. Each agent operates in its own worktree on its own branch. Changes only reach main through reviewed, tested pull requests.

Layer 3: CI gates

Every agent PR triggers the full test suite, Semgrep security scan, and ESLint/SonarQube code quality checks. The PR cannot merge if any gate fails. This is the same CI pipeline you use for human PRs — agents do not get special treatment.

Layer 4: Incremental merges

Merge one module at a time. Run integration tests after each merge. Only proceed to the next module if everything passes. If a merge breaks integration, revert it, fix the agent’s output, and re-merge. This sounds slow, but it is far faster than debugging a big-bang merge of 6 modules at once.
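The merge loop itself is short enough to write down. In this sketch the `merge`, `tests_pass`, and `revert` callables are placeholders you would back with your own git and CI commands; the example uses an in-memory list so the control flow is visible.

```python
from typing import Callable, List, Tuple

def merge_incrementally(
    order: List[str],
    merge: Callable[[str], None],
    tests_pass: Callable[[], bool],
    revert: Callable[[str], None],
) -> Tuple[List[str], str]:
    """Merge modules one at a time; revert and stop at the first failing merge.

    Returns (modules merged successfully, module that failed or "").
    """
    merged: List[str] = []
    for module in order:
        merge(module)
        if not tests_pass():
            revert(module)  # back to the last known-good state
            return merged, module
        merged.append(module)
    return merged, ""

# Hypothetical run: integration tests fail as soon as "billing" is merged.
state: List[str] = []
outcome = merge_incrementally(
    ["models", "auth", "billing", "api"],
    merge=state.append,
    tests_pass=lambda: "billing" not in state,
    revert=state.remove,
)
print(outcome)
```

Stopping at the first failure keeps the search space to a single module's diff, which is the whole point of merging incrementally.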

Claude Code hooks for legacy refactoring

Add hooks that enforce safety during refactoring sessions:

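.claude/settings.json: PostToolUse hook that reports invalid Python syntax after each Write or Edit. It runs after the tool because a PreToolUse hook fires before the file is written and would parse the old contents. The hook receives its input as JSON on stdin, and exit code 2 returns the error output to the agent so it can repair the file.
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Write|Edit",
      "hooks": [{
        "type": "command",
        "command": "python3 -c \"import ast, json, sys; p = json.load(sys.stdin)['tool_input']['file_path']; p.endswith('.py') and ast.parse(open(p).read())\" || exit 2"
      }]
    }]
  }
}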

See the safety checklist for more enforcement patterns.
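A hook command can also live in a standalone script, which is easier to test than an inline one-liner. The sketch below assumes the hook payload is a JSON object whose `tool_input.file_path` names the edited file and that exit code 2 signals a problem back to the agent, as described in Claude Code's hooks documentation; verify both against the version you run. It checks Python files only.

```python
import ast
from typing import Tuple

def check_payload(payload: dict) -> Tuple[int, str]:
    """Return (exit_code, message) for one hook event payload."""
    path = payload.get("tool_input", {}).get("file_path", "")
    if not path.endswith(".py"):
        return 0, ""  # only Python files are checked in this sketch
    try:
        with open(path, encoding="utf-8") as handle:
            ast.parse(handle.read(), filename=path)
    except SyntaxError as exc:
        return 2, f"{path}:{exc.lineno}: {exc.msg}"
    except OSError:
        return 0, ""  # missing or unreadable file: do not flag
    return 0, ""

# Entry point when used as a hook script (commented out so the sketch
# can be imported and tested without reading stdin):
#
#   import json, sys
#   code, message = check_payload(json.load(sys.stdin))
#   if message:
#       print(message, file=sys.stderr)
#   sys.exit(code)
```

The same structure extends to other languages by dispatching on the file extension and shelling out to that language's syntax checker.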

Scaling Beyond 500K Lines

For codebases above 500K lines, a single four-week sprint is not enough. Scale with repeated cycles:

| Codebase size | Cycles | Total timeline | Parallel agents | Estimated cost (DIY) |
|---|---|---|---|---|
| 100K–200K lines | 1 cycle | 4 weeks | 4–6 | $400–800 |
| 200K–500K lines | 1–2 cycles | 4–8 weeks | 6–8 | $800–1,600 |
| 500K–1M lines | 2–3 cycles | 8–12 weeks | 8–10 | $1,600–3,200 |
| 1M–2M+ lines | 3–5 cycles | 12–20 weeks | 10+ | $3,200–8,000 |

Each cycle targets a different subsystem. Use the first cycle’s architectural inventory to prioritize: start with the subsystem that has the most technical debt, the most pain, or the most upcoming feature work that would benefit from a modern foundation.

For codebases above 1M lines, consider Red Hat’s agent mesh approach: specialized agents for different roles (code analysis, test generation, refactoring, security review) rather than generalist agents that do everything. amux’s shared board coordinates these specialist agents through task assignment and status tracking.

Anti-Patterns That Kill Legacy Modernization Projects

Big bang rewrite

Rewriting the entire codebase at once, then switching over. This fails 50–70% of the time because requirements drift during the rewrite, the new system has its own bugs, and there is no incremental validation. AI agents make this temptation worse (“it’s so fast, let’s just rewrite everything!”). Use the Strangler Fig pattern instead. Module by module. One merge at a time.

Refactoring without tests

Skipping Week 2 because “we’ll just run it and see if it works.” Legacy code has implicit behavior that is not obvious from reading the code. Without characterization tests, you will not know whether a refactoring changed behavior until a customer reports a bug. Test generation is the foundation. Do not skip it.

Trusting agent output without review

AI agents produce plausible-looking code that passes tests but introduces subtle regressions. Cognitive debt research shows that AI-co-authored code has 1.7x more issues and 2.74x more security vulnerabilities. Every agent PR gets human review. The review can be faster than reviewing human code (agents produce consistent style), but it cannot be skipped.

Modernizing everything at once

Trying to upgrade the language version AND migrate the framework AND restructure the architecture in one pass. Each of these is a separate project. Stack them sequentially: language upgrade first (smallest blast radius), framework migration second, architecture changes last. AI agents are fast, but compounding three types of change in one refactoring makes failures undiagnosable.

No merge order discipline

Merging refactored modules in random order creates cascading conflicts. Always merge bottom-up through the dependency graph: shared utilities and data models first, then the modules that depend on them, then top-level entry points last. amux’s kanban board makes this visible: columns for todo, doing, review, and done map directly to the merge pipeline.

FAQ

What languages work best for AI-assisted legacy modernization?

Python, JavaScript/TypeScript, and Java have the best results because AI models have seen the most training data for these languages. PHP, Ruby, and C# also work well. COBOL, Fortran, and other rare languages produce mixed results — models can read the code but struggle with idiomatic transformations. For rare languages, use the agent for analysis and test generation, but handle the refactoring more interactively with Aider where you guide each change.

Can I use this approach for a monolith-to-microservices migration?

Partially. AI agents can identify service boundaries, extract modules, and set up new service scaffolding. But the architectural decisions (how to split the data model, how to handle cross-service transactions, what communication pattern to use) require human judgment. Use this four-week approach for each service extraction: map the module, generate its tests, refactor it into a standalone service, and validate. The overall microservices strategy is a multi-month project with this guide applied in cycles.

How do I handle a codebase with zero existing tests?

This is the most common scenario for legacy code, and it is exactly what Week 2 addresses. Start by installing a test framework (pytest, Jest, etc.) and running the first characterization test agent on the simplest module. Expect to iterate: the first round of generated tests will have issues, but each module gets easier as you refine the prompt and record the codebase’s conventions for the agent to reuse. Aim for 60–80% coverage on critical paths, not 100% on everything.

What if the legacy codebase is too large for an AI context window?

Claude Code handles this by reading files on demand rather than loading the entire codebase, and it compacts older context as a session grows. For the Week 1 architectural inventory, point the agent at the top-level directory and let it traverse the tree. It will read files as needed, building its understanding incrementally. For very large codebases (1M+ lines), split the inventory task across multiple agents, each covering a different subsystem. See our context engineering guide for strategies on managing agent context at scale.
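Splitting the inventory across agents works best when each agent gets a similar amount of code. One simple way to balance it, sketched below, is a greedy assignment: sort subsystems by size and always hand the next one to the least-loaded agent. The subsystem names and line counts are hypothetical.

```python
import heapq
from typing import Dict, List

def split_across_agents(line_counts: Dict[str, int], agents: int) -> List[List[str]]:
    """Greedy split: give the largest remaining subsystem to the least-loaded agent."""
    heap = [(0, index) for index in range(agents)]  # (assigned lines, agent index)
    buckets: List[List[str]] = [[] for _ in range(agents)]
    ordered = sorted(line_counts.items(), key=lambda item: (-item[1], item[0]))
    for name, lines in ordered:
        load, index = heapq.heappop(heap)
        buckets[index].append(name)
        heapq.heappush(heap, (load + lines, index))
    return buckets

# Hypothetical subsystem sizes in lines of code
sizes = {"billing": 400_000, "auth": 150_000, "reports": 300_000, "api": 250_000}
print(split_across_agents(sizes, 2))
```

Line count is only a proxy for effort; a small but tangled subsystem may deserve an agent of its own.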

Is this approach suitable for regulated industries (healthcare, finance)?

Yes, with additional guardrails. For regulated codebases: (1) If data cannot leave your network, run agents against a self-hosted model (for example, Aider pointed at a model served on your own infrastructure); a managed service such as AWS Bedrock keeps traffic inside your cloud account but is not self-hosted, so confirm it satisfies your data-residency rules before relying on it, (2) Add compliance-specific characterization tests in Week 2 (regulatory calculations, audit trail behavior, data handling paths), (3) Require two-person review for agent PRs that touch compliance-critical modules, and (4) Keep a detailed audit log of every agent-generated change (amux session logs provide this automatically). See our security hardening guide for more.

How does this compare to Morgan Stanley’s approach?

Morgan Stanley built a custom AI tool that converted 9 million lines of code, saving an estimated 280,000 developer hours. Their approach was enterprise-scale: custom-trained models, dedicated infrastructure, months of development on the tooling itself. The approach in this guide uses off-the-shelf AI coding agents and is designed for teams of 1–5 developers working on 100K–2M line codebases. You trade Morgan Stanley’s scale for speed and cost: they spent millions building bespoke tooling; you spend $400–1,600 using existing tools in a structured workflow.


Further Reading