AI Agents for Legacy Code Modernization: The Developer’s Week-by-Week DIY Guide
Most guides on AI legacy modernization are written by consultancies selling six-figure engagements. This one is written for developers who want to do it themselves. A four-week playbook: map a legacy codebase with AI agents, generate test coverage from near-zero, refactor module by module using the Strangler Fig pattern, and ship, with parallel agent workflows, safety guardrails, and realistic cost estimates for codebases from 100K to 2M+ lines.
Why AI Agents Change the Legacy Modernization Equation
Legacy modernization has always been a math problem with terrible numbers. Rewriting a 300K-line codebase takes 12–18 months with a team of 4–6 developers. Hiring a consultancy costs $150,000–500,000+. The project has a 50–70% failure rate because requirements drift during the rewrite. So teams live with the legacy code, adding duct tape until the next person inherits the mess.
AI coding agents break this equation in three ways:
1. Agents read legacy code without complaining
A developer staring at 200K lines of undocumented PHP 5 will spend weeks building a mental model. Claude Code with Opus 4.6 can walk the codebase file by file, map dependencies, and produce an architectural inventory in hours. It does not need motivation, onboarding, or context-switching time. It will not quit.
2. Agents generate tests for code that was never tested
The biggest blocker in legacy modernization is not the refactoring — it is the lack of tests. You cannot safely refactor code that has no tests. But writing tests for legacy code is mind-numbing, tedious work that developers avoid. AI agents will happily generate hundreds of characterization tests that capture existing behavior, giving you the safety net needed before changing anything.
3. Parallel agents turn a serial project into a parallel one
A human team refactors one module at a time. An amux fleet of 4–8 agents refactors 4–8 modules simultaneously, each in its own git worktree, coordinated through a shared task board. Work that would take 3 months sequentially can finish in roughly 3 weeks when the modules are independent enough to refactor in parallel.
What AI Agents Can and Cannot Modernize
AI agents are not magic. They excel at pattern-based transformations and struggle with architectural judgment. Here is the honest breakdown:
AI agents excel at
- Language version upgrades — Python 2→3, Java 8→21, PHP 5→8, Ruby 2→3
- Framework migrations — jQuery→React, AngularJS→Angular, Rails 5→7, Django 2→5
- Dependency updates — resolving outdated packages, fixing breaking changes, removing vulnerabilities
- Dead code removal — identifying and removing unused functions, imports, routes, and config
- Test generation — writing unit, integration, and characterization tests for untested code
- API modernization — SOAP→REST, REST→GraphQL, callback→async/await
- Code style normalization — enforcing consistent formatting, naming conventions, and linting rules
- Documentation generation — producing inline docs, API references, and architecture diagrams from undocumented code
AI agents struggle with
- Architectural redesigns — monolith→microservices requires human judgment about service boundaries, data ownership, and team topology
- Business logic rewrites — agents can migrate syntax but cannot validate that a pricing calculation, compliance rule, or workflow is still correct without explicit specs
- Database schema migrations — changing data models affects every query, report, and integration; agents need human-defined migration plans
- Cross-system integration changes — when the legacy system talks to 15 other services, changing interfaces requires coordination beyond the codebase
Rule of thumb: if the transformation can be expressed as a mechanical pattern (for example, Python 2 print statements → Python 3 print() functions), agents handle it reliably. If the transformation requires understanding why the business logic exists and whether it should change, you need a human making the call and an agent executing it.
Tool Selection: Which Agent for Which Phase
| Phase | Best tool | Why | Alternative |
|---|---|---|---|
| Codebase mapping | Claude Code | Opus 4.6 excels at reading and reasoning about unfamiliar codebases. 200K context window handles large files. | Cursor (visual navigation) |
| Dependency analysis | Claude Code + ast-grep | Combine AI reasoning with AST-based search for structural dependency mapping | Dependabot (for package deps) |
| Test generation | Claude Code or Codex | Both handle batch test generation well. Run in parallel for maximum throughput. | Aider (interactive test refinement) |
| Module refactoring | Claude Code | Deep reasoning needed. Multi-file changes. Worktree isolation built in. | Cursor v3 (visual diff review) |
| Security scanning | Semgrep + Snyk | Static analysis tools catch what agents miss. Run as CI gates on every agent PR. | SonarQube |
| Orchestration | amux | Shared task board, inter-session messaging, mobile monitoring, crash recovery | DIY tmux + scripts |
Week 1: Map the Codebase
Architectural Inventory
Before changing anything, you need to know what you have. Launch a Claude Code session and ask it to produce an architectural inventory of the codebase. This is the single most valuable task for an AI agent in legacy work — what takes a developer weeks of code reading takes an agent hours.
The inventory should include:
- Module map — every major directory, its purpose, and its entry points
- Dependency graph — which modules depend on which, including circular dependencies
- Technology inventory — language versions, frameworks, libraries, and their ages
- Test coverage — which modules have tests and which have zero coverage
- Dead code candidates — functions, routes, and files that appear unreachable
## Example prompt for Claude Code
Read this entire codebase and produce an architectural inventory. For each
top-level directory/module:
1. Its purpose (one sentence)
2. Key files and entry points
3. Dependencies (other modules it imports from)
4. External dependencies (third-party packages)
5. Test coverage (has tests / no tests / partial)
6. Language version and framework version
Then identify: circular dependencies, dead code candidates (unreachable
functions/routes), and the 5 riskiest modules (most complex, least tested,
most dependencies).
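The agent's dependency graph is worth spot-checking with something deterministic. As a minimal sketch for a Python codebase only (other languages would need ast-grep or a language-specific parser), the following builds a module-level import graph and reports one circular dependency. The function names and the example sources are hypothetical, and the module names are treated as top-level packages, which is a simplifying assumption.

```python
import ast
from typing import Dict, List, Optional, Set


def module_imports(source: str) -> Set[str]:
    """Return the top-level package names imported by a piece of Python source."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def build_graph(sources: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each module to the other modules in `sources` that it imports."""
    return {
        name: module_imports(code) & (set(sources) - {name})
        for name, code in sources.items()
    }


def find_cycle(graph: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Depth-first search; return one import cycle as a path, or None."""
    unvisited, in_progress, finished = 0, 1, 2
    state = dict.fromkeys(graph, unvisited)
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = in_progress
        path.append(node)
        for dep in sorted(graph[node]):
            if state[dep] == in_progress:
                # Reached a module already on the current path: a cycle.
                return path[path.index(dep):] + [dep]
            if state[dep] == unvisited:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = finished
        return None

    for node in sorted(graph):
        if state[node] == unvisited:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical module sources, standing in for files read from disk.
EXAMPLE_SOURCES = {
    "auth": "import billing\nimport os\n",
    "billing": "from reports import monthly\n",
    "reports": "import auth\n",
    "config": "import os\n",
}

print(find_cycle(build_graph(EXAMPLE_SOURCES)))
```

If the script and the agent disagree about a cycle, that module is a good candidate for the "riskiest modules" list.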
Dependency Analysis and Dead Code Removal
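Use ast-grep and Claude Code together to map internal dependencies at the function/class level. Then use an agent to remove confirmed dead code. This is the safest first change, because deleting code that is truly unreachable cannot break anything that was actually running. Treat a function as confirmed dead only after checking for dynamic references (reflection, string-based dispatch, routes or callbacks registered in config), which static analysis can miss.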
Run 2–3 parallel agents via amux, each targeting a different module’s dead code:
## Agent 1: Remove dead code in /src/legacy-auth/
## Agent 2: Remove dead code in /src/legacy-billing/
## Agent 3: Remove dead code in /src/legacy-reports/
Each agent: analyze imports, find unreachable functions, remove them,
run existing tests to confirm nothing breaks. Commit each removal
separately with a clear message.
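To give a sense of what "find unreachable functions" means mechanically, here is a deliberately conservative sketch for Python sources: it lists functions that are defined somewhere but whose names are never referenced anywhere. It is an illustration under stated assumptions, not what ast-grep or the agent actually runs. The function name and sample sources are hypothetical, and matching by bare name means any dynamic lookup (getattr, string dispatch) is invisible to it.

```python
import ast
from typing import Dict, Set


def dead_function_candidates(sources: Dict[str, str]) -> Set[str]:
    """Return names of functions that are defined but never referenced by name."""
    defined: Set[str] = set()
    referenced: Set[str] = set()
    for code in sources.values():
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                defined.add(node.name)
            elif isinstance(node, ast.Name):
                referenced.add(node.id)        # plain calls and references
            elif isinstance(node, ast.Attribute):
                referenced.add(node.attr)      # obj.method style references
            elif isinstance(node, ast.ImportFrom):
                # Count an import as a reference, to err on the side of keeping code.
                referenced.update(alias.name for alias in node.names)
    return defined - referenced


# Hypothetical sources, standing in for files in /src/legacy-billing/.
EXAMPLE_SOURCES = {
    "billing.py": (
        "def charge(amount):\n"
        "    return amount + tax(amount)\n"
        "\n"
        "def tax(amount):\n"
        "    return amount * 0.2\n"
        "\n"
        "def old_discount(amount):\n"
        "    return amount\n"
    ),
    "app.py": "from billing import charge\nprint(charge(10))\n",
}

print(dead_function_candidates(EXAMPLE_SOURCES))
```

The output is a list of candidates to hand to the agent or a reviewer, not a deletion list.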
Modernization Plan
With the architectural inventory and clean dependency graph in hand, produce the modernization plan. This is the one step that must be human-driven — the agent provides data, you make the decisions:
- Module priority order — which modules to modernize first (start with the least-coupled, best-tested modules)
- Migration targets — what each module should look like after modernization (new framework, new API style, new patterns)
- Risk assessment — which modules need careful human review vs. which can be agent-driven
- Parallelism plan — which modules can be refactored simultaneously (no shared dependencies)
Week 2: Generate Test Coverage
Characterization Tests
Characterization tests capture what the code actually does, not what it should do. Even if the code has bugs, the tests document the current behavior. This is critical: during refactoring, you need to know whether a behavior change was intentional or accidental.
Run 4–6 agents in parallel, each generating characterization tests for a different module:
## Prompt template for each agent
Read /src/MODULE_NAME/ and generate characterization tests using
FRAMEWORK (pytest/Jest/PHPUnit).
For every public function and API endpoint:
1. Write a test that captures the current input/output behavior
2. Include edge cases you can identify from the code
3. Test error paths and exception handling
4. Do NOT fix bugs — document the current behavior, even if it looks wrong
5. Add a comment "# Characterization test — captures legacy behavior"
to distinguish from intentional tests
Run the tests and ensure they all pass against the current code.
Use pytest for Python, Jest for JavaScript/TypeScript, PHPUnit for PHP, JUnit for Java. Match the testing framework to the ecosystem.
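For orientation, this is roughly what a generated characterization test looks like in pytest style. The legacy function here is hypothetical and included only so the example runs on its own; note that the tests pin down quirks (truncated weights, a negative weight lowering the price) instead of correcting them.

```python
def legacy_shipping_cost(weight_kg, express=False):
    """Hypothetical legacy function, quirks included."""
    cost = 5 + 2 * int(weight_kg)  # int() silently truncates fractional kilograms
    if express:
        cost = cost * 1.5
    return cost


# Characterization test (captures legacy behavior)
def test_fractional_weight_is_truncated():
    assert legacy_shipping_cost(2.9) == 9


# Characterization test (captures legacy behavior)
def test_express_returns_a_float():
    assert legacy_shipping_cost(2, express=True) == 13.5


# Characterization test (captures legacy behavior)
# A negative weight reduces the cost. This looks like a bug; it is documented, not fixed.
def test_negative_weight_reduces_cost():
    assert legacy_shipping_cost(-1) == 3


# Characterization test (captures legacy behavior)
def test_non_numeric_weight_raises_value_error():
    try:
        legacy_shipping_cost("heavy")
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

Saved as a test file, this runs under pytest without extra setup; the marker comment makes these tests easy to find later when you decide which legacy behaviors to keep.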
Integration Tests and Human Review
Characterization tests cover individual modules. Integration tests cover the interactions between modules — this is where legacy systems hide their worst bugs. Generate integration tests for the critical paths identified in Week 1’s architectural inventory.
Then spend Day 5 reviewing the generated tests. Look for:
- False confidence — tests that always pass because they test the wrong thing
- Missing edge cases — boundary conditions that the agent could not infer from code alone
- Business logic gaps — tests that capture the how but miss the why (add intent comments yourself)
Week 3: Refactor Module by Module
This is the Strangler Fig pattern in action. Each agent refactors one module in its own git worktree while the legacy system continues running. When the module’s tests pass, you merge the refactored version. The legacy system gradually transforms into the modern one, one module at a time.
Parallel Refactoring Sprint
Launch 4–8 agents, each in its own git worktree, each refactoring a different module according to the modernization plan from Week 1:
## Example: 6 agents refactoring 6 modules
Agent 1 (worktree: wt-auth) → Refactor /src/auth/ from PHP 5 → PHP 8
Agent 2 (worktree: wt-billing) → Refactor /src/billing/ from jQuery → React
Agent 3 (worktree: wt-reports) → Refactor /src/reports/ from callbacks → async/await
Agent 4 (worktree: wt-api) → Refactor /src/api/ from REST v1 → REST v3
Agent 5 (worktree: wt-models) → Refactor /src/models/ from raw SQL → ORM
Agent 6 (worktree: wt-config) → Refactor /src/config/ from env files → typed config
Each agent:
1. Create a feature branch in its worktree
2. Refactor the module according to the spec
3. Run the module's characterization tests — they MUST still pass
4. Run the integration tests that touch this module
5. Commit with a clear message describing the migration
Monitor all sessions from the amux dashboard or the mobile PWA. When an agent finishes, review its diff and either approve or redirect.
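If you are setting up the worktrees yourself instead of letting a tool do it, the per-agent setup above can be scripted. The sketch below wraps `git worktree add -b <branch> <path>`; the `wt-<module>` directory naming mirrors the example above, while the `modernize/<module>` branch naming and the function names are assumptions made for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_command(repo: Path, module: str) -> List[str]:
    """Build the git command that creates one agent's worktree and feature branch."""
    worktree_dir = repo.parent / f"wt-{module}"  # sibling directory of the main checkout
    branch = f"modernize/{module}"               # hypothetical branch naming scheme
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(worktree_dir)]


def create_worktrees(repo: Path, modules: List[str]) -> None:
    """Create one worktree per module; stops at the first git failure."""
    for module in modules:
        subprocess.run(worktree_command(repo, module), check=True)


# Printing the commands first is a cheap dry run before touching the repository.
for name in ["auth", "billing", "reports"]:
    print(" ".join(worktree_command(Path("/repos/legacy-app"), name)))
```

Each new branch starts from the current HEAD of the main checkout, so create the worktrees only after the Week 2 tests are merged.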
Conflict Resolution and Second Pass
Some modules will have shared interfaces. When Agent 2 changes a function signature that Agent 4 also calls, you get a conflict. This is expected and manageable:
- Merge the module that the most other modules depend on first (usually the models/data layer)
- Rebase other agent branches onto the updated main
- Send each rebased agent a message to fix the conflicts:
  “The function signature for get_user() changed. Update your module to match the new interface and re-run tests.”
- Run the full integration test suite after each merge
Week 4: Integrate, Validate, and Ship
Full Integration Testing
With all refactored modules merged, run the complete test suite. Fix failures using focused agent sessions — each failure goes to an agent with the specific test, the error message, and the relevant module code.
Run Semgrep and Snyk security scans against the refactored codebase. AI agents can introduce subtle security regressions during refactoring — static analysis catches what tests miss.
Documentation and Migration Guide
Generate documentation for the modernized codebase. This is where AI agents shine — they are tireless documenters. Run 2–3 agents:
- Agent 1: Generate API documentation for all public interfaces
- Agent 2: Write a migration guide documenting every breaking change
- Agent 3: Update README, CHANGELOG, and deployment instructions
Ship
Deploy the modernized codebase. For most teams, this means a blue-green deployment or feature-flagged rollout, so the legacy system stays running until the modernized version is validated in production. Before cutting over, confirm each item on this checklist:
- All characterization tests pass
- All integration tests pass
- Security scans clean (or known issues documented)
- Migration guide reviewed and approved
- Rollback plan documented and tested
- Monitoring/alerting configured for the new code paths
Parallel Agent Orchestration
The difference between “using AI for legacy modernization” and “finishing a legacy modernization in four weeks” is parallelism. Here is how to orchestrate it:
Git worktree isolation
Every agent must work in its own git worktree. This is non-negotiable — without it, agents will overwrite each other’s changes. Claude Code has built-in worktree support. With amux, each session automatically gets its own worktree.
Shared task board
Use amux’s kanban board to track which module each agent is working on. Statuses: todo (not started), doing (agent working), review (waiting for human review), done (merged). This gives you a single view of the entire modernization sprint — from your desktop or from the mobile PWA on your phone.
Merge order protocol
Not all modules can be merged in any order. The rule: merge bottom-up through the dependency graph. Data models and shared utilities first, then the modules that depend on them, then top-level routes and controllers last. This minimizes rebase conflicts and ensures each merge has a stable foundation.
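The bottom-up rule is a topological sort of the dependency graph. A minimal sketch using Python's standard-library `graphlib` (3.9+) is below; the module names and the `depends_on` mapping are hypothetical and would come from your Week 1 inventory. Modules in the same batch have no dependencies on each other, so they can be reviewed in parallel and merged in any order within the batch.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def merge_batches(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group modules into merge batches, dependencies first.

    Raises graphlib.CycleError if the graph contains a circular dependency.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are merged
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical graph: each module maps to the modules it depends on.
DEPENDS_ON = {
    "models": set(),
    "utils": set(),
    "auth": {"models", "utils"},
    "billing": {"models"},
    "api": {"auth", "billing"},
}

for number, batch in enumerate(merge_batches(DEPENDS_ON), start=1):
    print(number, batch)
# 1 ['models', 'utils']
# 2 ['auth', 'billing']
# 3 ['api']
```

A `CycleError` here is useful information in its own right: circular dependencies have to be broken before a clean bottom-up merge is possible.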
Cost Breakdown
4-week modernization sprint
- Claude Code Max: $100–200/mo per developer
- Additional API tokens for parallel agents: $50–200/mo
- Static analysis tools (Semgrep, Snyk): $0–100/mo (free tiers available)
- Human time: 40–60 hours over 4 weeks (review, decisions, validation)
Comparable engagement
- Team of 4–6 consultants: $200–350/hour
- Duration: 3–6 months
- Project management overhead: 20–30% of budget
- Risk of scope creep and timeline overrun: high
Cost estimates assume a 200K–500K line codebase. For larger codebases (1M+), multiply the DIY cost by the number of 4-week cycles needed (typically 2–3 cycles). Consultancy costs scale roughly linearly with codebase size.
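The "multiply by cycles" rule can be written out as a small calculation. This sketch only sums the three tool line items listed above, and it makes two assumptions that the breakdown does not state: one four-week cycle is billed as one month, and only the Claude Code Max subscription scales with the number of developers. Its totals will therefore not match every figure quoted elsewhere in this guide.

```python
from typing import Tuple

# (low, high) monthly figures in USD, taken from the breakdown above.
SUBSCRIPTION_PER_DEVELOPER = (100, 200)
SHARED_MONTHLY = {
    "extra API tokens": (50, 200),
    "static analysis": (0, 100),
}


def diy_cost_range(cycles: int, developers: int = 1) -> Tuple[int, int]:
    """Return the (low, high) tool cost in USD for a number of four-week cycles."""
    low = SUBSCRIPTION_PER_DEVELOPER[0] * developers
    high = SUBSCRIPTION_PER_DEVELOPER[1] * developers
    low += sum(item_low for item_low, _ in SHARED_MONTHLY.values())
    high += sum(item_high for _, item_high in SHARED_MONTHLY.values())
    return low * cycles, high * cycles


print(diy_cost_range(1))                # (150, 500)
print(diy_cost_range(3))                # (450, 1500)
print(diy_cost_range(1, developers=2))  # (250, 700)
```

Human time (40–60 hours per cycle) is left out on purpose; price it at your own rate to get a full comparison.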
Safety Guardrails
The number one risk in AI-assisted legacy modernization is breaking production. These four safety layers prevent it:
Layer 1: Characterization tests (from Week 2)
Every refactored module must pass its characterization tests. If a test fails after refactoring, either the refactoring introduced a regression (fix it) or the test was capturing a bug that the refactoring correctly eliminated (update the test with a comment explaining why).
Layer 2: Git worktree isolation
Agents cannot touch each other’s work. Each agent operates in its own worktree on its own branch. Changes only reach main through reviewed, tested pull requests.
Layer 3: CI gates
Every agent PR triggers the full test suite, Semgrep security scan, and ESLint/SonarQube code quality checks. The PR cannot merge if any gate fails. This is the same CI pipeline you use for human PRs — agents do not get special treatment.
Layer 4: Incremental merges
Merge one module at a time. Run integration tests after each merge. Only proceed to the next module if everything passes. If a merge breaks integration, revert it, fix the agent’s output, and re-merge. This sounds slow, but it is far faster than debugging a big-bang merge of 6 modules at once.
Claude Code hooks for legacy refactoring
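Add hooks that enforce safety during refactoring sessions. In .claude/settings.json, each hook entry pairs a matcher with a nested hooks list of commands, and each command receives the tool call as JSON on stdin (there is no $FILE variable to interpolate). A PostToolUse hook runs after the file has been written, so the check sees the edited file; exiting with code 2 reports the error back to the agent so it can repair the edit. The script path below is a hypothetical one that you would supply yourself, and the schema is as documented at the time of writing, so verify it against your Claude Code version:
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Write|Edit",
      "hooks": [{
        "type": "command",
        "command": "python3 \"$CLAUDE_PROJECT_DIR\"/.claude/hooks/check_syntax.py"
      }]
    }]
  }
}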
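A hook configuration like the one above still needs a command that performs the syntax check. One possible shape is sketched below. It assumes the hook receives a JSON payload on stdin with the edited path under `tool_input.file_path`, which is how Claude Code's hooks documentation describes Write and Edit calls at the time of writing, and that exit code 2 signals a blocking error; check both against your version. The function names are hypothetical, and the check covers Python files only.

```python
import ast
import json
import sys
from typing import Optional


def syntax_error_message(path: str, source: str) -> Optional[str]:
    """Return a one-line error if `source` is invalid Python, else None."""
    if not path.endswith(".py"):
        return None  # only Python files are checked in this sketch
    try:
        ast.parse(source, filename=path)
    except SyntaxError as exc:
        return f"{path}:{exc.lineno}: {exc.msg}"
    return None


def run_hook(stdin_text: str) -> int:
    """Return the hook's exit code: 0 to allow, 2 to report a syntax error."""
    try:
        payload = json.loads(stdin_text)
        path = payload.get("tool_input", {}).get("file_path", "")
    except (ValueError, AttributeError):
        return 0  # unexpected payload shape: do not get in the agent's way
    try:
        with open(path, encoding="utf-8") as handle:
            source = handle.read()
    except (OSError, UnicodeDecodeError):
        return 0  # file missing or unreadable: nothing to check
    message = syntax_error_message(path, source)
    if message:
        print(message, file=sys.stderr)  # stderr is what gets shown to the agent
        return 2
    return 0


# As a hook entry point, the script would end with:
#     sys.exit(run_hook(sys.stdin.read()))
```

Failing open on unexpected input is a deliberate choice here: a broken guardrail script should not stall a whole refactoring session, while a real syntax error still gets reported.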
See the safety checklist for more enforcement patterns.
Scaling Beyond 500K Lines
For codebases above 500K lines, a single four-week sprint is not enough. Scale with repeated cycles:
| Codebase size | Cycles | Total timeline | Parallel agents | Estimated cost (DIY) |
|---|---|---|---|---|
| 100K–200K lines | 1 cycle | 4 weeks | 4–6 | $400–800 |
| 200K–500K lines | 1–2 cycles | 4–8 weeks | 6–8 | $800–1,600 |
| 500K–1M lines | 2–3 cycles | 8–12 weeks | 8–10 | $1,600–3,200 |
| 1M–2M+ lines | 3–5 cycles | 12–20 weeks | 10+ | $3,200–8,000 |
Each cycle targets a different subsystem. Use the first cycle’s architectural inventory to prioritize: start with the subsystem that has the most technical debt, the most pain, or the most upcoming feature work that would benefit from a modern foundation.
For codebases above 1M lines, consider Red Hat’s agent mesh approach: specialized agents for different roles (code analysis, test generation, refactoring, security review) rather than generalist agents that do everything. amux’s shared board coordinates these specialist agents through task assignment and status tracking.
Anti-Patterns That Kill Legacy Modernization Projects
Big bang rewrite
Rewriting the entire codebase at once, then switching over. This fails 50–70% of the time because requirements drift during the rewrite, the new system has its own bugs, and there is no incremental validation. AI agents make this temptation worse (“it’s so fast, let’s just rewrite everything!”). Use the Strangler Fig pattern instead. Module by module. One merge at a time.
Refactoring without tests
Skipping Week 2 because “we’ll just run it and see if it works.” Legacy code has implicit behavior that is not obvious from reading the code. Without characterization tests, you will not know whether a refactoring changed behavior until a customer reports a bug. Test generation is the foundation. Do not skip it.
Trusting agent output without review
AI agents produce plausible-looking code that passes tests but introduces subtle regressions. Cognitive debt research shows that AI-co-authored code has 1.7x more issues and 2.74x more security vulnerabilities. Every agent PR gets human review. The review can be faster than reviewing human code (agents produce consistent style), but it cannot be skipped.
Modernizing everything at once
Trying to upgrade the language version AND migrate the framework AND restructure the architecture in one pass. Each of these is a separate project. Stack them sequentially: language upgrade first (smallest blast radius), framework migration second, architecture changes last. AI agents are fast, but compounding three types of change in one refactoring makes failures undiagnosable.
No merge order discipline
Merging refactored modules in random order creates cascading conflicts. Always merge bottom-up through the dependency graph: shared utilities and data models first, then the modules that depend on them, then top-level entry points last. amux’s kanban board makes this visible: columns for todo, doing, review, and done map directly to the merge pipeline.
FAQ
What languages work best for AI-assisted legacy modernization?
Python, JavaScript/TypeScript, and Java have the best results because AI models have seen the most training data for these languages. PHP, Ruby, and C# also work well. COBOL, Fortran, and other rare languages produce mixed results — models can read the code but struggle with idiomatic transformations. For rare languages, use the agent for analysis and test generation, but handle the refactoring more interactively with Aider where you guide each change.
Can I use this approach for a monolith-to-microservices migration?
Partially. AI agents can identify service boundaries, extract modules, and set up new service scaffolding. But the architectural decisions (how to split the data model, how to handle cross-service transactions, what communication pattern to use) require human judgment. Use this four-week approach for each service extraction: map the module, generate its tests, refactor it into a standalone service, and validate. The overall microservices strategy is a multi-month project with this guide applied in cycles.
How do I handle a codebase with zero existing tests?
This is the most common scenario for legacy code, and it is exactly what Week 2 addresses. Start by installing a test framework (pytest, Jest, etc.) and running the first characterization test agent on the simplest module. Expect to iterate: the first round of generated tests will have issues, but each module gets easier as the agent builds familiarity with the codebase’s patterns. Aim for 60–80% coverage on critical paths, not 100% on everything.
What if the legacy codebase is too large for an AI context window?
Claude Code does not need the whole codebase in context at once: it reads files on demand and compacts its context as the session grows. For the Week 1 architectural inventory, point the agent at the top-level directory and let it traverse the tree. It will read files as needed, building its understanding incrementally. For very large codebases (1M+ lines), split the inventory task across multiple agents, each covering a different subsystem. See our context engineering guide for strategies on managing agent context at scale.
Is this approach suitable for regulated industries (healthcare, finance)?
Yes, with additional guardrails. For regulated codebases: (1) Run agents against a local/private model if data cannot leave your network (use Aider with a self-hosted model; if your compliance rules allow a cloud account you control, a managed service such as AWS Bedrock is an option, but it is not self-hosted), (2) Add compliance-specific characterization tests in Week 2 (regulatory calculations, audit trail behavior, data handling paths), (3) Require two-person review for agent PRs that touch compliance-critical modules, and (4) Keep a detailed audit log of every agent-generated change (amux session logs provide this automatically). See our security hardening guide for more.
How does this compare to Morgan Stanley’s approach?
Morgan Stanley built a custom AI tool that converted 9 million lines of code, saving an estimated 280,000 developer hours. Their approach was enterprise-scale: custom-trained models, dedicated infrastructure, months of development on the tooling itself. The approach in this guide uses off-the-shelf AI coding agents and is designed for teams of 1–5 developers working on 100K–2M line codebases. You trade Morgan Stanley’s scale for speed and cost: they spent millions building bespoke tooling; you spend $400–1,600 using existing tools in a structured workflow.
Further Reading
- How to Review AI-Generated Code — structured morning review workflows for agent fleets
- AI Coding Agent Safety Checklist — 25 rules for safe agent operation
- Context Engineering — structuring what agents see so they produce reliable output
- Running 10+ Agents — the getting-started guide for parallel agent orchestration
- Git Conflict Avoidance — worktree strategies for parallel agents
- Spec-Driven Development — write specs first, let agents implement against them
- AI-Generated Technical Debt — how to prevent agents from creating new debt while fixing old debt
- AI Agent Quality Gates — automated validation pipelines for AI-generated code
- AI Coding Tools Compared (2026) — choosing the right tools for the job