Introduction
In February 2026, Anthropic argued that Claude Code could meaningfully reduce the cost of COBOL modernization. Markets reacted accordingly: IBM fell about 13% that day. The argument is partly right. AI is genuinely useful for analysis, discovery, and documentation. But translating COBOL is not a code generation problem. It’s a translation problem: reproducing existing logic with perfect fidelity in another language (semantic equivalence). And our own testing, an independent Swimm study, and Anthropic’s accidentally exposed source code all point to the same conclusion.
What We Tested
We ran Claude Code against four COBOL programs from our library, progressing from a 200-line batch program with no CICS to a 4,229-line program with dozens of EXEC CICS commands and hundreds of complex constructs. Swimm independently tested Claude Code on two real Medicare COBOL programs — 4,692 and 18,835 lines — running each three times with approximately 70 tool calls per run.
Five Failure Patterns
Despite differences in scale and domain, the results converged on five consistent failure patterns. While the sample size is limited, these failures trace to architectural constraints rather than program-specific edge cases — a point the leaked source confirms. At smaller scales, Claude produces readable, plausible code that can often be manually validated. The failure modes become material as program size, statefulness, and dependency complexity increase.
1. Silent data corruption
Wrong values that look right. In our tests: fixed-length COBOL files read with line-oriented Java I/O, financial fields parsed with inconsistent byte widths, NUMVAL-C semantics approximated with string replacement, unparseable values silently converted to zero.
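To make these concrete: below is a minimal Java sketch, with a hypothetical record layout and field positions, contrasting the line-oriented read we observed with a byte-width read that matches the record definition, and showing why “default to zero” parsing hides corruption.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.math.BigDecimal;
import java.nio.charset.Charset;

public class FixedLengthReader {
    // Hypothetical layout: 80-byte records, amount field in bytes 20-30.
    private static final int RECORD_LENGTH = 80;

    // The observed failure: line-oriented I/O on a fixed-length file, e.g.
    //   new BufferedReader(new FileReader(path)).readLine();
    // Fixed-length records have no newline terminator, so record boundaries
    // go silently wrong whenever a data byte happens to equal 0x0A.

    // Byte-width read that mirrors the COBOL record definition.
    static byte[] readRecord(DataInputStream in) throws IOException {
        byte[] record = new byte[RECORD_LENGTH];
        in.readFully(record); // throws EOFException on a short record
        return record;
    }

    // The observed NUMVAL-C shortcut: strip "$" and "," and return zero on
    // failure. Even a simplified parse should reject bad input loudly
    // rather than converting it to 0.00 and corrupting downstream totals.
    static BigDecimal parseAmount(byte[] record, Charset cs) {
        String raw = new String(record, 19, 11, cs).trim();
        String cleaned = raw.replace("$", "").replace(",", "");
        return new BigDecimal(cleaned); // NumberFormatException, not zero
    }
}
```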
2. Missing business logic
Logic that exists in the source but is absent from the output. In our tests: a COBOL bug correctly identified — then reimplemented rather than preserved. Legacy bugs are often load-bearing and must be preserved during initial migration. In Swimm’s: eleven dropped conditions, entire subsystems missing from all three runs of the 18,835-line program, and the main payment paragraph returning zero extracted rules.
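What preservation looks like in practice, as a hedged sketch with hypothetical names and values: the translated code keeps the defect, documents it, and defers the fix to the modernization phase.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class FeeCalc {
    private static final BigDecimal RATE = new BigDecimal("0.0375");

    // Hypothetical case: the COBOL source truncates cents (COMPUTE with no
    // ROUNDED clause). Downstream reconciliation balances against truncated
    // totals, so the "bug" is load-bearing. A faithful translation keeps it
    // and flags it for the later modernization phase.
    static BigDecimal fee(BigDecimal gross) {
        return gross.multiply(RATE)
                .setScale(2, RoundingMode.DOWN); // preserve truncation; do not
                                                 // silently change to HALF_UP
    }
}
```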
3. Architectural substitution
A different system, not a translated one. The CICS pseudo-conversational model was replaced with a console loop, TDQ queue writes with file appends, and XCTL program chaining with a boolean flag. Transaction boundaries, commarea persistence, and queue semantics are not present in the output in any form.
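The shape of that substitution, reconstructed as a compilable sketch with hypothetical program and variable names rather than a verbatim excerpt from test output:

```java
import java.util.Scanner;

public class SubstitutedFlow {
    // Reconstruction of the observed pattern. Source side (hypothetical):
    //   EXEC CICS XCTL PROGRAM('PAYP200') COMMAREA(WS-COMM) END-EXEC.
    public static void main(String[] args) {
        Scanner console = new Scanner(System.in);
        byte[] comm = new byte[100];     // COMMAREA flattened to a byte array
        boolean runPayp200 = true;       // XCTL program chaining reduced to a flag

        while (console.hasNextLine()) {  // pseudo-conversational model replaced
            console.nextLine();          // by an interactive console loop
            if (runPayp200) payp200(comm);
        }
        // Lost in translation: the task/transaction boundary at each
        // pseudo-conversational return, COMMAREA persistence across task end,
        // and XCTL's no-return transfer of control.
    }

    static void payp200(byte[] comm) { /* translated target program */ }
}
```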
4. Unverifiable output
Non-determinism makes reliable validation impractical at scale — even when an output appears correct. Swimm’s three identical runs on the same source produced 140, 199, and 226 business rules — a 42% spread. Without a stable baseline, “just validate the output” assumes you know what correct looks like. In a legacy migration, you often don’t.
5. Fabricated values
Separate from non-determinism, the model generates plausible-but-wrong content. Swimm found nine regulatory dollar amounts that were wrong — off by $12 to $500 — indistinguishable from correct output without independent ground-truth verification.
These failure patterns typically don’t show up in compilation or basic testing. Without deep QA against a known ground truth, they can reach production.
Why These Are Architectural
In March 2026, Anthropic appears to have exposed Claude Code’s full TypeScript source through an npm package sourcemap. Independent analysis of that source explains why each failure pattern is structural — not a bug to be fixed, but a consequence of how the system is built. Modern agentic pipelines improve retrieval and orchestration, but they do not change the underlying constraints: the model still operates on partial context, without a persistent AST-level representation of the full program.
The absence of a reliable, persistent AST-level representation contributes directly to silent data corruption and missing business logic. The CLAUDE.md analysis states it plainly: “Grep is text matching, not an AST.” An Abstract Syntax Tree is the same structure a deterministic transpiler builds as its first step — parsing source into a tree that captures syntax and semantics before any transformation. Without it, COBOL’s qualified name references, COPY book expansions, PERFORM THRU ranges, and nested procedures are handled by text matching. Missed references, incorrect scoping, and misinterpreted semantics follow directly.
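A minimal sketch of the distinction, with illustrative types: two COBOL records can both declare a field named AMOUNT, and only the tree structure, not the text, can tell the references apart.

```java
import java.util.HashMap;
import java.util.Map;

// Why qualified COBOL names need a symbol table, not grep:
//   01 CLAIM-REC.  05 AMOUNT PIC 9(7)V99.
//   01 PAY-REC.    05 AMOUNT PIC 9(9)V99.
//   MOVE AMOUNT OF CLAIM-REC TO AMOUNT OF PAY-REC
public class SymbolTable {
    record Field(String owner, String name, int digits, int scale) {}

    private final Map<String, Field> byQualifiedName = new HashMap<>();

    void declare(Field f) {
        byQualifiedName.put(f.name() + " OF " + f.owner(), f);
    }

    // "AMOUNT OF CLAIM-REC" and "AMOUNT OF PAY-REC" resolve to different
    // fields with different widths. A text search for "AMOUNT" cannot tell
    // them apart, which is exactly how inconsistent byte widths creep in.
    Field resolve(String name, String owner) {
        return byQualifiedName.get(name + " OF " + owner);
    }
}
```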
Context window limits and compaction failures cause coverage collapse and architectural substitution. Claude Code operates within ~167,000 tokens of working memory and cannot hold a large program in view all at once. When the model cannot see the full program, it reconstructs the missing structure, and architectural substitution follows directly. A comment in the leaked autoCompact.ts documents the scale of the problem: 1,279 sessions with 50+ consecutive compaction failures, up to 3,272 in a single session, wasting ~250,000 API calls per day globally (Alex Kim, March 31, 2026).
Stochastic generation causes unverifiable output. The model does not produce the same output from the same input. This is not a bug — it is how LLMs work. Every run is a different reconstruction of what the program might mean.
A ~30% false claims rate contributes to fabricated values. VentureBeat’s reporting on the leaked source cited internal benchmark notes showing a false claims rate of approximately 29–30% for the model version powering Claude Code. That figure is not a public Anthropic disclosure, but it is consistent with Swimm’s empirical findings.
No prompt instruction fully addresses any of these. You cannot prompt your way to an AST, a larger context window, deterministic output, or a lower false claims rate.
The Same Failures Apply to Documentation
Using an LLM to document COBOL before migration is widely attempted, and it is subject to the same constraints. Context window limits mean large programs are only partially documented. The ~30% false claims rate means specific values and business rules may be fabricated. Non-determinism means two runs produce different documentation of the same program. Documentation errors are less immediately catastrophic than translation errors, but migration decisions and test coverage depend on them. A deterministic documentation engine, working at the AST level, simulating execution, and producing structured output from the full source, is the most reliable foundation for a migration.
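As a sketch of what “structured output” can mean here, with illustrative field names only: one record per COBOL paragraph, derived from the AST rather than sampled from text.

```java
import java.util.List;

// Illustrative shape only; a real engine would carry far more detail.
record ParagraphDoc(
        String paragraph,              // e.g. "2100-CALC-PREMIUM"
        List<String> performedFrom,    // every PERFORM that can reach it
        List<String> reads,            // variables read, resolved via the AST
        List<String> writes,           // variables written
        List<String> branchConditions  // conditions governing each branch
) {}
```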
The Right Architecture
This is not an argument against modernization. Systems should be improved over time. But in regulated environments, migration and modernization are separate phases. The first requirement is behavioral equivalence — producing a system that behaves exactly like the original. In legacy systems, requirements are often incomplete or undocumented — which is why equivalence to the source system becomes the only reliable baseline. Only once that baseline is established can the system be safely refactored and improved.
The practical consequence of choosing the wrong approach shows up in QA. A deterministic transpiler preserves structure, control flow, and data layout. Issues in the output are syntactic — compile errors, runtime scaffolding, mechanical cleanup. Determinism makes discrepancies detectable — which is what makes QA bounded. If a translation tool produces high, measurable accuracy, engineers only need to review a small number of flagged cases. For the four programs we tested, that’s weeks of targeted verification.
LLM output requires a fundamentally different QA posture. Because AI models generate code probabilistically, engineers cannot predict where errors will appear, so testing must be exhaustive. Even with strong prompting and review, the absence of a deterministic baseline means correctness cannot be assumed: logic is reinterpreted, architecture is sometimes substituted, data handling is approximated. QA is no longer asking “did we preserve behavior?” but “what does this system actually do?” Answering that requires full functional regression, test design from scratch, rule extraction and validation, and open-ended fix-and-retest cycles. When behavior cannot be assumed, QA expands into full regression, which typically takes months, and the effort scales with every year of accumulated edge cases in the original system. The problem compounds when the documentation was also AI-generated: if the test cases are derived from an unreliable characterization of the original system, validation cannot confirm correctness even when it appears to pass. The difference isn’t just time. It’s whether QA is confirming correctness or discovering it.
The answer is not AI or deterministic tools. It’s a pipeline that uses each where it belongs.
Step 1: Deterministic translation and documentation. A transpiler that builds an actual AST, resolves symbol tables, and processes the entire file in a single pass produces the same Java output every time from the same COBOL input. Simultaneously, a documentation engine produces structured JSON of both the source COBOL and target Java, capturing control flow, variable states, and branch decisions at the AST level. A deterministic transpiler does not make interpretive choices. Every transformation is traceable back to the source.
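As a sketch, with illustrative interface names rather than any particular product’s API, the pipeline has this shape:

```java
// Minimal sketch of the deterministic pipeline described above. All names
// are illustrative; the point is the shape, not a specific implementation.
public final class Transpiler {
    interface Ast {}                                   // parse tree of the full file
    interface SymbolTable {}                           // every reference resolved

    interface Parser   { Ast parse(String cobol); }
    interface Resolver { SymbolTable build(Ast ast); }
    interface Emitter  { String emit(Ast ast, SymbolTable symbols); }

    private final Parser parser;
    private final Resolver resolver;
    private final Emitter emitter;

    Transpiler(Parser p, Resolver r, Emitter e) {
        this.parser = p; this.resolver = r; this.emitter = e;
    }

    String translate(String cobolSource) {
        Ast ast = parser.parse(cobolSource);       // whole program, one pass
        SymbolTable symbols = resolver.build(ast); // qualified names, COPY books,
                                                   // PERFORM THRU all resolved
        // Rule-based emission: no sampling, no context window. The same
        // COBOL input produces byte-identical Java output on every run.
        return emitter.emit(ast, symbols);
    }
}
```

Each stage is a pure function of its input, which is what makes “same COBOL in, same Java out” a property rather than a hope.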
Step 2: Deterministic equivalence testing. Determinism alone is not sufficient. The combination of deterministic translation and deterministic equivalence testing — comparing structured representations of source and target behavior — is what provides confidence in correctness. JUnit tests, generated from the same structured inputs, then confirm runtime behavioral correctness. Both tests have ground truth. Both are repeatable.
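A sketch of what such a generated test can look like, reusing the hypothetical FeeCalc example from above; the input and expected value stand in for fixture data emitted by the source-side simulation.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.math.BigDecimal;
import org.junit.jupiter.api.Test;

// Generated equivalence test (illustrative; assumes FeeCalc is in the same
// package). The fixture values would come from the structured JSON captured
// on the COBOL side.
class FeeCalcEquivalenceTest {

    @Test
    void matchesCobolBaselineForStandardClaim() {
        BigDecimal gross = new BigDecimal("1042.50");
        BigDecimal expected = new BigDecimal("39.09"); // source-side baseline,
                                                       // truncation included
        // Ground truth plus repeatability: a failure here is a real
        // behavioral diff, not model noise to rerun.
        assertEquals(expected, FeeCalc.fee(gross));
    }
}
```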
Step 3: AI assistance, bounded and verifiable. Once equivalence is confirmed, AI is genuinely useful: troubleshooting compile errors, building JUnit tests from structured documentation, and modernizing the Java in phases. Because the output is standard, runtime-free Java, developers and AI coding assistants can read it, refactor it, and incrementally improve it — breaking monoliths into services, replacing legacy patterns with modern ones. Clean, equivalent Java is a foundation AI can actually work from.
Deterministic tooling for the parts where correctness is required. AI for the parts where it’s useful. In modernizing legacy code, accuracy matters more than imagination.