A file with 261 tests should be one of the safest files in your codebase. That’s the deal, right? Write more tests, sleep better at night.
Last year, we ran our analysis platform against a client’s system — 58 repositories, thousands of files, twelve months of git history and bug data from Azure DevOps. We were looking for the files most likely to generate production bugs. The usual suspects: undertested code, high complexity, low coverage.
One file rewrote our assumptions.
The File That Should Have Been Safe
It was a core backend service — the kind that touches everything. Pricing calculations, dashboard logic, API orchestration. If the system had a beating heart, this was it.
261 tests surrounded it. Engineers had invested serious effort into making this file reliable. By every conventional metric — coverage percentage, test count, linting scores — it would have gotten a green checkmark from SonarQube, a thumbs-up from any CI pipeline.
Our analysis flagged it as the second-highest-risk file in the entire system. Score: 83.4 out of 100. Forty-three bugs traced directly back to it over twelve months.
Meanwhile, a file with zero tests and low activity was barely a risk at all. It hadn’t changed in months. No bugs. No surprises. Just sitting there, doing its job.
What was going on?
Test Count Is a Snapshot. Risk Is a Movie.
Here’s the problem with test coverage as a safety metric: it tells you how much coverage existed when someone last wrote tests. It says nothing about what happened since.
This file changed almost every sprint. New requirements meant new code paths — pricing edge cases, dashboard filters, integration points with other services. Each change introduced behavior that the existing 261 tests didn’t cover. The tests were still passing, because they still tested what they were written to test. But the file had moved on.
Think of it like a security camera pointed at the front door while someone walks in through the window. The camera still works. It just doesn’t cover what matters anymore.
The real metric isn’t how many tests exist. It’s the relationship between change velocity and test evolution. If a file changes 30 times a quarter but its test suite only updates twice, coverage is decaying in real time — no matter what the dashboard says.
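That signal is cheap to pull out of git history. Here's a minimal sketch in Python, assuming a local clone; the file paths and the 90-day window are placeholders for illustration, not anything specific to our tooling:

```python
# Minimal sketch: how often did a source file change vs. how often did its tests?
# Paths and the 90-day window are illustrative placeholders.
import subprocess

def commit_count(path: str, since: str = "90 days ago") -> int:
    """Count commits that touched `path` since the given date."""
    out = subprocess.run(
        ["git", "log", "--oneline", f"--since={since}", "--", path],
        capture_output=True, text=True, check=True,
    )
    return sum(1 for line in out.stdout.splitlines() if line.strip())

src_changes = commit_count("src/Services/PricingService.cs")          # hypothetical path
test_changes = commit_count("tests/Services/PricingServiceTests.cs")  # hypothetical path

# 30 source changes against 2 test changes means coverage is decaying,
# whatever the dashboard percentage says.
decay_ratio = src_changes / max(test_changes, 1)
print(f"{src_changes} source changes vs {test_changes} test changes "
      f"(decay ratio {decay_ratio:.1f})")
```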
Introducing the Kill Zone
We built a concept around this: the Kill Zone. It identifies files using two dimensions:
- High churn — the file changes frequently
- High coupling — when it changes, other files have to change with it
Files in the Kill Zone generate the most bugs regardless of how many tests surround them. They’re moving targets in interconnected systems. Tests can’t keep up because the file won’t sit still, and every change cascades.
Our flagged file was a textbook case. High churn (constant sprint-over-sprint modifications), high coupling (changes in this file forced updates across multiple services), and a bug trail that proved it. The 261 tests were fighting last quarter’s war while this quarter’s requirements had already moved the battlefield.
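To make the idea concrete, here's a rough sketch of how a Kill Zone score could be computed from nothing but git history. The churn-times-coupling formula is a deliberate simplification for illustration, not the scoring model behind our analysis:

```python
# Rough sketch of a Kill Zone score: churn (how often a file changes) multiplied
# by coupling (how many distinct files tend to change in the same commits).
# The formula is an illustrative simplification.
import subprocess
from collections import defaultdict

def commits_with_files(since: str = "90 days ago") -> list[set[str]]:
    """For each commit in the window, return the set of files it touched."""
    out = subprocess.run(
        ["git", "log", f"--since={since}", "--name-only", "--pretty=format:@@@"],
        capture_output=True, text=True, check=True,
    ).stdout
    commits, current = [], set()
    for line in out.splitlines():
        if line == "@@@":                 # marker line emitted once per commit
            if current:
                commits.append(current)
            current = set()
        elif line.strip():
            current.add(line.strip())
    if current:
        commits.append(current)
    return commits

def kill_zone_scores(commits: list[set[str]]) -> dict[str, int]:
    churn = defaultdict(int)       # commits that touched the file
    coupled = defaultdict(set)     # distinct files that changed alongside it
    for files in commits:
        for f in files:
            churn[f] += 1
            coupled[f] |= files - {f}
    # High churn x high coupling = Kill Zone, regardless of test count.
    return {f: churn[f] * len(coupled[f]) for f in churn}

top = sorted(kill_zone_scores(commits_with_files()).items(),
             key=lambda kv: kv[1], reverse=True)[:10]
for path, score in top:
    print(f"{score:6d}  {path}")
```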
Static analysis tools measure the code as it exists right now. They’ll tell you about complexity scores, code smells, duplication. All useful. But none of it captures temporal dynamics — how the code evolves over time, how fast it changes relative to its safety net, how deeply its changes propagate.
Why This Matters for Modernization
If you’re running a legacy system — especially a large one with dozens of repositories — you’re constantly making prioritization decisions. Which files to refactor first. Where to invest in testing. What to leave alone.
Most teams make these decisions based on one of two things:
- Gut feeling — “That module has always been trouble”
- Static metrics — coverage reports, complexity scores, SonarQube dashboards
Neither captures the full picture.
Gut feeling is experience-based and often correct, but it doesn’t scale. The engineer who knows which files are dangerous might leave. The knowledge walks out with them.
Static metrics are measurable and consistent, but they’re snapshots. A file that was well-tested six months ago might be severely exposed today if it’s been changing faster than its tests. And a file with zero tests but zero changes is probably fine — it’s stable, it works, and nobody’s touching it.
What you need is a temporal view: how has this file’s risk profile evolved? Is its test coverage keeping pace with its change rate? When it changes, what else breaks?
The Broader Pattern
This wasn’t an isolated finding. Across the client’s 58 repositories, the correlation between test count and actual risk was weak. Some of the most dangerous files were “well tested.” Some of the least risky had minimal coverage.
The pattern held for other metrics too:
- Repository health scores showed 45 out of 58 repos as “high risk” — which meant the metric wasn’t useful without further analysis. When everything is high risk, nothing is high risk. We had to build a hierarchy: problem areas → root causes → affected files → specific actions.
- Bug pattern clustering revealed cross-repo patterns invisible to individual teams. Pricing calculation bugs (97 defects) spanned four different services. No single team owned enough context to see the pattern.
- Production escape rate was 52% — more than half of all bugs were found by customers, not engineers. The shift-left pipeline was inverted.
The common thread: surface-level metrics were misleading. Not wrong, exactly — just incomplete. They answered “what does the code look like right now?” but not “how is the code behaving over time?”
What To Measure Instead
If you take away only one thing, make it this: shift your metrics from static to temporal.
Instead of test count → measure the ratio of file changes to test changes. If a file has been modified 15 times this quarter and its tests have been modified once, that’s a coverage decay signal.
Instead of complexity score → measure churn rate relative to coupling. A complex file that never changes is stable. A moderately complex file that changes every sprint and is coupled to 12 other files is a ticking bomb.
Instead of coverage percentage → measure the Kill Zone. High churn × high coupling = highest risk, regardless of how many tests exist.
Instead of repository-level health → measure problem area clustering. Group bugs by business domain, not repository. Cross-repo patterns are where the real systemic issues hide.
These aren’t exotic metrics. The data already exists in your git history, your CI pipeline, and your issue tracker. You just need to connect the dots across time, not just across code.
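As a small example of that last point, here's a rough sketch of clustering bugs by business domain rather than by repository. The CSV layout and the keyword map are assumptions for illustration; in practice the records would come from your issue tracker (Azure DevOps work items, in this case):

```python
# Rough sketch: group bug work items by business domain instead of by repository.
# The bugs.csv layout (title, repository columns) and the keyword map are
# illustrative assumptions, not a real export format.
import csv
from collections import Counter, defaultdict

DOMAIN_KEYWORDS = {                 # hypothetical mapping; tune to your own domains
    "pricing":   ["price", "pricing", "discount", "quote"],
    "dashboard": ["dashboard", "chart", "widget", "filter"],
    "billing":   ["invoice", "billing", "payment"],
}

def classify(title: str) -> str:
    lowered = title.lower()
    for domain, words in DOMAIN_KEYWORDS.items():
        if any(word in lowered for word in words):
            return domain
    return "other"

bugs_per_domain = Counter()
repos_per_domain = defaultdict(set)

with open("bugs.csv", newline="") as f:        # hypothetical export of bug work items
    for row in csv.DictReader(f):
        domain = classify(row["title"])
        bugs_per_domain[domain] += 1
        repos_per_domain[domain].add(row["repository"])

# A domain with many bugs spread across several repositories is a systemic
# pattern that no single team can see from inside its own repo.
for domain, count in bugs_per_domain.most_common():
    print(f"{domain:10s} {count:4d} bugs across {len(repos_per_domain[domain])} repos")
```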
The Uncomfortable Truth
The team had treated that file as safe for months. Not because they were lazy, but because every signal told them it was fine. High test count. Passing CI. No code smell warnings.
It took correlating git history, bug data, and change coupling to see what was actually happening: a file that changed too fast for its safety net, dragging other files with it every time.
The lesson isn’t “tests are useless.” Tests are essential. The lesson is that test count is a lagging indicator and change velocity is a leading one. If you’re prioritizing what to fix based on coverage reports alone, you might be staring at exactly the wrong files.
Legacy systems don’t fail because of what they are. They fail because of how they change.
This analysis was performed using Bayefix, a platform built by Futurify that finds where technical debt actually lives — not just code smells, but the real root causes across code, process, and test coverage.
Managing a legacy .NET or Java codebase? Talk to us before your next rewrite.