Rick Pollick

Valuestream Episode 2: Trust the Number That Hurts. The Metrics That Lie in the Agentic Era, and What to Measure Instead

Two leaders, two trusted numbers, two metrics that looked great and lied. Episode 2 of Valuestream walks the delivery metrics that were built for a pre-agentic world and quietly stopped working the moment agents started authoring code. The five-metric replacement layer for engineering (outcome latency, reviewer load, rework ratio, time to restore, production confidence) plus dependency density for programs. Full show notes, the framework, the case study, the migration plan, and the Spotify embed.


Episode 2 of Valuestream is live. This one's about the metrics that have been quietly lying to you since agents started authoring code. DORA on the engineering side. Critical path on the program side. Both were beautiful frameworks for a world where the thing producing the work was always a human. That world is gone, and the math hasn't caught up.

The cost of a metric that lies isn't that it's wrong. It's that it's confident. It walks into the board meeting, flashes green, and buys you another quarter of believing motion is progress. This episode is about what to put on the dashboard instead, plus a real case where the numbers were elite and the program was broken.

Listen on Spotify above, or on Apple Podcasts and the rest once distribution propagates. Companion essay below has the full framework and the case data.

The opening

You and your team deploy a lot. But how much new value did your customers actually see this quarter? Most of the leaders I'm talking to right now don't know how to answer. The dashboard says elite. The gut says something's off. And nobody can explain the gap.

Here's the gap. The number went up. The value didn't. The metric you've been bragging about had quietly stopped measuring the thing folks thought it measured.

A program director walks through a critical path chart. Clean. Green. The longest chain through the program lands two days inside the deadline. Six weeks later that program has slipped by nine weeks, and the chart never saw it coming.

Two leaders. Two trusted numbers. Two metrics that looked great and lied.

That's the show today. The metrics that lie to you in the agentic era, and what to put in their place.

Intake: why DORA and critical path are both failing the same way

Most of the metrics we trust to tell us how delivery is going were designed for a world that doesn't exist anymore. And they're failing quietly. Not with an error message. With a green light.

Start with DORA. The research that produced it assumed something simple, and at the time, true. Behind every commit, every pull request, every deploy, there was a roughly fixed amount of human attention. Deployment frequency was a decent proxy for throughput, because each deploy cost real engineering effort. Lead time measured the friction between an engineer finishing work and a customer seeing it. Change failure rate caught the price of moving too fast. Time to restore measured how well your people cleaned up their own mistakes.

Every one of those assumptions breaks the moment an agent can open a pull request without a human authoring it. Deployment frequency goes up, because the cost of producing a change collapses. Lead time falls, because there's no human in the middle waiting to review. The whole framework drifts, because the thing in the denominator, human effort, stops tracking the thing in the numerator, code shipped.

I'm not telling you DORA is dead. I'm telling you that anyone using the original four metrics as their headline number in 2026 is reading a fever chart and calling it a thermometer.

Now critical path. Same kind of broken assumption, different discipline. Critical path was a beautiful tool when programs looked like Gantt charts. A line of tasks, one feeding the next, with a single longest sequence that set the floor for delivery. Drop a task on the chain, the program slips by that much. Clean. Defensible. Easy to put in front of an executive.

The problem is that almost nothing your teams are building right now looks like a chain. A payments cutover touches the ledger, the reconciler, the fraud rules, the rate limiter, three internal APIs, and a vendor SDK that got deprecated last quarter. An agentic AI rollout depends on a feature store, an eval harness, a routing layer, a guardrail service, and a data contract that two upstream teams own and neither one maintains. The longest single chain through that mess is still computable. It just stops predicting when the thing will actually ship.

Two headline metrics, in two different parts of the org, failing the same way. Both were built on an assumption about how work gets done. The work changed underneath them.

Here's the strategic intent for today. The cost of a metric that lies isn't that it's wrong. It's that it's confident. It walks into the board meeting, flashes green, and buys you another quarter of believing motion is progress.

And motion is the trap. An agent can open, review with another agent, and merge dozens of dependency bumps and refactors in a day. Your weekly deploy count looks heroic. Your customers see nothing. A program can show forty work items all marked on track, while the six dependencies between them quietly multiply into a knot nobody can ship through. The chart looks fine. The date is already gone.

The leaders whose 2027 board decks survive scrutiny are the ones doing the harder work now. Measuring what these new tools make easy to hide.

Flow: the five-metric engineering layer

The move isn't to throw DORA out. Keep the original four as compliance artifacts, the numbers you still report up the chain, and add a second layer underneath that measures what agents make harder to see. I run a five-metric replacement layer with my clients. Let me walk it.

1. Outcome latency

The honest replacement for lead time. Lead time measures first commit to production. Outcome latency measures something harder and truer. The time from a human stating an intent (a ticket going in progress, a product brief getting accepted, a customer commitment getting made) to the verified delivery of that outcome in production. Verified. Behind a flag, confirmed by an analytics signal, actually in front of a customer.

It's harder to instrument, because intent lives in messy product systems and verified value needs a real signal. It's also the only number that maps back to value. If you measure one new thing this year, measure that one.
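
To make the difference concrete, here is a minimal sketch that computes both numbers from the same records. The record shape, the field names (`intent_at`, `first_commit_at`, `deployed_at`, `verified_at`), and the dates are all assumptions for illustration. Your intent and verification timestamps will live wherever your product and analytics systems keep them.

```python
from datetime import datetime
from statistics import median

# Hypothetical records: one per outcome, not one per commit.
# intent_at   = a human stated the intent (e.g. ticket moved to in progress)
# verified_at = an analytics signal confirmed customers saw it (None = not yet)
outcomes = [
    {"id": "A", "intent_at": datetime(2026, 1, 5), "first_commit_at": datetime(2026, 1, 12),
     "deployed_at": datetime(2026, 1, 13), "verified_at": datetime(2026, 1, 27)},
    {"id": "B", "intent_at": datetime(2026, 1, 6), "first_commit_at": datetime(2026, 1, 20),
     "deployed_at": datetime(2026, 1, 20), "verified_at": None},
    {"id": "C", "intent_at": datetime(2026, 1, 8), "first_commit_at": datetime(2026, 1, 9),
     "deployed_at": datetime(2026, 1, 11), "verified_at": datetime(2026, 2, 7)},
]

def days(start, end):
    return (end - start).total_seconds() / 86400

def lead_time_days(outcome):
    # Classic lead time: first commit to production.
    return days(outcome["first_commit_at"], outcome["deployed_at"])

def outcome_latency_days(outcome):
    # Stated intent to verified delivery. Unverified outcomes have no latency yet.
    if outcome["verified_at"] is None:
        return None
    return days(outcome["intent_at"], outcome["verified_at"])

latencies = [outcome_latency_days(o) for o in outcomes]

median_lead_time = median(lead_time_days(o) for o in outcomes)
median_outcome_latency = median(x for x in latencies if x is not None)
unverified_count = latencies.count(None)

print(median_lead_time, median_outcome_latency, unverified_count)
```

On this made-up data the median lead time is one day and the median outcome latency is twenty-six, with one outcome still unverified. Reporting the unverified count next to the latency keeps deployed-but-unconfirmed work from quietly dropping out of the number.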

Why does lead time lie in the first place? Three patterns I see in the field, over and over.

Pull request inflation. A team turns on autonomous agents, the weekly pull request count triples, and the reviewers can't keep up. So they rubber stamp, or they ignore. Either path raises your failure rate. The speed gain evaporates the second you skip the verification layer.

Revert churn. An agent ships a change. Tests pass. Two days later a second agent, or a human, notices a regression. A third commit reverts the first. Average the lead times across that little trio and you get a beautifully short number that hides the fact that nothing new is in production. Agents are especially fond of this with refactors. Rename, rename back, slight reorganization, restore. Motion. No progress.

Silent change failure rate inflation. Agents trigger more incidents than humans, but the incidents tend to be narrow. One endpoint. One tenant. One code path. They don't always trip your major incident dashboard. So your headline failure rate looks fine while a dozen small fires burn under the surface.
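
The revert churn pattern is easy to see with a few lines of arithmetic. This is a sketch on a hypothetical four-commit sequence (rename, rename back, reorganize, restore). The `reverts` field is an assumed annotation, not something your version control gives you for free.

```python
# Hypothetical commits. "reverts" names the earlier commit this one undoes.
commits = [
    {"sha": "c1", "desc": "rename", "lead_time_hours": 2.0, "reverts": None},
    {"sha": "c2", "desc": "rename back", "lead_time_hours": 1.0, "reverts": "c1"},
    {"sha": "c3", "desc": "reorganize", "lead_time_hours": 2.0, "reverts": None},
    {"sha": "c4", "desc": "restore", "lead_time_hours": 1.0, "reverts": "c3"},
]

# The headline number: average lead time across everything that shipped.
mean_lead_time = sum(c["lead_time_hours"] for c in commits) / len(commits)

# What actually survived: commits that are neither reverts nor reverted.
reverted = {c["reverts"] for c in commits if c["reverts"] is not None}
surviving = [c for c in commits if c["reverts"] is None and c["sha"] not in reverted]

print(mean_lead_time, len(surviving))
```

Four deploys, a mean lead time of an hour and a half, and zero surviving changes.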

2. Reviewer load

The count of meaningful review interactions per engineer per week. If your review minutes per engineer fall, your failure rate is about to rise. Watch it with a partner number, reviewer concentration. When eighty percent of your merges are getting approved by the same two engineers, you don't have an agentic productivity win. You've got a reviewer bottleneck wearing a productivity costume.
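
A rough sketch of both numbers, assuming you can export a week of merges with the approving reviewer attached. Counting approvals is a crude stand-in for "meaningful review interactions", and the names and data are invented.

```python
from collections import Counter

# Hypothetical merge log for one week: (pull request id, approving reviewer).
approvals = [
    (101, "dana"), (102, "dana"), (103, "lee"), (104, "dana"), (105, "lee"),
    (106, "dana"), (107, "lee"), (108, "dana"), (109, "sam"), (110, "kim"),
]
engineers = ["dana", "lee", "sam", "kim", "ravi"]  # everyone who could review

def reviews_per_engineer(approvals, engineers):
    """Reviewer load: approvals per engineer, including those who did none."""
    counts = Counter(reviewer for _, reviewer in approvals)
    return {name: counts.get(name, 0) for name in engineers}

def reviewer_concentration(approvals, top_n=2):
    """Share of merges approved by the top_n busiest reviewers."""
    counts = Counter(reviewer for _, reviewer in approvals)
    top = sum(n for _, n in counts.most_common(top_n))
    return top / len(approvals)

load = reviews_per_engineer(approvals, engineers)
concentration = reviewer_concentration(approvals)

print(load, concentration)
```

Here two reviewers approve eight of ten merges, so concentration comes out at 0.8 while one engineer reviews nothing at all.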

3. Rework ratio

The percentage of commits in the last thirty days that revert, amend, or substantially refactor a commit also from the last thirty days. That's your cleanest read on the revert churn pattern. A high rework ratio means your codebase is oscillating, not advancing. I want that number under ten percent on a healthy codebase.
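
One way to compute it, as a sketch. The `reworks` field is an assumption: it names the earlier commit that this commit reverts, amends, or substantially refactors. Deciding what counts as "substantially" is a judgment call this code does not make for you.

```python
from datetime import date, timedelta

# Hypothetical commit log. "reworks" is None for new work.
commits = [
    {"sha": "a", "day": date(2026, 2, 20), "reworks": None},  # older than the window
    {"sha": "b", "day": date(2026, 3, 5), "reworks": None},
    {"sha": "c", "day": date(2026, 3, 7), "reworks": "b"},    # redoes a recent commit
    {"sha": "d", "day": date(2026, 3, 10), "reworks": "a"},   # redoes an old commit
    {"sha": "e", "day": date(2026, 3, 12), "reworks": None},
    {"sha": "f", "day": date(2026, 3, 15), "reworks": None},
]

def rework_ratio(commits, today, window_days=30):
    """Fraction of in-window commits that rework another in-window commit."""
    cutoff = today - timedelta(days=window_days)
    recent = {c["sha"]: c for c in commits if cutoff <= c["day"] <= today}
    if not recent:
        return 0.0
    rework = [c for c in recent.values() if c["reworks"] in recent]
    return len(rework) / len(recent)

ratio = rework_ratio(commits, today=date(2026, 3, 31))
print(ratio)
```

Five commits fall inside the window and one of them reworks another in-window commit, so the ratio is 0.2. Commit `d` does not count, because the commit it touches is older than thirty days.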

4. Time to restore

This one stays exactly where it was in the original DORA stack, but it gets a promotion. It becomes your lead indicator. When a third of your changes are coming from non-human authors, you are going to break things you didn't predict. So if mean time to restore is creeping up, fix that first and let the other numbers catch up. Money spent on observability, error budgets, and incident command pays back faster than anything else you can do when agents are committing to your main branch.
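
A small sketch of the check, assuming incident records with a detection and a restore timestamp. The "three months in a row" rule for creeping up is an arbitrary choice for illustration, and the monthly figures are invented.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records.
incidents = [
    {"detected": datetime(2026, 3, 2, 9, 0), "restored": datetime(2026, 3, 2, 9, 30)},
    {"detected": datetime(2026, 3, 9, 14, 0), "restored": datetime(2026, 3, 9, 15, 30)},
]

def mean_time_to_restore_minutes(incidents):
    return mean((i["restored"] - i["detected"]).total_seconds() / 60 for i in incidents)

def is_creeping_up(monthly_mttr, months=3):
    """True when each of the last `months` values is higher than the one before."""
    recent = monthly_mttr[-months:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))

mttr = mean_time_to_restore_minutes(incidents)
monthly_mttr = [42, 40, 47, 55, 63]  # minutes, oldest first (made up)
creeping = is_creeping_up(monthly_mttr)

print(mttr, creeping)
```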

5. Production confidence score

A composite, intentionally coarse. Test coverage on changed lines. Percentage of changes behind a feature flag at rollout. Percentage of deploys with automated rollback configured. Percentage of deploys with observability hooks added. Roll it into one number whose only job is to tell leadership how risky the average production change actually is. And to keep that number in front of the people approving how much you let agents merge on their own.
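
A sketch of how the roll-up could work. The equal weights, the 0 to 100 scale, and the field names are assumptions. The text above only says the score is a coarse composite of these four inputs, so treat the weighting as something to tune with your own leadership.

```python
# Hypothetical change records, one per production change.
changes = [
    {"covered_lines": 90, "changed_lines": 100, "behind_flag": True,
     "auto_rollback": True, "observability_hooks": True},
    {"covered_lines": 50, "changed_lines": 100, "behind_flag": False,
     "auto_rollback": True, "observability_hooks": False},
    {"covered_lines": 80, "changed_lines": 100, "behind_flag": True,
     "auto_rollback": False, "observability_hooks": False},
    {"covered_lines": 20, "changed_lines": 100, "behind_flag": False,
     "auto_rollback": False, "observability_hooks": False},
]

# Assumed equal weighting across the four components.
WEIGHTS = {"coverage": 0.25, "flagged": 0.25, "rollback": 0.25, "observability": 0.25}

def production_confidence(changes, weights=WEIGHTS):
    n = len(changes)
    components = {
        # Test coverage on changed lines, pooled across all changes.
        "coverage": sum(c["covered_lines"] for c in changes)
                    / sum(c["changed_lines"] for c in changes),
        "flagged": sum(c["behind_flag"] for c in changes) / n,
        "rollback": sum(c["auto_rollback"] for c in changes) / n,
        "observability": sum(c["observability_hooks"] for c in changes) / n,
    }
    score = sum(weights[name] * components[name] for name in weights)
    return round(100 * score), components

score, components = production_confidence(changes)
print(score, components)
```

Returning the components alongside the single number keeps the score coarse for leadership while letting the team see which input is dragging it down.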

That's the engineering half. Outcome latency, reviewer load, rework ratio, time to restore, production confidence. Keep your four originals on the dashboard. Just stop letting anyone be judged on them alone.

Flow: dependency density on the program side

Cross over to the program side, because the same disease has the same cure in a different dialect.

If critical path is the metric that lies in graph-shaped programs, dependency density is the one that tells the truth. And it's simpler than the name makes it sound. For a cluster of work, you count the edges between the items and you divide by the number of items. Edges per node.

The trick is being honest about what counts as an edge. I use four buckets.

  • Code-level edges. Direct API calls, shared libraries, shared schemas. Easy to count off an architecture diagram.

  • Data-level edges. Shared data stores, an event stream two services both consume, a batch job that builds one system's output from another's tables. These get missed constantly, because the contract is implicit.

  • Process-level edges. Shared on-call rotations, shared release trains, a shared change advisory board. If two teams can't ship independently because they share a release window, that's an edge.

  • Human-level edges. Two work items that need the same person to approve, the same architect to design, the same legal review to clear. These look soft. They bite the hardest, because they don't have a ticket attached to them.

Score the four, add them up, divide by nodes. Now the scale.

Density | What to do
Below 1.0 | Cluster is healthy
1.0 to 2.0 | Active monitoring, weekly review of which edges changed
Above 2.0 | Stop adding scope, start refactoring
Above 3.0 | The cluster is not shipping on the date. The only question is how early you have the honest conversation.
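
The count and the scale above fit in a few lines. This is a sketch on an invented four-item cluster. The item names and edges are hypothetical, and where exactly 1.0, 2.0, and 3.0 fall is my reading of the scale, not something the scale spells out.

```python
from collections import Counter

# Hypothetical cluster: work items are nodes, each edge is (item, item, bucket).
nodes = {"ledger", "reconciler", "fraud-rules", "rate-limiter"}
edges = [
    ("ledger", "reconciler", "code"),
    ("fraud-rules", "rate-limiter", "code"),
    ("ledger", "reconciler", "data"),
    ("ledger", "fraud-rules", "data"),
    ("reconciler", "rate-limiter", "process"),
    ("ledger", "fraud-rules", "human"),
    ("reconciler", "fraud-rules", "human"),
]

def dependency_density(nodes, edges):
    """Edges per node, counting all four buckets together."""
    return len(edges) / len(nodes)

def reading(density):
    # Assumed boundary handling: exactly 2.0 still reads as monitoring.
    if density > 3.0:
        return "not shipping on the date"
    if density > 2.0:
        return "stop adding scope, start refactoring"
    if density >= 1.0:
        return "active monitoring"
    return "healthy"

by_bucket = Counter(bucket for _, _, bucket in edges)
density = dependency_density(nodes, edges)

print(dict(by_bucket), density, reading(density))
```

Seven edges over four items gives 1.75, which lands in active monitoring. Keeping the bucket on each edge also shows you where the coupling comes from. In this invented cluster, the human edges are as numerous as the code edges.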

Here's the part most people miss. A single density score is a snapshot. The trend is the story. I've watched three teams over twelve sprints tell three completely different stories with this one number.

The first team sat around 1.5 the whole time. They hit their dates, give or take a sprint. The second team crossed the line at sprint five and kept climbing. By sprint seven you should have called the re-baseline. By sprint nine the slip was already locked in, you were just choosing how to phrase it. The third team did the work nobody claps for. Actively decoupling. Retiring shared dependencies. Paying down the integration tax. Their density went down over time. They didn't ship fewer features. They shipped them more predictably.
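
Here is a sketch of reading the trend instead of the snapshot. The three series are invented to match the shapes described above. The 0.05-per-sprint threshold is an assumption, and so is treating 2.0 as "the line".

```python
# Hypothetical density per sprint, twelve sprints each.
steady = [1.5, 1.4, 1.6, 1.5, 1.5, 1.6, 1.4, 1.5, 1.5, 1.6, 1.5, 1.4]
climbing = [1.4, 1.6, 1.7, 1.9, 2.1, 2.3, 2.5, 2.7, 2.9, 3.0, 3.2, 3.3]
decoupling = [2.2, 2.1, 2.0, 1.9, 1.8, 1.7, 1.6, 1.5, 1.4, 1.3, 1.2, 1.1]

def slope(series):
    """Least-squares slope: average change in density per sprint."""
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def trend(series, threshold=0.05):
    s = slope(series)
    if s > threshold:
        return "climbing"
    if s < -threshold:
        return "falling"
    return "steady"

def first_sprint_above(series, line=2.0):
    """1-based sprint number where density first exceeds the line, else None."""
    for sprint, value in enumerate(series, start=1):
        if value > line:
            return sprint
    return None

for name, series in [("steady", steady), ("climbing", climbing), ("decoupling", decoupling)]:
    print(name, trend(series), first_sprint_above(series))
```

The slope gives you the direction. The crossing sprint gives you the moment the re-baseline conversation should have started.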

That trend view does something political, too. It finally makes decoupling work visible. Platform investment, integration cleanup, contract hardening, none of it shows up in feature throughput. It shows up here, as density going down. Once you chart it, your engineers stop having to fight for that work like it's overhead. It becomes a number leadership can watch moving the right way.

Step back and look at the two halves together. Outcome latency on the engineering side. Dependency density on the program side. Different disciplines. Same job. Both of them refuse to let motion pose as progress. One asks, did the code we shipped actually become value? The other asks, will the coupling in this program even let us ship at all?

And agentic AI makes both worse, in exactly parallel ways. On the engineering side it inflates your deploy count with work no customer asked for. On the program side it drags in a feature store, an eval harness, a model gateway, a routing layer, a guardrail service, a logging pipeline, and a human review interface, and every one of those is a cluster with its own edges into your existing platform. The teams shipping agents well aren't the ones with the best model. They're the ones who treated the integration density as the program risk. Because model quality is a feature problem. Density is a delivery problem. Get the density right, and swapping models gets boring. Get it wrong, and every model upgrade turns into a six-week regression hunt.

Outcome: 200 deploys a week and a broken program

A composite case. The shape is real, the names are changed.

A program I got pulled into. On paper, a star. The engineering org had just crossed two hundred deploys a week and was reporting elite on every DORA number. The program dashboard was green. Critical path landed inside the deadline. By every headline metric in the building, this thing was a triumph. And the sponsor could feel that it wasn't.

We did two things. Two spreadsheets, basically.

First, on the engineering side, we segmented change failure rate by author type. Human. Agent assisted. Agent authored. The blended number they'd been reporting was fine. The agent-authored number was almost three times higher. I have watched this exact moment flip an executive team in a single meeting. They'd been approving wider and wider agent merge autonomy off a blended number that was hiding the one thing they most needed to see. We also ran the rework ratio. It came back at twenty-six percent. More than one in four commits in the last month was undoing or redoing a commit from the same month. That's not a delivery engine. That's a treadmill.
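
The segmentation itself is a small group-by. This sketch uses invented counts, not the case data, and assumes each change is already labeled with an author type and whether it caused a failure.

```python
from collections import defaultdict

# Hypothetical month of changes: (author_type, caused_failure).
changes = (
    [("human", False)] * 95 + [("human", True)] * 5
    + [("agent_assisted", False)] * 57 + [("agent_assisted", True)] * 3
    + [("agent_authored", False)] * 34 + [("agent_authored", True)] * 6
)

def failure_rates(changes):
    """Change failure rate per author type, plus the blended rate."""
    totals, failures = defaultdict(int), defaultdict(int)
    for author_type, failed in changes:
        totals[author_type] += 1
        failures[author_type] += failed  # True counts as 1
    rates = {kind: failures[kind] / totals[kind] for kind in totals}
    rates["blended"] = sum(failures.values()) / sum(totals.values())
    return rates

rates = failure_rates(changes)
print(rates)
```

On these numbers the blended rate is 7 percent, which reads fine. The agent-authored rate is 15 percent, three times the human rate, and the blended number hides it because agent-authored changes are only a fifth of the volume.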

Second, on the program side, we scored the three highest-risk clusters for dependency density. Two were fine, around 1.4. The third, the one wiring the new agent into the payments surface, came back at 3.4. Critical path had that cluster colored green, because the longest chain through it happened to be short. Density knew the truth. That cluster was never going to make the date.

We didn't work harder. We made three structural moves.

We collapsed one pair of services that had become a distributed monolith. Every change to one forced a coordinated change to the other. That's not two services. That's one service in two repositories. Merging them deleted a whole category of edges.

We decoupled the payments integration behind a stable contract, so the agent side and the ledger side could move without a six-way negotiation every sprint.

And we cut scope. Specifically the work items that contributed the most edges, even the small ones. Because what an item adds to density is its edges, not its size. Pull one high-edge item, and the cluster's density drops further than pulling three low-edge ones.
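
You can check that scope-cut logic on a toy graph. With density defined as edges per node, removing an item only lowers density when it carries more edges than the current density. The eight-item cluster below is invented for illustration.

```python
# Hypothetical cluster: one hub item wired into everything, plus a chain of neighbors.
nodes = {"hub", "a", "b", "c", "d", "e", "f", "g"}
edges = [
    ("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"),
    ("hub", "e"), ("hub", "f"), ("hub", "g"),
    ("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f"),
    ("a", "c"), ("b", "d"), ("c", "e"), ("d", "f"),
]

def density(nodes, edges):
    return len(edges) / len(nodes)

def cut_scope(nodes, edges, dropped):
    """Remove the dropped items and every edge that touches one of them."""
    kept_nodes = nodes - dropped
    kept_edges = [(x, y) for x, y in edges if x not in dropped and y not in dropped]
    return kept_nodes, kept_edges

before = density(nodes, edges)
after_hub = density(*cut_scope(nodes, edges, {"hub"}))
after_three_small = density(*cut_scope(nodes, edges, {"a", "f", "g"}))

print(before, round(after_hub, 2), after_three_small)
```

The cluster starts at 2.0. Pulling the single hub item takes it to about 1.29. Pulling three low-edge items removes just as many edges, but also three nodes, so it only gets to 1.8.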

The numbers landed here:

  • Rework ratio: 26% → 9%.
  • Agent-authored failure rate came back in line with the human rate, once we rebuilt the review process to actually look at agent code instead of rubber-stamping it.
  • Payments cluster density: 3.4 → 1.9 over four sprints.
  • The cluster shipped. Late, but on a date we actually predicted, six weeks earlier than the original lie would have let us admit it.
  • Deploy count went down on purpose. Because half of it was motion. When we started measuring outcome latency instead, the picture got honest, and honest was slower on paper and faster in reality.

The second-order effect, and it's the one I care about most. The conversation changed. Before, every status meeting was an argument about velocity. Are we going fast enough? Why is engineering slow? After, the argument was about value. That cluster's at 3.4, here's what it costs to bring it down, here's the date that buys us. You stop arguing about how busy everyone is, and you start arguing about what's actually going to ship. Governance gets sharper, because you're finally pointing at a number that means something.

The migration plan

Don't swap your dashboards overnight. The migration that works runs about three quarters.

Quarter one. Instrument outcome latency and rework ratio right next to the old numbers, and don't act on them yet. You're calibrating. The first month is noisy.

Quarter two. Add reviewer load and the confidence score, and start segmenting failure rate by author type. That's usually the quarter leadership wakes up.

Quarter three. Demote deploy frequency and critical path from headline to context. They stay on the dashboard. Nobody's judged on them alone anymore.

The takeaway

Pick one program. Not the portfolio.

Two spreadsheets. That's the whole ask. On the first one, take last month's changes and segment your failure rate by author type. Human, agent assisted, agent authored. Just look at the agent-authored number sitting next to the blended one you've been reporting. On the second, pick your three highest-risk clusters, count the edges (code, data, process, people), and divide by the nodes. Write the three density scores down. Then do it again next week, and watch which direction they move.

You're going to find one number that's been flattering you, and one number that's been trying to warn you.

That's the whole discipline. Trust the number that hurts.

What's next

A future episode picks up the governance version of this conversation. Once an agent authors a third of your changes and chains decisions across your systems, the old question, "is this model accurate?", stops being enough. The new question is "what is this agent allowed to do, and who's accountable when it acts?" We'll work through what that looks like when you're the one who has to sign the policy.

If today's episode helped, send it to the engineering leader, the program director, or the platform lead in your org who's staring at a green dashboard with a bad feeling about it.

Listen

  • Spotify · open episode 2 · follow the show
  • Apple Podcasts · link populates once the show passes Apple's review (claim is in flight)
  • Amazon Music · link populates once Amazon finishes processing the feed
  • Pocket Casts / Overcast / Castbox / Podcast Addict · auto-discovered from the RSS feed once Apple goes live
  • RSS · subscribe in any podcast app at the show's RSS feed

Get notified when episode 3 drops

Drop your email on the Valuestream show page and you'll get a single line in your inbox the morning episode 3 publishes. Same list as the blog notifications. No fluff in between.

This is Valuestream. I'm Rick Pollick. Trust the number that hurts.
