The Productivity Illusion: Why You Can't Trust Your Own Sense of the AI Speedup

A partner at a strategic communications firm will tell you, without hesitation, that AI cut her brief-writing time in half. She is articulate about it. The first draft used to take an afternoon; now it takes an hour. She has said this in a pitch meeting. She believes it completely. She has also never once timed it.

This is the most common — and most expensive — mistake in the way senior knowledge workers reason about AI. Not a mistake about prompting, or models, or which tool to buy. A mistake about measurement. The felt speedup and the measured speedup are two different numbers, and the gap between them is the entire story. Most people are managing the first number and have never looked at the second.

For most of the AI conversation, the felt number has been allowed to stand in for the real one, because the real one is genuinely hard to get. Then a research lab built the experiment carefully enough to get it. The result was not subtle.

The number that surprised the researchers

In mid-2025, the nonprofit research group METR ran a randomized controlled trial that almost nobody in the AI-productivity discourse had attempted: it measured the actual speedup, with a stopwatch, on real work. The setup was clean. Sixteen experienced open-source developers — people working in codebases they had maintained for years, a median of five years and tens of thousands of stars — took 246 real tasks from their own projects. Each task was randomly assigned to one of two arms: AI-allowed or AI-disallowed. The researchers timed everything.

Before the study, these developers expected AI to make them roughly 24% faster. After the study — having lived through it — they still believed AI had sped them up by about 20%. The stopwatch said otherwise.

What it felt like

+20%

How much faster the developers believed AI had made them — after finishing the work.

They predicted +24% going in. Experience barely moved the estimate.

What the stopwatch said

−19%

How much slower they actually were on AI-allowed tasks.

Same people. Same week. Same projects. A 39-point gap between feeling and fact.

Read those two cards together, because the pairing is the finding. These were not novices fumbling with an unfamiliar tool. They were domain experts, on home turf, who had every incentive to be honest with themselves — and they were off by 39 percentage points in the wrong direction. They felt a tailwind. They were walking into a headwind. The full results are in METR's July 2025 write-up and the accompanying paper.

The instinct, on first reading, is to argue with the result. Sixteen developers is a small sample. Open-source maintenance is a specific kind of work. The models were early-2025 models and have improved since. All true, and all beside the point. The point is not that AI universally makes everyone 19% slower — it doesn't, and other studies find genuine gains on other tasks. The point is that the people in the study could not feel the difference between a 19% loss and the 20% gain they reported. Their internal speedometer was not just imprecise. It was pointed the wrong way. If your sense of your own speedup can be that wrong on familiar work, the sensation of being fast is not evidence of anything. It is the thing that needs to be checked.

Where the time actually goes

How does someone end up feeling 20% faster while being 19% slower? The answer is that the felt speedup is real — but it's measuring the wrong segment of the work. The model produces a draft in seconds, and that moment is fast. It feels like progress because it is progress, locally. What the feeling cannot see is everything that happens after the draft lands: the reading, the checking, the second-guessing, the quiet rewrite of the third paragraph that was subtly, plausibly wrong.

There are two taxes hiding in that gap, and naming them separately is the first step to managing them.

The verification tax is the cost of confirming the output is correct before you rely on it. It is the part of the work that the generation step quietly transfers onto you. A model writes a paragraph in two seconds and hands you a paragraph that takes four minutes to verify — and the four minutes don't feel like work, because they feel like reading, not writing. But verification is load-bearing. Skip it and the errors ship. Do it honestly and the clock keeps running long after the draft appeared.

The rework tax is what you pay when verification finds a problem: the rewriting, the re-prompting, the fixing of the factual error or the voice mismatch or the structural choice that doesn't survive contact with the actual brief. This is the tax that Workday's 2026 “Beyond Productivity” research put a number on: 85% of employees say AI saves them one to seven hours a week, and nearly 40% of those savings get burned back in rework. Only 14% of workers report consistently positive net outcomes. The gross savings are real. They are also not what lands in your week.

The reason these taxes are nearly invisible is that they don't arrive as a single bill. They accrue, one small confirmation and one small correction at a time, in the part of the workflow that feels like diligence rather than cost. Watch them stack:

Generation · the part you feel

The draft appears in seconds. This is the entire basis of the “it's so much faster” sensation — and it is genuinely fast.

Reading & verification · the tax you don't count

You read the whole thing closely enough to trust it. Feels like reviewing. Functions like rework you haven't done yet.

Correction & re-prompting · the rework tax

The third paragraph was wrong in a firm-specific way. You rewrite it, or re-prompt and re-read. Workday's ~40% lives on this rung.

Context re-assembly · the tax you pay every session

Next time, the tool remembers nothing. You re-explain the client, the voice, the constraints. The meter resets to zero.

Notice the color gradient. Rung 1 is the only one that registers as a speedup. Rungs 2 through 4 are where the speedup goes to die, and they are precisely the rungs your sense of progress is worst at seeing — they wear the costume of careful work. The METR developers weren't lying about feeling faster. They were feeling rung 1 and forgetting to bill themselves for rungs 2 through 4.

Net time saved: the only honest metric

Once you can see the taxes, the right metric writes itself. Gross savings — how fast the draft appeared — is the number everyone quotes and the number that means nothing. The number that matters is what is left after the taxes are paid:

The only equation that matters

Net time saved = gross generation savings − verification tax − rework tax − context re-assembly

When this number is positive, AI made you faster. When it is negative — as it was for the METR developers — you were slower, no matter how fast the draft felt. Everything in this piece is in service of moving one number from negative to positive, and being able to prove which side of zero you are on.

Work an example with honest numbers. You have a recurring deliverable — a client briefing memo — that takes you 90 minutes by hand. With AI, the first draft arrives in 5 minutes. That is the moment that feels miraculous, and it is where most people stop counting. Now keep the meter running. Reading the draft closely enough to trust it: 15 minutes. Checking the two statistics it cited, one of which turns out to be a plausible fabrication: 20 minutes. Rewriting the section that came back in generic-consultant voice instead of your firm's: 25 minutes. Re-explaining the client's positioning to the tool, because it remembered nothing from last week: 10 minutes.

The 90-minute memo, with AI

Minutes

Generation — first draft appears

Reading & verification — trust the draft

+15

Citation checking — one stat was fabricated

+20

Rework — rewrite generic voice to house voice

+25

Context re-assembly — re-explain the client

+10

Total with AI

Net time saved vs. 90 by hand

In this version, AI did help — 15 minutes net, a genuine 17% gain. But notice how fragile that is. The felt speedup was 18× (90 minutes down to a 5-minute draft). The real speedup was 1.2×. And the whole positive result hinges on the taxes staying small. Let the citation check find three fabrications instead of one, or the voice rework run to 45 minutes because the draft was confidently off-tone, and the total crosses 90. You are now in METR territory: slower, while certain you were faster. The difference between a real gain and a real loss is not the generation step — that part is always fast. It is the size of the taxes. Which means the entire game is making the taxes smaller, not making the generation faster.

A 30-day self-measurement protocol

You cannot manage a number you have never measured, and you cannot trust the felt number — that is the whole lesson of the METR result. So measure it. Not forever, and not on everything. Pick one recurring deliverable, instrument it for a month, and let the stopwatch tell you what your intuition can't. The protocol is deliberately boring, because boring is what survives a busy week.

Week 0 · Pick one deliverable and baseline it

Choose something you produce repeatedly — a memo, a release, a section of a proposal. Do three of them the old way, AI-free, and time each end to end. That average is your baseline. Without it, you have nothing to compare against and you're back to vibes.

Weeks 1–4 · Time both arms, honestly

For each instance, log the AI-assisted total — and crucially, keep the meter running through verification, correction, and re-prompting. The clock stops when the deliverable is ready to ship, not when the draft appears. Occasionally do one the old way to keep the baseline current as you get more practiced.

Throughout · Log every error and rework event

One line per correction: what was wrong (fabricated fact, wrong tone, bad structure), how you caught it, how long the fix took. This log is the most valuable artifact you'll produce — it tells you which tax is eating you and therefore what to fix.

Day 30 · Compute net, then compare it to what you felt

Net = baseline minus AI-assisted total, averaged. Then write down, before you look, how much faster you felt AI made you. The gap between your felt number and your measured number is your personal version of METR's 39 points — and it is the single most useful thing you will learn about how you work with AI.

Most people who run this protocol discover two things. First, their felt speedup was inflated — sometimes dramatically, occasionally inverted. Second, and more usefully, the error log points at exactly one tax doing most of the damage. For the majority of knowledge workers, that tax is verification: the time spent confirming that plausible-sounding output is actually true. Which is the lever worth pulling.

Why grounded systems change the arithmetic

Here is the move that turns the whole analysis from a counsel of despair into a plan. If net time saved is gross savings minus the taxes, and the generation step is already nearly free, then you don't make AI faster by making it generate faster. You make it faster by collapsing the taxes — and the largest, most reducible tax is verification.

Verification is expensive for a specific, fixable reason: an ungrounded model generates plausible text from training data, so the only way to know whether a claim is true is to go find the source yourself and check it by hand. That is the rework problem in miniature. You are re-doing the research the model pretended to do. The verification cost approaches the cost of doing the task without AI at all — which is exactly how a felt speedup becomes a measured loss.

Grounding changes the arithmetic at the root. When a system retrieves from your own authoritative documents and attaches a citation to each claim, verification stops being “go re-research this” and becomes “click the citation and confirm the passage says what the sentence claims.” The first is minutes per claim. The second is seconds. That single substitution is the difference between the +20% you feel and the +20% you can actually measure. It doesn't make the model smarter. It makes the tax payable in seconds instead of minutes — and the taxes were the whole story.

In Hone Studio

This is the structural lever, built in. When the Assistant runs in Knowledge Base mode, every answer is grounded in your own uploaded documents and arrives with inline citations, plus a sources panel showing exactly which entries were retrieved. So the verification tax drops from “go find the source and redo the research” to “click the citation and read the passage.” Citations are assistive — they help you locate and confirm sources, not a guarantee of correctness, and every output is still a draft a person approves — but moving verification from minutes to seconds is precisely how a felt speedup turns into a measured one.

The same logic applies to the other two taxes. The rework tax shrinks when the model already knows your firm's voice and prior positions, so the draft arrives in-house-shaped instead of generic-consultant — less to rewrite. The context re-assembly tax disappears when the system is stateful rather than stateless: when the Knowledge Base holds your prior work and Memory holds the explicit rules and preferences you've taught it, you stop re-explaining the client to a tool that forgot you last week. Every one of these is the same structural move — moving a cost from your week into the system's standing infrastructure — and every one of them shows up on the only line that matters: net time saved.

The goal was never to feel fast

The seductive thing about AI is that it makes the felt-productivity number go up immediately and effortlessly. The draft appears, the sensation of speed arrives, and the brain files it as a win before a single tax has been paid. The METR developers are not unusual in this. They are all of us. The sensation of being fast is so reliable, and so decoupled from the reality of being fast, that it has become the default way people evaluate these tools — which is to say, people are evaluating the one number that doesn't matter and never measuring the one that does.

The partner who is sure AI halved her brief time might be right. She might also be the METR study in miniature — 18× faster at the draft, slower at the deliverable, certain of a gain she is actually giving back in verification and rework she doesn't count as cost. The only way to know is the unglamorous one: baseline it, time both arms, log the rework, compute the net. Then build the workflow that makes the taxes small — grounding for verification, voice memory for rework, persistent context for re-assembly — and measure again.

Because the goal was never to feel fast. Feeling fast is free, and worth exactly what it costs. The goal is to be fast — and, more than that, to be able to prove it, with a number you actually measured, on the only line that ever mattered.