When AI Helps, When It Hurts: A Decision Tree for Knowledge Workers

For a long time the AI conversation has been “is it good or bad.” The honest answer, for high-stakes knowledge work, is that it depends entirely on the task. The same model, the same operator, the same week — confidently better on one task and measurably worse on the next.

The most-cited controlled study of generative AI in professional services is the BCG × Harvard “Navigating the Jagged Technological Frontier” experiment. 758 BCG consultants were randomly assigned to work either with GPT-4 or the way they always had. The results split sharply by task type.

Inside AI's frontier: +40%

Higher-quality output on creative, ideation, and synthesis tasks. Consultants using AI completed 12.2% more tasks, 25.1% faster.

Outside AI's frontier: −19 percentage points

Less likely to reach the correct answer on judgment-heavy quantitative reasoning. Same operator. Same week. Different task. Worse than no AI at all.

The 40% rework problem most firms are seeing is not because AI is bad. It is because the same person is reaching for the same tool on tasks where it lifts them and tasks where it sinks them, and they cannot tell which is which until the deliverable goes out the door. The skill that separates senior practitioners who get durable value from AI from those who don't is not better prompting. It is task triage.

Two questions, four zones, and a task-by-task map. You can run the whole thing in under sixty seconds before any task you are about to hand to a model.

The two questions that decide everything

Before you reach for AI on any task, ask:

Question 01

Is the output verifiable in under five minutes?

Not “is it possible to verify it” — almost any output is verifiable given infinite time. The question is whether someone could spot-check it in less time than it would have taken to do the task without AI. If verification takes longer than the original work, AI is a productivity loss even when the output is correct.
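The arithmetic behind Question 01 can be sketched in a few lines. This is a minimal illustration, not a measurement: the function name `net_minutes_saved` and all of the minute values are hypothetical, chosen only to show how verification time can turn a saving into a loss.

```python
def net_minutes_saved(manual_minutes: float, ai_minutes: float, verify_minutes: float) -> float:
    """Time saved by using AI once verification is counted.

    A negative result means the AI route was slower than doing the task by hand.
    """
    return manual_minutes - (ai_minutes + verify_minutes)


# Hypothetical task: 45 minutes by hand, 5 minutes to prompt and read the output.
quick_check = net_minutes_saved(manual_minutes=45, ai_minutes=5, verify_minutes=4)
slow_check = net_minutes_saved(manual_minutes=45, ai_minutes=5, verify_minutes=50)

print(quick_check)  # 36: a four-minute spot-check keeps most of the saving
print(slow_check)   # -10: verification outlasts the original work, so AI is a net loss
```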

Question 02

What is the asymmetric cost of being wrong?

If a wrong answer costs you ten minutes of editing, the asymmetry is small. If a wrong answer ends up in a press release, a regulatory filing, a quote attributed to a real human, or a client deliverable, the asymmetry can be career-defining. The verification rung you climb is set by the downside, not the upside.

The asymmetry is the part most teams underweight. Senior practitioners intuitively price the upside of a saved hour but routinely under-price the downside of a fabricated quote, a confabulated statistic, or an attribution that isn't quite right. The two sides of the ledger are not the same shape.

The asymmetry, visualized

If right: 30 min saved

If wrong: 8+ hours rework, lost trust, possible retraction

Verification effort is calibrated to the bottom bar, not the top one. That single move — pricing the downside honestly — is most of the work.
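One way to price the downside is to ask how often the output can be wrong before the expected value goes negative. The sketch below uses the two figures from the bars above (30 minutes saved, 8 hours of rework taken as the lower bound of "8+") and ignores lost trust and retraction, which have no minute value here. The function names and the 10% error rate are illustrative assumptions.

```python
def break_even_error_rate(minutes_saved_if_right: float, minutes_lost_if_wrong: float) -> float:
    """Error probability at which the expected gain equals the expected loss."""
    return minutes_saved_if_right / (minutes_saved_if_right + minutes_lost_if_wrong)


def expected_minutes(p_wrong: float, minutes_saved_if_right: float, minutes_lost_if_wrong: float) -> float:
    """Expected net minutes per task for a given chance of a wrong output."""
    return (1 - p_wrong) * minutes_saved_if_right - p_wrong * minutes_lost_if_wrong


saved = 30        # minutes saved when the output is right (from the text)
lost = 8 * 60     # minutes of rework when it is wrong (lower bound of "8+ hours")

threshold = break_even_error_rate(saved, lost)
print(f"{threshold:.1%}")  # 5.9%

# Assumed 10% error rate, for illustration only: the expected value is negative.
print(round(expected_minutes(0.10, saved, lost), 1))  # -21.0
```

Under these assumptions, an unverified output has to be wrong less than about one time in seventeen just to break even, which is why verification effort follows the bottom bar.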

Together, the two questions classify almost any task into one of four zones.

The four-zone map

Plot any knowledge-work task on two axes — verifiability on the horizontal, stakes on the vertical — and four meaningfully different zones emerge. Each zone has its own operating discipline.

The decision map: Stakes × Verifiability

Vertical axis: stakes, high (top) to low (bottom). Horizontal axis: verifiability, hard (left) to easy (right).

Rose (high stakes, hard to verify) · Don't AI alone

Quote attribution, legal interpretation, regulatory math, source-of-truth research.

Verification approaches the cost of doing it by hand. Asymmetric downside.

Amber (high stakes, easy to verify) · AI + verification

Client-facing drafts, summary statistics, framework-driven analysis, cited research.

AI is faster, but verification is non-negotiable, every time, no exceptions.

Slate (low stakes, hard to verify) · AI carefully

Brainstorm directions, generated options, exploratory analysis, internal ideation.

Output is suggestive rather than load-bearing. Be skeptical; the downside is bounded.

Emerald (low stakes, easy to verify) · Use AI freely

Reformatting, simple translation, code refactoring, first-draft brainstorms, table cleanup.

Output is obviously right or wrong. A wrong answer costs minutes. Most real productivity gains live here.

The error most firms make is reaching for AI on rose-zone tasks because the time savings look attractive. The math rarely works. A 30-minute saving on a quote that turns out to be fabricated can cost a week of damage control, the relationship with the source, and a permanent dent in trust. The asymmetry is doing the work, not the model.

Done well, the four-zone map is not a rulebook. It is a 60-second pre-flight check that runs before any AI-assisted task and re-runs whenever the stakes or the verification cost changes mid-task.
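The pre-flight check can be written down as a small function. This is a minimal sketch, assuming the two questions reduce to a verification-time estimate and a yes/no stakes call. The names `Zone` and `triage` are hypothetical, and the five-minute default is taken from Question 01.

```python
from enum import Enum


class Zone(Enum):
    ROSE = "Don't AI alone"
    AMBER = "AI + verification"
    SLATE = "AI carefully"
    EMERALD = "Use AI freely"


def triage(verify_minutes: float, high_stakes: bool, threshold_minutes: float = 5.0) -> Zone:
    """Place a task on the stakes x verifiability map."""
    verifiable = verify_minutes < threshold_minutes
    if high_stakes:
        return Zone.AMBER if verifiable else Zone.ROSE
    return Zone.EMERALD if verifiable else Zone.SLATE


# An internal table cleanup that takes two minutes to check.
print(triage(verify_minutes=2, high_stakes=False).name)  # EMERALD

# The same output is now going to a client: re-run the check.
print(triage(verify_minutes=2, high_stakes=True).name)   # AMBER
```

Re-running the function when an input changes mirrors the point above: the zone belongs to the task in its current context, not to the tool.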

Task by task: where each kind of work lives

The map is abstract; the work is concrete. The categories below cover most of what knowledge workers actually do in a week. The zone shifts within a category depending on whether the output is internal, client-facing, or attributed to a third party.

Task-by-task zone map (each task is listed with its Emerald, Amber, and Rose examples)

Drafting
Emerald: Internal memos, agenda outlines, ideation drafts.
Amber: Client-facing first drafts you'll re-structure.
Rose: Attributed quotes, legal language, compliance claims.

Research
Emerald: Landscape scans, summarize-this-paper with paper provided.
Amber: Citation-backed claims you'll publish.
Rose: Source-of-truth facts you'll act on directly.

Analysis
Emerald: Hypothesis generation, “five ways to think about X.”
Amber: Framework-driven analysis with verified inputs.
Rose: Quantitative analysis where digits matter.

Summarization
Emerald: Long-form to short-form when the model has the source.
Amber: Meeting notes, action items, decision logs.
Rose: Summary that becomes the basis for action.

Decisions
Emerald: Generate options, list pros and cons.
Amber: Recommendation as a hypothesis, not a verdict.
Rose: The decision itself. That is your job.

Attribution
Emerald, Amber: none.
Rose: Anything attributed to a real human. Always. Get the quote.

The most important row is the last one. The fastest way to do real damage with AI is to put words in someone's mouth they did not say. Even when attribution-style language is scrubbed from the prompt, the model can confabulate quotes that read plausibly enough to slip past a casual review. There is no version of this task that lives anywhere but rose. If you need a quote, get it from the human.
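A task map like this is easy to keep as data, so it can be edited as your own work and the models change. The sketch below transcribes the rows from the task map into a dictionary; the names `TASK_MAP` and `zones_available` are hypothetical, and the structure is one possible layout rather than a prescribed one.

```python
# Zone examples per task category, transcribed from the task map in this article.
TASK_MAP = {
    "drafting": {
        "emerald": "Internal memos, agenda outlines, ideation drafts",
        "amber": "Client-facing first drafts you'll re-structure",
        "rose": "Attributed quotes, legal language, compliance claims",
    },
    "research": {
        "emerald": "Landscape scans, summarize-this-paper with paper provided",
        "amber": "Citation-backed claims you'll publish",
        "rose": "Source-of-truth facts you'll act on directly",
    },
    "analysis": {
        "emerald": "Hypothesis generation",
        "amber": "Framework-driven analysis with verified inputs",
        "rose": "Quantitative analysis where digits matter",
    },
    "summarization": {
        "emerald": "Long-form to short-form when the model has the source",
        "amber": "Meeting notes, action items, decision logs",
        "rose": "Summary that becomes the basis for action",
    },
    "decisions": {
        "emerald": "Generate options, list pros and cons",
        "amber": "Recommendation as a hypothesis, not a verdict",
        "rose": "The decision itself",
    },
    # Attribution has no emerald or amber entry: it only ever lives in rose.
    "attribution": {
        "rose": "Anything attributed to a real human",
    },
}


def zones_available(category: str) -> list[str]:
    """Zones in which a task category has at least one listed example."""
    return [zone for zone in ("emerald", "amber", "rose") if zone in TASK_MAP[category]]


print(zones_available("drafting"))     # ['emerald', 'amber', 'rose']
print(zones_available("attribution"))  # ['rose']
```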

In Hone Studio

The Knowledge Base + retrieval-augmented generation shifts the verifiability axis meaningfully. When the model is grounded in your own documents and serves outputs with citations attached, many amber-zone research tasks become emerald: the model is not generating plausible-sounding text from training data, it is quoting from your sources and pointing at them. Verification collapses from “look up every claim” to “click the citation.”

The verification ladder

For any amber-zone task, the verification step decides whether AI was a net positive. Different tasks call for different rungs of the ladder. The discipline is matching the rung to the stakes — not defaulting to the easiest rung because the output looked good.

Spot-check · 10 seconds

Read it and ask “does this look approximately right?” Sufficient for emerald and low-stakes amber work: reformatting, refactoring familiar code.

Cross-reference · 60 seconds

Compare against one independent source. Sufficient for moderate amber: framework-driven analysis, structural drafts.

Original-source check · 5 minutes

For every load-bearing claim, verify against the primary source. Floor for client-facing deliverables.

Expert review · variable

A human with subject-matter authority signs off. Required for rose-adjacent tasks where AI is scaffolding, not the deliverable.

Rung 1 on a rung-3 task is how the 40% rework problem gets generated. Rung 4 on a rung-1 task is how AI productivity gains get eaten by verification overhead. Calibration is the whole game.
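The two failure modes in that paragraph are a comparison between the rung a task needs and the rung it actually got. Here is a minimal sketch under that reading. The names `Rung` and `calibration` and the returned strings are hypothetical, and deciding which rung a task requires is still a judgment call the code does not make.

```python
from enum import IntEnum


class Rung(IntEnum):
    SPOT_CHECK = 1        # about 10 seconds
    CROSS_REFERENCE = 2   # about 60 seconds
    ORIGINAL_SOURCE = 3   # about 5 minutes
    EXPERT_REVIEW = 4     # variable


def calibration(applied: Rung, required: Rung) -> str:
    """Compare the verification actually done with what the stakes call for."""
    if applied < required:
        return "under-verified: rework risk"
    if applied > required:
        return "over-verified: overhead eats the gain"
    return "calibrated"


# A client-facing deliverable that only got a quick read.
print(calibration(Rung.SPOT_CHECK, Rung.ORIGINAL_SOURCE))  # under-verified: rework risk

# A reformatting job sent for expert sign-off.
print(calibration(Rung.EXPERT_REVIEW, Rung.SPOT_CHECK))    # over-verified: overhead eats the gain
```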

When memory and retrieval are in the loop

Most of this calculus assumes vanilla AI: a chat window, no document context, no memory of your past work. The four-zone map is built around that default. When the model is operating with retrieval against your own documents and persistent memory of your firm's voice and prior work, the geometry shifts.

The shift is asymmetric. The emerald and amber zones widen meaningfully, because the failure mode that creates the worst rework — confabulation of plausible-but-wrong specifics — is the failure mode that retrieval most directly suppresses. Citations move verification from “look up every claim” to “click the link.” Memory means the model has already absorbed your firm's voice, positioning, and prior client context, so the output arrives in-house-shaped rather than generic. The rose zone does not dissolve — quote attribution, legal interpretation, and the decision itself are still your work — but it gets smaller.

In Hone Studio

Hone Studio is built around this shift. The Assistant runs against a Knowledge Base of your firm's actual documents — playbooks, past deliverables, client materials — using retrieval-augmented generation and hypothetical-document expansion to find the right context before answering. Memory carries your firm's voice and accumulated institutional knowledge across sessions. The result is that the everyday user sends short, conversational prompts and gets back work that lives in the emerald or low-amber zone for tasks that would have been amber or rose without grounding. The triage discipline still matters. The number of tasks that survive it goes up.

The decision in one diagram

Every framework above collapses to four leaves. Before you reach for AI on your next task, walk this tree in your head:

The triage in one diagram

New task · reach for AI?

  Verifiable in < 5 minutes?

    No → High stakes?
      Yes → Rose · Don't AI alone
      No → Slate · AI carefully

    Yes → High stakes?
      Yes → Amber · AI + verify
      No → Emerald · AI freely

The skill that lasts

The wrong way to read this piece is as a set of rules. Two years from now the model will be smarter, the verifiability of certain tasks will be different, and the 2×2 will need to be re-drawn. New tools will pull tasks from rose into amber, from amber into emerald. Some tasks that look emerald today will turn out to have hidden downside that pulls them back into amber.

The skill that lasts is pattern recognition: looking at any task, asking instinctively “where does this live on verifiability and stakes?”, and adjusting accordingly. People who are durably good at AI for high-stakes knowledge work have internalized that question. They reach for AI confidently on the right tasks, skeptically on others, and not at all on the rest. They calibrate verification to the downside, not the upside. They build their own task taxonomies for their own work, and they update them as the models change.

That is not a skill about AI. It is a skill about deciding where to spend your attention. Which is the actual job.