Building Your Firm's Knowledge Base: What to Put In, What to Leave Out

A boutique firm we know did the obvious thing. They had a knowledge base now, so they filled it. Six years of shared drive — every proposal, every deck, every draft, the “FINAL” and the “FINAL-v3” and the “FINAL-actually-use-this” — dragged in over a long weekend. Forty-two hundred documents. The logic was intuitive: more material, more memory, smarter answers.

The answers got worse. Asked for the firm's position on a recurring client question, the assistant returned three different positions, each confidently sourced, each from a memo that had since been superseded, the memos dated about a year apart. Asked for the approved methodology, it blended a current framework with a deprecated one that still lived in a 2022 pitch. Nothing it said was invented. Everything it said was retrieved. That was the problem.

Here is the uncomfortable mechanic underneath that story: a retrieval system is only ever as good as what it retrieves. The model does not weigh your documents for currency or authority. It pulls what matches the question and writes from what it pulled. Feed it three contradictory versions of the same truth and it will faithfully surface the contradiction — usually without telling you that is what it did. More documents is not more intelligence. The right documents are.
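A toy sketch makes the mechanic concrete. It assumes a deliberately crude retriever that scores by shared words, standing in for real semantic search; the `Memo` class, the `retrieve` function, and the sample memos are all hypothetical. The point is what the ranking ignores: nothing in it looks at the year or at whether a memo has been superseded.

```python
import re
from dataclasses import dataclass


@dataclass
class Memo:
    title: str
    year: int
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, corpus: list[Memo], k: int = 3) -> list[Memo]:
    """Return the k memos sharing the most words with the question.

    Only the match with the question counts. The `year` field is never read.
    """
    q = tokenize(question)
    ranked = sorted(corpus, key=lambda m: len(q & tokenize(m.text)), reverse=True)
    return ranked[:k]


# Hypothetical corpus: three versions of the same position, plus one unrelated memo.
corpus = [
    Memo("Partnerships memo v1", 2022, "Our position on regional partnerships is to avoid them."),
    Memo("Partnerships memo v2", 2023, "Our position on regional partnerships is to pilot two."),
    Memo("Partnerships memo v3", 2024, "Our position on regional partnerships is to expand them."),
    Memo("Office move", 2024, "The new office opens in March."),
]

hits = retrieve("What is our position on regional partnerships?", corpus)
for memo in hits:
    print(memo.year, memo.title)
```

All three partnership memos match the question equally well, so all three come back, and whatever writes the answer now has three positions to work from.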

This is the discipline nobody puts in the demo: information architecture for retrieval. Canonical sources, a taxonomy that holds up, clear ownership of what stays current, and — the part teams resist most — a deliberate decision about what to leave out. Get this right and the platform's promise comes true: the more you put in, the more every module has to retrieve from — and the sharper its answers. Get it wrong and you have built a very expensive way to retrieve your own mistakes.

The dump-everything reflex: 95% of enterprise generative-AI pilots fail to reach production. MIT's analysis traces the failure not to the models but to data and context: the corpus, not the engine.

The curated-corpus payoff: ~5% of firms are capturing AI value at scale, the ones investing in context, not just tools. The difference is rarely the model. It is what surrounds the model: structured, current, retrievable knowledge.

The two numbers are the same finding read from both ends. MIT's GenAI Divide report — drawn from 150 leader interviews and 300 public deployments — documents a 95% pilot failure rate, and the root cause it keeps returning to is data and context rather than model capability. BCG's September 2025 survey of more than 1,250 firms finds the inverse: roughly 5% are achieving value at scale, and they get there by redesigning the work around the AI rather than pasting the AI onto the work. McKinsey's State of AI names the data foundation explicitly: the high performers treat their knowledge as infrastructure, not exhaust. Your knowledge base is that foundation. What you put into it — and what you keep out — is the single highest-leverage decision in the entire deployment.

Canonical sources: one authoritative version

Start with the failure mode from the opening, because it is the most common and the most damaging. A firm's shared drive does not contain its knowledge. It contains every attempt at its knowledge, layered in sediment, with no signal for which layer is current. Humans navigate this by social knowledge — “oh, ignore that folder, Dana moved everything in March” — that no retrieval system has access to.

A canonical source is the one version of a thing that is true right now. One messaging framework, not five. One approved bio per principal, not the 2021 bio and the 2023 bio and the conference-program bio. One statement of methodology. The discipline is not cleverness; it is editorial nerve — deciding, for each category of knowledge, what the single authoritative artifact is, and committing to it.

This matters more in a retrieval system than it ever did in a folder, because the failure is silent. In a shared drive, a stale document sits inert until someone opens it. In a knowledge base, a stale document is an active participant in every answer that matches it. It does not wait to be wrong. It volunteers.

| Knowledge type | Shared-drive reality | Canonical version |
| --- | --- | --- |
| Messaging framework | Five “final” decks across three years. | One current framework. Prior versions archived offline. |
| Principal bios | One per pitch, drifting in title and tone. | One approved bio each, owned by the principal. |
| Methodology | Current method tangled with a deprecated one. | The approved methodology, dated and singular. |
| Boilerplate | Copy-pasted, mutated per proposal, never reconciled. | One approved block per use, the source of truth. |

The test for a canonical source is a question you can answer in one breath: if two documents in the corpus disagree, which one is right? If you cannot answer that instantly for a given category, you do not yet have a canonical source for it — you have candidates. Resolve the candidates before they reach the retrieval layer, not after the assistant has already blended them into an answer a client is reading.
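The one-breath test can be run as a pre-upload check. The sketch below assumes each document carries a category and a flag saying whether the firm has committed to it as canonical; the `Doc` class, the `unresolved_categories` function, and the file names are hypothetical, not part of any product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    category: str    # e.g. "messaging framework", "methodology"
    canonical: bool  # True only for the version the firm has committed to


def unresolved_categories(docs: list[Doc]) -> dict[str, list[str]]:
    """Return the categories that do not have exactly one canonical document."""
    by_category: dict[str, list[Doc]] = defaultdict(list)
    for doc in docs:
        by_category[doc.category].append(doc)

    problems: dict[str, list[str]] = {}
    for category, members in by_category.items():
        canonical = [d for d in members if d.canonical]
        if len(canonical) != 1:
            # Zero means only candidates; two or more means a decision not yet made.
            problems[category] = [d.name for d in members]
    return problems


# Hypothetical inventory taken before anything reaches the retrieval layer.
docs = [
    Doc("framework-2022.pptx", "messaging framework", False),
    Doc("framework-2024.pptx", "messaging framework", False),
    Doc("methodology-current.docx", "methodology", True),
    Doc("methodology-2022-pitch.pdf", "methodology", False),
]

print(unresolved_categories(docs))
```

Here the methodology category passes, because one document has been named the authority. The messaging framework does not: it has two candidates and no decision.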

What to put in

The instinct to dump everything comes from a real fear: that leaving something out means losing it. But a knowledge base is not an archive. Its job is not to preserve every artifact your firm has ever produced. Its job is to make the firm's operative knowledge — the material that should actually ground new work — retrievable on demand. That is a much smaller, much sharper set.

Five categories earn their place because each one directly shapes better output when an answer is grounded in it:

Strategic plans and positioning. The documents that define what the firm believes and where it is going. When the assistant grounds a draft in these, the output inherits the firm's actual point of view instead of a generic one.
Approved frameworks and methodologies. The repeatable intellectual property — the models, the processes, the named approaches a client is paying for. This is the highest-value material in the corpus because it is the most reused.
Style guides and voice samples. Not a paragraph that says “we are authoritative but warm,” but actual exemplar prose the firm is proud of. Voice transfers through artifacts, not adjectives.
Exemplar deliverables. The best version of each recurring output type — the proposal that won, the report that set the standard, the memo everyone quietly copies. These become the shape new work is poured into.
Decision records. The why behind the what. Why a recommendation went one way, why a sponsor prefers a certain framing, what a post-mortem concluded. This is the layer that normally lives only in someone's head — and walks out when they do.

Notice the through-line. Every category on this list is something a new hire would otherwise have to absorb by osmosis over a year of watching how the firm works. That is exactly the knowledge worth making retrievable, because it is exactly the knowledge that is expensive to reconstruct and easy to lose.

In Hone Studio

The Knowledge Base ingests PDF, DOCX, Markdown, CSV, and Excel, then reads, processes, and indexes each file automatically so every other module can draw on it. Search works by meaning rather than keywords — ask for “our position on regional partnerships” and it surfaces the relevant strategic plan even if those exact words never appear in it. Document preview and metadata let you confirm what each entry actually contains before it grounds a client deliverable. The curation discipline in this post is what turns “more documents in, sharper answers out” from a slogan into a fact.

What to leave out

This is the section teams skip, and skipping it is why the firm in the opening got worse answers from more documents. Exclusion is not neglect. It is the most active editorial decision in the whole exercise, because every document you admit becomes eligible to ground an answer — and some documents are actively harmful in that role.

Put in

Current strategic plans and positioning
Approved frameworks, methods, and named IP
Style guides and exemplar prose
Best-in-class deliverables, one per type
Decision records and rationale
Approved boilerplate and fact sheets

Leave out

Superseded drafts and earlier “final” versions
Contradictory copies of the same artifact
Low-signal noise: scratch notes, dead threads
Material that must never ground client work
Anything you cannot vouch for the currency of
Confidential third-party material outside its perimeter

Four classes of material belong outside the retrieval layer:

Superseded drafts. The single biggest source of contradiction. Every earlier version of a living document is a quiet vote for an answer you no longer endorse. Archive them somewhere a human can find them for legal or historical reasons — just not where the retrieval layer can reach them.

Contradictory versions. When two current documents genuinely disagree, that is not material to retrieve; it is a decision you have not made yet. Make it. Pick the canonical one. A retrieval system cannot adjudicate your firm's position for you, and it will not try — it will simply hand both to the next person who asks.

Low-signal noise. Scratch notes, abandoned threads, half-formed thinking. These do not just fail to help; they dilute. A retrieval step passes only a limited number of chunks to the model for each question, so every irrelevant chunk that wins a slot displaces a chunk of real signal. Spend those slots on substance.

Anything that should not ground client work. Internal candor about a client, unvetted competitive intelligence, personal notes, material under someone else's confidentiality perimeter. The question is not “is this useful to a human?” It is “am I comfortable with this shaping an answer that goes out the door?” If the honest answer is no, it stays out.
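The dilution described under low-signal noise is easy to show in miniature. The sketch assumes a retriever that keeps only the top `k` chunks per question and, as a stand-in for semantic similarity, ranks by shared words; the `top_k` function and the sample text are hypothetical.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase a string and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k(question: str, chunks: list[str], k: int) -> list[str]:
    """Rank chunks by shared-word count with the question and keep the top k."""
    q = words(question)
    return sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)[:k]


question = "pricing methodology for retainers"
signal = "Approved pricing methodology: retainers are scoped and billed quarterly."
noise = [
    "scratch note: ideas for pricing methodology for retainers, not reviewed",
    "old thread: is the pricing methodology for retainers changing, no answer",
]

curated = top_k(question, [signal], k=2)
dumped = top_k(question, [signal] + noise, k=2)

print(signal in curated)  # True
print(signal in dumped)   # False
```

The scratch note and the dead thread happen to echo the question more closely than the approved document does, so with only two slots available the approved document never surfaces.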

The currency test

Before any document enters the corpus, one question gates it: can I vouch that this is true right now? Not “was this true when it was written” — true today. A funded proposal from 2023 may be a superb exemplar of structure and voice while containing budget figures that are now wrong. Admit it for the form; flag or strip the stale specifics. Currency is not a property of the document. It is a judgment only a person can make.
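One way to keep that judgment with a person, while still recording it, is a simple admission gate. This is a minimal sketch under stated assumptions: the `Candidate` fields, the three outcomes, and the sample names and dates are all hypothetical. The code does not decide whether a document is current; it only refuses anything nobody has vouched for.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Candidate:
    name: str
    vouched_by: Optional[str] = None   # person who confirmed it is true today
    vouched_on: Optional[date] = None  # when they confirmed it
    stale_specifics: list[str] = field(default_factory=list)  # parts kept for form only


def gate(doc: Candidate) -> str:
    """Return 'reject', 'admit-with-flags', or 'admit' for a candidate document."""
    if doc.vouched_by is None or doc.vouched_on is None:
        return "reject"  # nobody has vouched for it
    if doc.stale_specifics:
        return "admit-with-flags"  # good exemplar, but some details must be flagged or stripped
    return "admit"


# Hypothetical candidates.
proposal = Candidate(
    "funded-proposal-2023.docx",
    vouched_by="engagement lead",
    vouched_on=date(2025, 1, 10),
    stale_specifics=["budget figures"],
)
unknown = Candidate("notes-from-shared-drive.docx")

print(gate(proposal))  # admit-with-flags
print(gate(unknown))   # reject
```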

Taxonomy and ownership

A curated corpus still needs structure, for the same reason a well-stocked kitchen still needs a layout. Taxonomy is how you keep the retrieval layer's attention pointed at the right shelf, and how a human keeps the whole thing legible enough to maintain. Two levers do most of the work: organization and ownership.

Organization means folders and tags that match how the firm actually thinks — by client, by deliverable type, by practice area, by status. The goal is not a perfect ontology; it is a structure that lets you scope retrieval (“answer only from approved frameworks”) and lets a maintainer find the canonical document in seconds when it needs updating. A taxonomy nobody can navigate is a taxonomy nobody will maintain.
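Scoping is what the structure buys you. As a rough illustration, assume each entry carries a folder and a set of tags, and that retrieval can be narrowed before any search runs; the `Entry` class, the `scope` function, and the tag names are hypothetical and do not describe any particular product's API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Entry:
    title: str
    folder: str           # e.g. "frameworks", "proposals"
    tags: frozenset[str]  # e.g. {"approved", "practice:strategy"}


def scope(
    corpus: list[Entry],
    folder: Optional[str] = None,
    required_tags: Iterable[str] = (),
) -> list[Entry]:
    """Keep only entries in the given folder that carry every required tag."""
    required = set(required_tags)
    return [
        e for e in corpus
        if (folder is None or e.folder == folder) and required <= e.tags
    ]


# Hypothetical corpus.
corpus = [
    Entry("Growth framework", "frameworks", frozenset({"approved", "practice:strategy"})),
    Entry("Growth framework (draft)", "frameworks", frozenset({"draft"})),
    Entry("Winning proposal", "proposals", frozenset({"approved", "exemplar"})),
]

# "Answer only from approved frameworks."
for entry in scope(corpus, folder="frameworks", required_tags={"approved"}):
    print(entry.title)
```

The same tags serve the maintainer: filtering by folder and status is also how someone finds the canonical document in seconds when it needs updating.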

Corpus (everything the firm has admitted) → Folder (a category the firm thinks in) → Canonical doc (tagged, owned, current: one authoritative version, with someone's name on it).

Structure narrows from everything admitted, to a category, to the one version that grounds the answer.

Ownership is the lever firms forget, and it is the one that decides whether the corpus stays trustworthy six months in. A knowledge base is not a build; it is a living system, and living systems decay without a gardener. Every category of knowledge needs a person accountable for its currency — someone whose job is to notice when the positioning shifts, when a framework is retired, when last year's exemplar has been beaten by this year's.

Ownership & currency

Who notices when this becomes wrong?

For every category in the corpus, name the person accountable for its currency and the cadence on which they review it — quarterly for fast-moving positioning, annually for stable methodology. A document with no owner is a document that will be correct on the day it is uploaded and slowly, invisibly wrong after that. The curation is the institutional memory; the ownership is what keeps the memory honest.
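Naming an owner and a cadence is enough to make decay visible. The sketch below assumes a small register of categories; the `Category` class, the `needs_attention` function, the day counts chosen for each cadence, and the sample owners and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumed day counts for each review cadence.
CADENCE_DAYS = {"quarterly": 91, "annually": 365}


@dataclass
class Category:
    name: str
    owner: Optional[str]
    cadence: str  # a key of CADENCE_DAYS
    last_reviewed: date


def needs_attention(categories: list[Category], today: date) -> list[tuple[str, str]]:
    """Return (name, reason) pairs for categories with no owner or an overdue review."""
    flagged = []
    for c in categories:
        if not c.owner:
            flagged.append((c.name, "no owner"))
        elif today - c.last_reviewed > timedelta(days=CADENCE_DAYS[c.cadence]):
            flagged.append((c.name, f"review overdue, owner {c.owner}"))
    return flagged


# Hypothetical register.
register = [
    Category("positioning", "Dana", "quarterly", date(2025, 1, 15)),
    Category("methodology", "Lee", "annually", date(2024, 9, 1)),
    Category("boilerplate", None, "annually", date(2025, 5, 1)),
]

for name, reason in needs_attention(register, today=date(2025, 6, 1)):
    print(name, "->", reason)
```

The check cannot tell whether a document is still true. It only answers the question in the heading: it shows where nobody is positioned to notice.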

Letting it compound

Here is what the dump-everything reflex gets backwards. “More documents, better answers” is true, but only with the word good silently in front of “documents.” A curated corpus compounds; a dumped one congeals. The difference is whether each new document raises the average quality of what gets retrieved or lowers it.

When the discipline holds, the flywheel is real and it is the entire point. McKinsey's high performers do not win by buying a better model than their competitors; they win by treating their knowledge as a compounding asset and redesigning the work to draw on it. BCG's finding is the same one stated as a warning: the firms stuck at zero return are the ones who automated tasks without ever building the foundation the tasks were supposed to draw on. A clean, canonical, owned knowledge base is that foundation. Each good document you add makes the next draft, the next answer, the next review a little more like the firm's best work and a little less like the internet's average.

In Hone Studio

A well-curated Knowledge Base is the ground everything else stands on. The Assistant grounds its answers in your documents with inline citations, so a reviewer verifies by clicking the source rather than redoing the research. Memory captures the tacit layer (the decisions, the preferences, the judgment calls) automatically as you work, so the rationale behind a choice survives the person who made it. Citations are assistive, not a guarantee of correctness: they trace a claim back to your source material so a human can verify it quickly, and every output is a draft until a person signs off. The cleaner the corpus underneath, the more that grounding is worth.

The corpus is the institution

It is tempting to think of building a knowledge base as a data-migration project — a weekend of dragging files, a box to check. The firm in the opening thought so, which is why they dragged in forty-two hundred documents and got dumber answers. The actual work is not migration. It is editorial. It is deciding, document by document, what your firm knows to be true right now, who owns that truth, and what you are willing to let speak on the firm's behalf.

Done that way, the knowledge base stops being storage and becomes something closer to the firm's considered judgment about itself — rendered legible, retrievable, and durable against the day the person who held it in their head moves on. A shared drive is everything the firm ever did. A knowledge base, curated well, is what the firm has decided is true. The curation is not overhead on the way to the institutional memory.

The curation is the institutional memory.