Enterprise RAG: why your AI assistants still hallucinate

A buyer asks the internal assistant about the late-delivery penalty in a supplier contract. The assistant answers, confident: “0.5% per day, capped at 10%.” It’s wrong. The actual clause says 0.1%, no cap. The assistant didn’t lie, exactly. It stitched a plausible answer out of badly retrieved fragments. That’s what a RAG hallucination looks like in practice: not a dramatic crash, but a calm, well-phrased mistake.

Retrieval-Augmented Generation was supposed to fix this. Point the LLM at your documents, have it answer from them, and the inventions stop. On paper. In production, assistants keep getting things confidently wrong, and the reason is almost always the same: the model isn’t the one hallucinating; retrieval is feeding it bad context. The “G” in RAG works fine. It’s the “R” that lets you down.

The real culprit is retrieval

A modern LLM reasons well over whatever you hand it. Give it the right three paragraphs and the answer is usually correct. The trouble lives upstream of the model call, in the retrieval step, where the flaws pile up unseen: the answer still sounds right, so nobody digs.

Naive chunking that cuts meaning in half

Most projects split documents into fixed N-character windows and call it done. The result: a table sliced down the middle, or a clause severed from the condition that made it true. The retrieved chunk is technically “relevant” by cosine similarity, but incomplete. The model gets half an idea and supplies the other half itself. That’s the classic breeding ground for hallucination.

Embeddings that don’t speak your domain

Generic embeddings pull together words that are semantically close “in general.” But in your domain, an “amendment” and an “appendix” are not synonyms, “provision” doesn’t carry its dictionary meaning, and a procedure reference like “PR-204-B” is, to a stock embedding, more or less noise. Vector search shines on diffuse meaning and stumbles on the exact: codes, acronyms, version numbers, proper nouns.

A top-k that’s too short, or too long

Pulling the three nearest passages feels reasonable. Except the right passage sometimes sits at position six, and you’ll never see it. Widen to top-20 and you drown the model in context: it latches onto the most repeated passage, not the most accurate one. Pick top-k by gut feel and you’re gambling with reliability.

And does your knowledge base even tell the truth?

Suppose retrieval were perfect. It still can’t be better than what it indexes. And enterprise document stores are a sedimented mess: v1, v2 and “v2_final_OK” of the same procedure all coexist, two HR notes contradict each other on leave policy, a 2019 PDF sits next to its 2024 rewrite. When retrieval surfaces both, the LLM picks. Usually the wrong one, and silently.

RAG amplifies document debt rather than hiding it. A human who hits two versions hesitates, asks around, cross-checks. The model just fuses them into one smooth, confident answer. That’s worse than no answer at all: it’s false certainty, delivered in the tone of the obvious.

Why the model fills the gaps

An LLM is trained to produce a plausible continuation, not to say “I don’t have that information.” Hand it partial context and a precise question, and it will complete the picture, because that’s its nature. With no instruction and no guardrail, staying silent simply isn’t one of its defaults.

Three blind spots show up in nearly every struggling project on top of that:

No verifiable citations. If the answer doesn’t point to the exact source passage, nobody can check it. And a claim you can’t verify ends up being taken on faith.
No relevance floor. When the best hit has a mediocre similarity score, many pipelines pass it to the model anyway. The LLM politely embroiders on it.
No success metric. Teams track how many questions get asked, never how many get answered correctly. With no number, hallucination stays an anecdote raised by an annoyed user, not a tracked defect.

That last one is the most expensive. You can’t fix what you don’t measure. Plenty of teams spend weeks polishing a system prompt while their retrieval misses one question in three. Effort on the wrong link in the chain, because nobody measured the chain.

How to make it reliable, concretely

The good news: none of these causes calls for a bigger model. It all lives in the pipeline, with components you can pick up today.

Chunk along the structure

Split by the document’s own logic (headings, sections, articles, table rows) rather than by a character count. Keep a small overlap between chunks and attach metadata to each: source, date, version, access rights, parent section. LlamaIndex and LangChain ship structure-aware splitters; for contracts or messy PDFs, a parsing pass that preserves layout pays off fast.
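
As an illustration of splitting along structure, here is a minimal sketch using only the Python standard library. It assumes headings look like "Article 7" or "Section 2", which is an assumption about your documents, not a rule. The names `Chunk` and `chunk_by_headings` are hypothetical, and the overlap between chunks is left out to keep the example short.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


# Assumption: section boundaries start with "Article N" or "Section N".
HEADING = re.compile(r"^(Article|Section)\s+\d+", re.IGNORECASE)


def chunk_by_headings(document: str, base_metadata: dict) -> list:
    """Split on heading lines so a clause stays with its conditions."""
    chunks = []
    section, buffer = "preamble", []

    def flush() -> None:
        # Emit the current section, skipping empty ones.
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(
                Chunk(
                    text=f"{section}\n{body}",
                    metadata={**base_metadata, "parent_section": section},
                )
            )

    for line in document.splitlines():
        stripped = line.strip()
        if HEADING.match(stripped):
            flush()
            section, buffer = stripped, []
        else:
            buffer.append(stripped)
    flush()
    return chunks
```

Each chunk carries the document-level metadata (source, date, version, access rights) plus its parent section, so later stages can filter and cite without reparsing the file.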

Search hybrid: BM25 plus vectors

Combine lexical search (BM25) with vector search. BM25 catches the exact matches embeddings miss: product codes, procedure references, proper nouns, acronyms. Vectors catch the meaning keywords miss. Elasticsearch and OpenSearch do both natively; on the PostgreSQL side, pgvector for vectors and built-in full-text for the lexical part cover most needs without stacking up components. Fuse the two rankings; don’t pick one over the other.
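
One common way to fuse the two rankings is Reciprocal Rank Fusion (RRF). It is only one option among several, and the article does not prescribe it. The sketch below assumes each search engine returns an ordered list of document ids; the function name is hypothetical and the constant `k=60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) in every list where it appears,
    so an item ranked well by both engines rises to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["pr-204-b", "doc-a", "doc-b"]    # lexical ranking
vector_hits = ["doc-a", "doc-c", "pr-204-b"]  # semantic ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

RRF works on ranks rather than raw scores, which sidesteps the fact that BM25 scores and cosine similarities are not on the same scale.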

Rerank before the model sees it

Retrieve wide (say the top 30 to 50 candidates), then run them through a cross-encoder reranker that scores how well each passage actually answers the specific question, and keep only the top of the pile. This is often the best effort-to-payoff improvement in the whole pipeline: a well-placed reranking step fixes errors that neither chunking nor embeddings ever addressed.
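
The retrieve-wide-then-rerank pattern can be sketched as below. The scorer is passed in as a function, on the assumption that a real pipeline would wrap a cross-encoder model there. `toy_overlap_score` is a hypothetical stand-in that only counts shared words, so the example runs without any model.

```python
from typing import Callable


def rerank(
    question: str,
    candidates: list,
    score: Callable[[str, str], float],
    keep: int = 5,
) -> list:
    """Score every (question, passage) pair and keep the best `keep`."""
    scored = [(passage, score(question, passage)) for passage in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


def toy_overlap_score(question: str, passage: str) -> float:
    # Stand-in for a cross-encoder: share of question words found in the passage.
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


candidates = [
    "The late delivery penalty is 0.1% per day",
    "Holiday policy for employees",
    "Delivery addresses are listed in appendix B",
]
top = rerank("what is the late delivery penalty", candidates, toy_overlap_score, keep=2)
```

Keeping the scorer pluggable means the reranking model can be swapped and compared on the evaluation set without touching the rest of the chain.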

Filter by freshness and rights

The metadata you set at chunking time earns its keep here. At query time, drop stale versions, favour the most recent when two documents overlap, and only expose what the user is allowed to see. Citing a repealed procedure isn’t merely getting it wrong: it’s dangerous, precisely because the answer reads as credible.
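
A minimal sketch of that query-time filter follows, assuming each passage carries a logical document id, a publication date, a repealed flag and the groups allowed to read it. All field and function names are hypothetical. Freshness is resolved before rights here, so a user who cannot see the latest version is not served an outdated one instead; that ordering is a design choice, not a requirement from the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Passage:
    doc_id: str               # logical document, shared by all its versions
    text: str
    published: date
    allowed_groups: frozenset
    repealed: bool = False


def filter_passages(passages: list, user_groups: set) -> list:
    """Drop repealed and stale versions, then enforce access rights."""
    live = [p for p in passages if not p.repealed]
    # Most recent publication date seen for each logical document.
    newest = {}
    for p in live:
        newest[p.doc_id] = max(newest.get(p.doc_id, p.published), p.published)
    return [
        p
        for p in live
        if p.published == newest[p.doc_id] and p.allowed_groups & user_groups
    ]
```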

Make citations and “I don’t know” mandatory

Two non-negotiables. Every claim links back to its source passage, clickable, checkable at a glance. And when the best relevance score stays below a threshold, the assistant replies “I couldn’t find reliable information” rather than improvising. An assistant that knows when to stay quiet beats one that always has an answer. That’s counter-intuitive for sponsors, who love a demo where everything works; it’s also the condition for trust that lasts.
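
Both rules can be enforced before the model is ever called. The sketch below assumes retrieval returns hits with a relevance score between 0 and 1. The `0.5` floor is a placeholder to calibrate on your own evaluation set, and `Hit` and `prepare_generation` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

REFUSAL = "I couldn't find reliable information."


@dataclass
class Hit:
    source: str   # where the passage comes from, shown to the user as a citation
    text: str
    score: float  # relevance score from retrieval or reranking


def prepare_generation(question: str, hits: list, min_score: float = 0.5) -> Optional[str]:
    """Return a citation-enforcing prompt, or None if nothing clears the floor."""
    usable = sorted(
        (h for h in hits if h.score >= min_score),
        key=lambda h: h.score,
        reverse=True,
    )
    if not usable:
        return None  # caller answers with REFUSAL and skips the model call
    sources = "\n".join(
        f"[{i}] ({h.source}) {h.text}" for i, h in enumerate(usable, start=1)
    )
    return (
        "Answer using only the numbered sources below. "
        "Cite the source number after every claim, e.g. [1]. "
        f'If the sources do not contain the answer, reply exactly: "{REFUSAL}"\n\n'
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
```

When the function returns `None`, the application replies with `REFUSAL` directly, so a weak hit never reaches the model to be embroidered on.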

Measure, on every change

Build an evaluation set: 50 to 200 real questions, each with its expected answer and the passage that justifies it. Replay it on every change (new model, chunking, threshold, index) and track two things separately: did retrieval surface the right passage, and is the final answer correct. Splitting those two measures tells you which link to fix. Without that replayed set, every “improvement” is a bet.
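
A replay harness for such a set can be very small. In the sketch below, `retrieve` and `answer` are hypothetical hooks into your own pipeline. Answer correctness is checked with a crude substring match, an assumption that only holds for short factual answers; a real setup would likely need a stricter or human-reviewed check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected_answer: str      # short fact the reply must contain
    expected_passage_id: str  # passage that justifies the answer


def evaluate(
    cases: list,
    retrieve: Callable[[str], list],
    answer: Callable[[str, list], str],
) -> dict:
    """Replay the set and report retrieval and answer quality separately."""
    if not cases:
        raise ValueError("evaluation set is empty")
    retrieval_hits = answer_hits = 0
    for case in cases:
        passage_ids = retrieve(case.question)
        if case.expected_passage_id in passage_ids:
            retrieval_hits += 1
        reply = answer(case.question, passage_ids)
        if case.expected_answer.lower() in reply.lower():
            answer_hits += 1
    return {
        "retrieval_hit_rate": retrieval_hits / len(cases),
        "answer_accuracy": answer_hits / len(cases),
    }
```

A low retrieval hit rate points at chunking, search or reranking; a good hit rate with poor answer accuracy points at the prompt or the model.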

What this means for the architecture

Put end to end, these fixes don’t add up to a magic prompt but to a processing chain with clear stages: ingestion and structured chunking, a dual lexical and vector index, hybrid search, reranking, freshness and rights filters, generation with mandatory citations and a refusal threshold, then evaluation in a loop. Orchestrating all of that by hand gets unmanageable quickly. This is where a workflow tool like n8n, with sources and tools exposed through MCP, keeps the pipeline readable, versioned and auditable, instead of leaving you with a script nobody dares touch anymore.

RAG isn’t the problem. Sloppy RAG is. The gap between an assistant that invents a clause and one that cites the right paragraph rarely comes down to the model, and almost always to retrieval quality and guardrails. That’s exactly what our industrialisation & automation offer covers; for the orchestration side, see our guides & resources.