Vector, Vectorless, or Hybrid? How to Choose a RAG Architecture for Your Business Documents.
A compliance team deploys a RAG system across 300 insurance policy documents. QA looks fine — cosine similarity scores above 0.85, answers traced to the right files. Six weeks into production, they find it has been citing a superseded coverage clause, confidently and repeatedly, because the retrieved chunk scored high on semantic similarity. The word "liability" appeared in both the old and the new version. The model had no idea one document had replaced the other.
The problem wasn't the embedding model. It wasn't chunk size or overlap settings. The problem was that the team chose a retrieval architecture — vector-based RAG — that treats every document as a flat collection of semantically comparable fragments. For a corpus of hierarchical policy documents where the relationship between sections determines the correct answer, that choice was wrong from the start.
What RAG Architecture Actually Means — And Why the Default Is Breaking
RAG architecture is the set of decisions that govern how context gets retrieved before an LLM generates a response. Most teams treat it as infrastructure — configure it once, move on, tune the model. That framing is a mistake. The retrieval layer determines what the model sees, and what it sees determines whether its answer is grounded in the right evidence or in something that merely sounds related.
There are now three meaningfully distinct retrieval approaches. Vector RAG converts documents into chunks, embeds those chunks using a model like OpenAI's text-embedding-3 or Cohere Embed, stores the resulting vectors in a database like Pinecone, Weaviate, or pgvector, and retrieves whichever chunks are closest to the query in vector space. Vectorless RAG is architecturally different: instead of embedding chunks, it builds a hierarchical tree index from a document — a structured map of its sections and subsections — and uses an LLM to reason through that tree to find the relevant section. The primary open-source implementation is PageIndex, published by VectifyAI. Hybrid RAG combines both: vector search across a document corpus for discovery, tree-based reasoning within documents for precise extraction.
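The vector-RAG retrieval step can be sketched in a few lines. This is a minimal illustration, not a production pipeline: a toy bag-of-words embedder stands in for a real model like text-embedding-3 or Cohere Embed, and an in-memory list stands in for Pinecone, Weaviate, or pgvector.

```python
import numpy as np

def toy_embed(text, vocab):
    # Toy bag-of-words embedding, normalised so the dot product below
    # is cosine similarity. A real system calls an embedding model here.
    counts = np.array([text.lower().split().count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(counts)
    return counts / norm if norm else counts

def retrieve(query, chunks, vocab, top_k=2):
    # Rank chunks by cosine similarity to the query in vector space.
    q = toy_embed(query, vocab)
    scored = [(float(toy_embed(c, vocab) @ q), c) for c in chunks]
    return [c for _, c in sorted(scored, reverse=True)[:top_k]]

chunks = [
    "liability coverage is limited to one million dollars",
    "this clause supersedes the prior liability limit",
    "premium payments are due monthly",
]
vocab = sorted({w for c in chunks for w in c.split()})
print(retrieve("what is the liability limit?", chunks, vocab))
```

Note what even this toy run illustrates: a query about the liability limit retrieves both the old clause and the one that supersedes it, because both score on "liability". Similarity ranks them; it cannot tell you which one is in force.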
The inflection point arrived in February 2026 when VectifyAI published results for Mafin 2.5, their financial document system built on PageIndex. On FinanceBench — a benchmark using real SEC filings that tests exact answers across complex, hierarchical reports — Mafin 2.5 achieved 98.7% accuracy against roughly 30–50% for standard vector-based RAG on the same dataset (VectifyAI, GitHub, 2026). The gap exists because financial documents contain structural semantics that chunking destroys: cross-references, nested tables, footnotes that modify earlier clauses. Worth noting: this benchmark tested single, well-structured documents under controlled conditions. Multi-document performance is a different story, and it matters for how you read those numbers.
If you're debating between building better retrieval versus fine-tuning your model, the retrieval decision almost always has more leverage — a point we covered in detail in RAG vs. fine-tuning.
The Mistake Most Teams Are Making
When retrieval quality drops, most engineers adjust chunk size, increase overlap, or swap embedding models. These are sensible moves if the problem is a configuration issue. They're completely useless if the problem is architectural — and yet the architectural question almost never gets asked.
The reason is how most engineers learned RAG. The standard tutorial fixes the pipeline: chunk, embed, retrieve, generate. The architecture is assumed, not chosen. Teams tune what's tunable — chunk overlap, top-k, temperature — and never interrogate whether vector similarity search is the right mechanism for the shape of their data.
Mingtian Zhang, co-creator of PageIndex at VectifyAI, frames the core distinction sharply: vector retrieval finds text that sounds like the query, while reasoning-based retrieval finds text that logically contains the answer. For many business documents, those are not the same search. A compliance officer asking "what are the indemnification limits for third-party IP claims?" doesn't need the chunks that are semantically closest to "indemnification limits." They need the clause that legally defines that limit, which may sit in a section that doesn't mention "indemnification" at all but is cross-referenced from the one that does.
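The distinction can be made concrete with a sketch of reasoning-based retrieval. This is a hypothetical illustration, not PageIndex's actual API: `choose_child` stands in for an LLM call that reads section titles and picks the branch most likely to contain the answer, and the tree contents are invented.

```python
# Hypothetical document tree for a master services agreement.
tree = {
    "title": "Master Services Agreement",
    "children": [
        {"title": "4. Indemnification",
         "children": [
             {"title": "4.2 Third-Party IP Claims",
              "text": "Limits are defined in Appendix A, Section A.3."},
         ]},
        {"title": "Appendix A: Definitions and Limits",
         "children": [
             {"title": "A.3 Liability Caps",
              "text": "Aggregate cap: 2x fees paid in the prior 12 months."},
         ]},
    ],
}

def choose_child(node, query):
    # Stand-in for an LLM reasoning step: pick the child whose title
    # suggests it logically contains the answer (here, crude matching).
    for child in node["children"]:
        if any(w in child["title"].lower() for w in query.lower().split()):
            return child
    return node["children"][0]

def navigate(node, query, path=()):
    # Walk the tree node by node, recording the path as an audit trail.
    path = path + (node["title"],)
    if "text" in node:
        return node["text"], path
    return navigate(choose_child(node, query), query, path)
```

Two things fall out of the structure itself: the answer at 4.2 is a cross-reference into Appendix A, which similarity search over flat chunks would never surface as related, and every result carries the path the retrieval took, which is the audit trail vector search cannot give you.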
The reframe is this: your retrieval strategy should follow from the structure of your documents, not the capabilities of your LLM. Two variables dominate the decision — whether your documents have meaningful internal hierarchy, and whether you're querying deeply into a few large documents or broadly across many smaller ones. Everything else is secondary.
A Decision Framework: Matching Architecture to Document Type
Three questions determine which RAG architecture fits your use case. Answer them before you write a line of ingestion code.
Does hierarchy carry meaning in your documents? Contracts, regulatory filings, medical records, insurance policies, and technical specifications are documents where the relationship between sections is structurally load-bearing. A clause in Section 4 that modifies a definition from Appendix A can't be understood without both, and knowing they're related requires preserving structure through the retrieval process. If hierarchy matters, vector RAG will generate retrieval failures that prompt engineering cannot fix — because the problem is upstream of the prompt.
Are you querying one document deeply or many documents broadly? PageIndex and similar tree-based approaches work by building a hierarchical index per document, which requires LLM calls during ingestion and LLM reasoning during retrieval. That per-document cost is justifiable for a handful of large, authoritative source documents — a 400-page clinical trial protocol, a master services agreement, a 10-K filing. It's not justifiable when your corpus has thousands of documents and you're doing cross-corpus search. At that scale, vector retrieval is the only practical option.
Does your use case require explainable retrieval paths? In healthcare, insurance, and financial services, "the AI retrieved this" isn't an audit trail. Decisions need to be traceable to specific sections of specific documents, often for regulatory or compliance reasons. Vectorless retrieval is inherently explainable — every answer comes with a documented path through the document tree, showing exactly which node the LLM navigated to and why. Vector retrieval is not explainable in that sense. A chunk came back because its embedding was close in high-dimensional space, and that explanation satisfies almost nobody in a compliance review.
Most enterprise document AI systems don't answer all three questions the same way. A company querying across thousands of insurance policies needs cross-corpus vector search — but also needs precise, traceable extraction within each matched policy. That's where hybrid RAG becomes the correct answer, not as a compromise but as the only architecture that addresses both requirements. Getting the data pipeline architecture right at the ingestion stage is what makes hybrid retrieval tractable at scale.
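The three questions above can be encoded as a small decision function. The thresholds are illustrative assumptions, not prescriptions: where "a handful" of documents ends and "a broad corpus" begins depends on your ingestion budget and query patterns.

```python
def choose_rag_architecture(hierarchical: bool,
                            corpus_size: int,
                            needs_audit_trail: bool) -> str:
    # Encodes the three-question framework from this section.
    # The corpus-size cutoff is an illustrative assumption.
    broad_corpus = corpus_size > 100
    if broad_corpus and (hierarchical or needs_audit_trail):
        # Cross-corpus discovery plus precise in-document extraction.
        return "hybrid"
    if hierarchical or needs_audit_trail:
        # Few deep documents where structure or traceability matters.
        return "vectorless"
    # Flat documents, broad search: vector retrieval is the fit.
    return "vector"
```

Treat the function as a conversation starter for an architecture review, not a substitute for one: the point is that the answer follows mechanically from properties of your documents, not from model capabilities.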
The Failure Mode Nobody Is Talking About
The most common deployment mistake happening right now is applying vectorless RAG to the wrong problem because the benchmark numbers are compelling. Teams read 98.7% on FinanceBench, observe that their own vector RAG is underperforming on complex documents, and assume switching to tree-based retrieval will solve it. It won't — and the failure is expensive to discover.
An independent benchmark by ML engineer Abhijit Khuperkar, published in March 2026, tested PageIndex directly against vector retrieval across multiple enterprise document scenarios using the same FinanceBench dataset. The result: vectorless retrieval performs strongly in single-document structured environments, but vector retrieval showed approximately 40% better evidence coverage in multi-document retrieval scenarios. The failure mechanism is specific, not random. When PageIndex encountered multiple financial filings with shared section headings — "Risk Factors," "Management Discussion," "Financial Highlights" — the LLM reasoning step selected sections based on structural similarity rather than document identity. It couldn't disambiguate across documents the way embeddings can.
This doesn't make vectorless RAG wrong. It makes it wrong for a multi-document corpus. If you're building a system to answer deep questions from a single structured document, PageIndex will likely outperform your current vector pipeline significantly. If you're building a system to search across 5,000 contracts and extract relevant clauses, pure tree-based retrieval will fail in ways that are hard to debug — because the traversal logic isn't designed for cross-document disambiguation. The failure shows up as confident, structurally plausible, factually wrong answers. The worst kind.
What This Looks Like in a Real System
When we built the document intelligence system behind Artinoid's medical claims AI platform, the core retrieval challenge was this: claims adjusters needed precise answers from individual clinical records — complex, hierarchically structured documents where the relationship between a diagnosis code, a procedure note, and a prior authorisation clause within the same record determined the claim outcome.
Vector RAG produced retrieval errors here because chunking fragmented the structural dependencies. A procedure note chunk would score high on semantic similarity to a claim query. But without the adjacent diagnosis context — which was in a different chunk — the extracted answer was technically present in the source but clinically incomplete. High similarity score, wrong answer.
The solution was a two-stage pipeline: vector search at the corpus level to surface the top candidate documents for a given claim, then tree-based reasoning within each candidate to extract the precise clinical evidence with a traceable path. Vector search handled "which documents are relevant to this claim?" efficiently. Tree navigation handled "within this document, exactly what supports or contradicts this claim?" with the precision that compliance requires.
The same architecture drove the CoverWise insurance document AI: vector search across the policy corpus, structure-aware extraction within matched documents. The architecture decision — not the model selection — determined retrieval accuracy. Choosing the right retrieval approach is a design exercise before it's a machine learning problem.
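The two-stage pattern described above can be sketched as follows. Both stages are stubbed: `vector_search` stands in for a real vector-store query and `navigate_tree` for a tree-reasoning step such as PageIndex; the function names, document schema, and matching logic are all illustrative assumptions.

```python
def vector_search(query, corpus, top_k=3):
    # Stage 1: cheap cross-corpus discovery. Stubbed as keyword overlap;
    # a real system queries pgvector, Pinecone, or Weaviate here.
    scored = sorted(corpus, key=lambda d: -sum(w in d["summary"].lower()
                                               for w in query.lower().split()))
    return scored[:top_k]

def navigate_tree(doc, query):
    # Stage 2: precise, traceable extraction within one document.
    # Stubbed as heading matching; a real system reasons over the tree.
    for section in doc["sections"]:
        if any(w in section["heading"].lower() for w in query.lower().split()):
            return {"doc": doc["id"], "section": section["heading"],
                    "evidence": section["text"]}
    return None

def answer(query, corpus):
    candidates = vector_search(query, corpus)             # which documents?
    hits = [navigate_tree(d, query) for d in candidates]  # which clauses?
    return [h for h in hits if h]

corpus = [
    {"id": "policy-001", "summary": "auto liability policy",
     "sections": [{"heading": "Liability Limits", "text": "Limit: $1M."}]},
    {"id": "policy-002", "summary": "home fire policy",
     "sections": [{"heading": "Fire Coverage", "text": "Dwelling covered."}]},
]
print(answer("liability limits", corpus))
```

The design point is the split itself: stage one only has to be good at recall across the corpus, stage two only has to be precise within one document, and each result carries the document and section it came from.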
Three Things to Do Before Choosing an Architecture
Audit your existing retrieval failures before changing anything. Pull 50 real queries from your production logs or evaluation set and categorise why answers failed. "Wrong document retrieved" failures are a cross-corpus recall problem — vector retrieval improvements are the likely fix. "Right document, wrong section" failures are structural retrieval problems — tree-based approaches are where you should look. Most teams find both types, which tells you the answer before you've run a benchmark.
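The audit step reduces to a tally once each failed query is labelled. The labels and data below are invented for illustration; the categorisation of your real queries is the manual part that matters.

```python
from collections import Counter

# Illustrative labels, not real log data: each failed query gets one
# of the two failure types described above.
failures = [
    ("wrong_document", "Q14"), ("wrong_section", "Q07"),
    ("wrong_section", "Q23"), ("wrong_document", "Q31"),
    ("wrong_section", "Q40"),
]

tally = Counter(kind for kind, _ in failures)
if tally["wrong_document"] and tally["wrong_section"]:
    diagnosis = "hybrid"        # both types: two-stage pipeline
elif tally["wrong_section"]:
    diagnosis = "vectorless"    # structural failures: tree-based retrieval
else:
    diagnosis = "vector"        # recall failures: better vector search
print(tally, diagnosis)
```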
If you find structural retrieval failures, run a controlled comparison before committing. Take your 20 worst-performing queries on structured documents and run them through PageIndex alongside your existing pipeline. PageIndex is open-source — there's no reason to make this architecture decision on benchmark data from SEC filings when you can test it on your own documents. The comparison takes days, not weeks.
When you design the hybrid pipeline, don't dismiss the two-stage pattern as unnecessary complexity. Vector search to surface 3–5 candidate documents, tree-based reasoning within each candidate to extract the precise answer. The additional per-query latency is real — tree traversal adds LLM calls that vector similarity doesn't require — but for most document intelligence use cases it's well within acceptable bounds. That tradeoff should be made explicitly in your architecture review, not discovered six weeks after launch.
If you're building retrieval systems in regulated industries and want to work with a team that has shipped this in production, Artinoid's AI Engineering practice is worth a conversation.
The Architecture Is the Product
There's a widespread tendency to treat the retrieval layer as plumbing — configure it once while the real engineering happens on the model and application side. That's backwards. The retrieval layer is the epistemology of your AI system. It defines what your model is allowed to know when answering a question. Get it wrong, and you get a system that is fast, confident, and frequently incorrect.
Most teams overspend on model selection and prompt engineering and underspend on retrieval architecture. This is understandable — generation is more visible than retrieval, and failures at the generation layer are easier to diagnose. But in every enterprise document AI system we've built, the retrieval decisions — architecture type, corpus organisation, chunking strategy — have had more impact on answer quality than any model we chose.
The right question isn't whether vectorless RAG is better than vector RAG. The right question is whether your documents have structure that your retrieval system needs to respect, and whether your users need to know exactly why a specific piece of evidence was retrieved. Ask those first. The architecture follows directly from honest answers.
Talk to Artinoid's AI Engineering team about designing the right retrieval architecture for your document use case.