The reasoning bottleneck in Graph-RAG: Structured prompting and context compression for multi-hop QA
Date
2026
Authors
Zarrinkia, Yasaman
Abstract
Multi-hop question answering requires connecting facts scattered across multiple documents, a task where standard retrieval often falls short. Graph-based retrieval augmented generation (Graph-RAG) addresses this by building knowledge graphs from document collections and retrieving structured context that preserves entities, relations, and community summaries. Yet strong retrieval does not guarantee strong answers. This thesis studies the reasoning bottleneck in Graph-RAG and asks whether inference-time augmentations, requiring no retraining or re-indexing, can close the gap between what the retriever provides and what the model can actually use.
In an evaluation of KET-RAG, a leading Graph-RAG system, on three multi-hop benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA), 77% to 91% of questions already have the correct answer somewhere in the retrieved context, yet accuracy reaches only 23% to 67% for a budget 8-billion-parameter model and 35% to 78% for a 70-billion-parameter baseline. Decomposing the errors reveals that 73% to 84% are reasoning failures: the answer was in the context, but the model could not use it. Reasoning, not retrieval, is the dominant bottleneck.
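The error decomposition described above can be illustrated with a minimal sketch. It assumes a simple substring match on normalised text as the test for "the answer is in the retrieved context"; the thesis may use a different matching criterion, and the names `Example`, `classify`, and `decompose` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Example:
    gold: str        # reference answer
    prediction: str  # model output
    context: str     # retrieved context shown to the model


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so matching is not case-sensitive.
    return " ".join(text.lower().split())


def classify(ex: Example) -> str:
    # Assumption: substring containment stands in for answer matching.
    gold = normalize(ex.gold)
    if gold in normalize(ex.prediction):
        return "correct"
    if gold in normalize(ex.context):
        # The answer was retrieved, but the model did not produce it.
        return "reasoning_failure"
    # The answer never reached the model.
    return "retrieval_failure"


def decompose(examples: list[Example]) -> dict[str, int]:
    counts = {"correct": 0, "reasoning_failure": 0, "retrieval_failure": 0}
    for ex in examples:
        counts[classify(ex)] += 1
    return counts
```

Under this scheme, the share of reasoning failures is the `reasoning_failure` count divided by the sum of the two failure counts.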
To address this, the thesis studies three inference-time mechanisms. First, SPARQL-style chain-of-thought prompting decomposes questions into structured triple-pattern queries that mirror the entity–relationship layout of the retrieved context, improving accuracy by 2 to 14 percentage points. Second, graph-walk compression reduces the retrieved context by approximately 60% through knowledge-graph traversal with no additional model calls, adding 6 percentage points on average when paired with structured prompting on smaller models. Third, question-type routing selects between prompting strategies based on question structure. Surprisingly, combining all three enables a budget 8B model to match or exceed the unaugmented 70B baseline on all three benchmarks at approximately 12× lower inference cost. A transfer experiment on LightRAG, a second Graph-RAG system, confirms that structured prompting generalises across systems, while graph-walk compression works best when the retrieval pipeline produces clearly layered context, where some retrieved evidence is more central to the question than other material.
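As a rough illustration of the second mechanism, the sketch below compresses a set of retrieved triples by keeping only those reachable within a fixed number of hops from entities mentioned in the question. This is an assumption about what a graph walk could look like, not the thesis's actual procedure: the abstract does not specify the traversal, and `graph_walk_compress`, `seeds`, and `max_hops` are hypothetical names. Like the mechanism described above, it makes no model calls.

```python
from collections import deque

Triple = tuple[str, str, str]  # (subject, relation, object)


def graph_walk_compress(
    triples: list[Triple], seeds: set[str], max_hops: int = 2
) -> list[Triple]:
    # Build an undirected adjacency map over the retrieved entities.
    neighbours: dict[str, set[str]] = {}
    for subj, _, obj in triples:
        neighbours.setdefault(subj, set()).add(obj)
        neighbours.setdefault(obj, set()).add(subj)

    # Breadth-first walk from the seed entities, bounded by max_hops.
    depth = {entity: 0 for entity in seeds if entity in neighbours}
    queue = deque(depth)
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue
        for nxt in neighbours[node]:
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)

    # Keep only triples whose endpoints were both reached by the walk.
    return [t for t in triples if t[0] in depth and t[2] in depth]
```

The hop limit is the knob that trades context size against coverage: a smaller limit discards more peripheral triples, which is the kind of layering the transfer experiment suggests compression depends on.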
The main contributions are a quantitative decomposition of Graph-RAG errors into retrieval and reasoning failures, evidence that structured prompting is a system-agnostic reasoning augmentation, and the finding that low-cost inference-time interventions can close the gap between a budget model and a much larger baseline without any change to the retriever or index.