Evaluating GraphRAG vs. RAG on Real-World Messages
📚 View all posts in the Graph-based Healthcare Series
Graph-based Healthcare Series — 6
This is the sixth post in our ongoing series on graph-based healthcare tools. Stay tuned for upcoming entries on clinical modeling, decision support systems, and graph-powered AI assistants.
In our previous post, we introduced a Bayesian diagnostic engine that uses synthetic patient data to quantify clinical evidence, score conditions, and update probabilities in a way that mirrors how clinicians think.
In this post, we zoom out from system architecture and generative modeling to answer a practical question:
How does GraphRAG perform on real patient messages compared to a traditional RAG system?
To evaluate this, we tested both systems on real-world caregiver messages from Last Mile Health (LMH), labeled and scored by GPT-4o, across 320 cases. The results provide a compelling look at the strengths, weaknesses, and tradeoffs of graph-structured retrieval in clinical QA tasks.
Dataset: Real-World Messages from LMH
We sampled ~3,000 frontline health messages sent between February and July 2025 via LMH’s caregiver programs. These messages ranged from simple status updates to complex diagnostic prompts.
Filtering for Medical Questions
Not all messages are clinically actionable, and most do not map directly to IMNCI protocols.
To extract a high-quality benchmark set:
- We used GPT-4o to filter for messages that were:
  - Clinically relevant to IMNCI
  - Diagnosable based on existing IMNCI classifications
- For each qualifying message, GPT-4o also generated the ground-truth classification labels (1–2 per message on average)
This resulted in a filtered benchmark set of 320 messages, each labeled with one or more IMNCI classifications.
NB: While LLM-based labeling introduces subjectivity, we applied real-time retries and validation checks, and ran both the RAG and GraphRAG pipelines through the same scoring procedure to ensure fairness.
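To make the labeling step concrete, here is a minimal sketch of how such a filter-and-label pass could look. It assumes OpenAI's Python client for GPT-4o; the prompt wording, JSON schema, and `label_message` helper are illustrative stand-ins, not our production pipeline.

```python
# Minimal sketch of a GPT-4o filter-and-label pass (illustrative only;
# the actual prompts, schema, and retry logic in our pipeline differ).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FILTER_PROMPT = """You are reviewing a caregiver message for a clinical QA benchmark.
Return JSON: {{"relevant": bool, "diagnosable": bool, "classifications": [str, ...]}}
- "relevant": the message is clinically relevant to IMNCI
- "diagnosable": it can be mapped to existing IMNCI classifications
- "classifications": the IMNCI classification labels that apply (usually 1-2)

Message: {message}"""

def label_message(message: str, max_retries: int = 3) -> dict | None:
    """Filter and label a single message; retry on malformed output."""
    for _ in range(max_retries):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": FILTER_PROMPT.format(message=message)}],
            response_format={"type": "json_object"},
            temperature=0,
        )
        try:
            result = json.loads(response.choices[0].message.content)
            # Basic validation: keep only messages that pass both filters
            # and carry at least one classification label.
            if result["relevant"] and result["diagnosable"] and result["classifications"]:
                return result
            return None  # filtered out, not an error
        except (json.JSONDecodeError, KeyError):
            continue  # malformed output -> retry
    return None
```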
Evaluation Setup
Shared Constraints and Controls
Both pipelines used similar models and constraints:
- Retrieval and generation with Gemini Flash models (2.0 and 2.5)
- Top-8 classification predictions returned per message (based on earlier synthetic eval tuning)
- Same number of API calls per pipeline to maintain parity
- Same evaluation logic and scoring criteria
🚫 🍎 Not a perfect apples-to-apples comparison: the internals of the RAG and GraphRAG pipelines necessarily differ, so absolute parity is impossible. We controlled for model capacity, retrieval scope, and number of inference steps, and kept conditions aligned wherever possible.
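For reference, a small sketch of the shared controls as a single configuration object consumed by both pipelines. The field names are ours; only the Gemini Flash model family and the top-8 cutoff come from the setup above.

```python
# Sketch of the controls shared by both pipelines (field names are ours;
# the actual pipeline configuration objects differ).
from dataclasses import dataclass

@dataclass(frozen=True)
class SharedEvalConfig:
    # Gemini Flash models (2.0 and 2.5) used for retrieval and generation
    models: tuple[str, ...] = ("gemini-2.0-flash", "gemini-2.5-flash")
    top_k: int = 8  # classification predictions returned per message
    # Both pipelines were also held to the same number of API calls per case
    # and scored with the same evaluation logic (not shown here).

CONFIG = SharedEvalConfig()
```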
Metrics Used
We measured three retrieval-focused metrics:
| Metric | Definition |
|---|---|
| All Hits @ k | % of cases where all ground-truth classifications appear in the top-k predictions |
| Any Hits @ k | % of cases where at least one ground-truth classification appears in the top-k predictions |
| Average Recall | Mean recall across all cases (the fraction of ground-truth classifications retrieved per case) |
These metrics reflect the core of what diagnostic retrieval is about: finding relevant classifications accurately and reliably.
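All three metrics are simple to compute once each case is reduced to a set of ground-truth labels and a ranked list of predictions. The sketch below shows one way to implement them; the label names in the example are hypothetical.

```python
# Sketch of the three retrieval metrics over a list of cases, where each case
# pairs the ground-truth label set with the ranked predicted classifications.
def all_hits_at_k(truth: set[str], preds: list[str], k: int = 8) -> bool:
    """True when every ground-truth classification appears in the top-k predictions."""
    return truth.issubset(set(preds[:k]))

def any_hits_at_k(truth: set[str], preds: list[str], k: int = 8) -> bool:
    """True when at least one ground-truth classification appears in the top-k."""
    return bool(truth & set(preds[:k]))

def recall_at_k(truth: set[str], preds: list[str], k: int = 8) -> float:
    """Fraction of ground-truth classifications found in the top-k predictions."""
    return len(truth & set(preds[:k])) / len(truth) if truth else 0.0

def evaluate(cases: list[tuple[set[str], list[str]]], k: int = 8) -> dict[str, float]:
    n = len(cases)
    return {
        f"all_hits@{k}": sum(all_hits_at_k(t, p, k) for t, p in cases) / n,
        f"any_hits@{k}": sum(any_hits_at_k(t, p, k) for t, p in cases) / n,
        "avg_recall": sum(recall_at_k(t, p, k) for t, p in cases) / n,
    }

# Example: one case with two ground-truth labels, one of which is retrieved.
print(evaluate([({"PNEUMONIA", "ANAEMIA"}, ["PNEUMONIA", "MALARIA"])], k=8))
# -> {'all_hits@8': 0.0, 'any_hits@8': 1.0, 'avg_recall': 0.5}
```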
Results
RAG Performance (Single Run)
| Metric | Result |
|---|---|
| All Hits @ 8 | 56.6% |
| Any Hits @ 8 | 87.8% |
| Average Recall | 71.8% |
Additional Notes:
- ❌ 15/320 messages returned no classifications (despite every message having valid labels)
- ❌ 5 classifications were hallucinated (labels that do not exist in the IMNCI graph, even though the valid classifications were provided in the evaluation prompt)
GraphRAG Performance (Averaged Over 3 Runs)
| Metric | Result (± stddev) |
|---|---|
| All Hits @ 8 | 83.0% ± 3.6% |
| Any Hits @ 8 | 97.3% ± 0.9% |
| Average Recall | 91.2% ± 2.2% |
Additional Notes:
- ✅ No hallucinations across all runs
- ✅ Zero null results: GraphRAG always returned at least one classification
Key Takeaways
1. GraphRAG Outperforms, Decisively
Across all three metrics, GraphRAG beats RAG by a wide margin. Most notably:
- A 26-percentage-point gain in All Hits @ 8 (83.0% vs. 56.6%)
- A nearly 20-percentage-point gain in Average Recall (91.2% vs. 71.8%)
- 100% of messages received at least one classification (vs. ~95% for RAG)
This reinforces a key belief: structure matters, especially in high-stakes, retrieval-intensive domains like healthcare.
2. Fairness ≠ Sameness
Even though we used similar models and prompting, the pipelines are inherently different. We made the comparison as fair as possible, but it's still critical to interpret the results in context.
- RAG retrieves semantically similar text from an indexed corpus and is more suitable for general knowledge tasks
- GraphRAG executes logical traversals over structured medical relationships and is designed for precise knowledge retrieval
Both generate answers, but one traverses a vetted clinical knowledge graph while the other searches for linguistic overlap.
3. No More Context Poisoning
Perhaps the most important observation isn’t about numbers. Rather, it’s about downstream safety.
- RAG is lossy: retrieved chunks can get diluted, and downstream reasoning (e.g., treatment selection) can be poisoned by hallucinated or missing classifications from earlier steps.
- GraphRAG is grounded: if a classification is retrieved, all associated treatments and procedures can be fetched 100% reliably from the graph.
This creates a safety guarantee for GraphRAG that is difficult to engineer for RAG systems.
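To illustrate that grounding argument, here is a toy sketch using `networkx`. The node names, edge relations, and `treatments_for` helper are hypothetical, not our actual IMNCI graph schema; the point is that once a classification node is known, its linked treatments are fetched by deterministic edge traversal rather than re-generated by the model.

```python
# Toy sketch (not our production schema) of deterministic downstream retrieval:
# once a classification node is identified, its treatments are reachable by
# following explicit edges in the knowledge graph.
import networkx as nx

G = nx.DiGraph()
# Illustrative IMNCI-style edges; node and relation names are hypothetical.
G.add_edge("PNEUMONIA", "Oral amoxicillin", relation="TREATED_WITH")
G.add_edge("PNEUMONIA", "Follow up in 3 days", relation="REQUIRES")
G.add_edge("SEVERE PNEUMONIA", "Refer urgently to hospital", relation="REQUIRES")

def treatments_for(classification: str) -> list[str]:
    """Every treatment/procedure linked to a classification, fetched directly
    from the graph rather than re-generated by the model."""
    return [
        target
        for _, target, data in G.out_edges(classification, data=True)
        if data["relation"] in {"TREATED_WITH", "REQUIRES"}
    ]

print(treatments_for("PNEUMONIA"))
# -> ['Oral amoxicillin', 'Follow up in 3 days']
```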
4. Generalization Holds
We previously ran the same GraphRAG pipeline on 4,000+ synthetic patient cases generated via IMNCI rules. The performance profile was strikingly similar:
- High recall
- Zero hallucinations
- Reliable grounding
This suggests that GraphRAG generalizes well across both synthetic and real-world settings.
Conclusion
While both RAG and GraphRAG are viable options for clinical decision support, our results show that structured retrieval using knowledge graphs offers clear advantages in:
- Accuracy
- Compositional reasoning
- Reliability
- Safety
The takeaway is not that one approach is "better" than the other. It’s that your data structure shapes your system’s capabilities.
In clinical decision-making, data structure is more than an implementation detail; it is a safety-critical design choice. Our findings affirm that graphs aren't just a nice-to-have abstraction. They're an enabling technology for reliable, interpretable, and high-accuracy clinical AI.
We hope this deep dive into real-world retrieval performance helps clarify the tradeoffs between unstructured and graph-structured approaches in clinical AI.
Thanks for reading! If you're working on interpretable AI, pediatric decision support, or graph-based healthcare tools, we'd love to hear from you!
⬅️ Previous: Bayesian Pattern Recognition for Real World Applications
➡️ Next up: Longitudinal Patient Care