Evaluating GraphRAG vs. RAG on Real-World Messages

Graph-based Healthcare Series — 6

This is the sixth post in our ongoing series on graph-based healthcare tools. Stay tuned for upcoming entries on clinical modeling, decision support systems, and graph-powered AI assistants.

In our previous post, we introduced a Bayesian diagnostic engine that uses synthetic patient data to quantify clinical evidence, score conditions, and update probabilities in a way that mirrors how clinicians think.

In this post, we zoom out from system architecture and generative modeling to answer a practical question:

How does GraphRAG perform on real patient messages compared to a traditional RAG system?

To evaluate this, we tested both systems on more than 300 real-world caregiver messages from Last Mile Health (LMH), labeled and scored by GPT-4o. The results show the strengths, weaknesses, and tradeoffs of graph-structured retrieval in clinical QA tasks.