Simulating Real-World Pediatric Encounters Using Large Language Models
📚 View all posts in the Graph-based Healthcare Series
Graph-based Healthcare Series — 4
This is the fourth post in an ongoing series on graph-based healthcare tools. Stay tuned for upcoming entries on clinical modeling, decision support systems, and graph-powered AI assistants.
In our previous post, we demonstrated how agentic flows can transform diagnosis from a reactive retrieval task into a guided, context-aware reasoning process. By orchestrating modular assistants, tracking physician intent, and dynamically adapting based on feedback, we built a collaborative diagnostic experience that’s explainable, flexible, and clinically grounded.
In this post, we shift focus to the synthetic data generation side of the equation. We detail the steps taken to generate a diverse set of synthetic patient cases—each featuring unique symptoms, conditions, and diagnostic paths. These examples simulate a wide range of realistic clinical scenarios, laying the foundation for applying Bayesian pattern recognition methods to richly structured, verifiable patient data.
Why Synthetic Patient Data?
Building high-quality diagnostic systems often requires access to richly labeled, diverse, and private clinical data—something difficult to obtain in real-world healthcare settings. Synthetic patient data offers several benefits:
- Privacy preservation: Enables open experimentation without compromising patient confidentiality.
- Data diversity: Rare cases or underrepresented presentations can be simulated and upsampled.
- Regulatory alignment: Supports development in environments with strict data access controls.
- Training augmentation: Fills gaps where real patient records are limited or noisy.
- Controlled evaluation: Synthetic data allows precise control over edge cases, uncertainty levels, and style variation, making it easier to benchmark model performance.
By generating synthetic data grounded in the Integrated Management of Neonatal and Childhood Illness (IMNCI) protocols, we preserve clinical realism while enabling large-scale experimentation in probabilistic reasoning.
Generating Synthetic Data with the Help of LLMs
We use the gemini-2.5-pro-preview-03-25 model to generate synthetic patient cases based on IMNCI graph data. Generation is controlled via the following configuration:
{
"max_output_tokens": 8192,
"model": "gemini-2.5-pro-preview-03-25",
"n": 1,
"temperature": 0.7,
"top_p": 0.9
}
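For readers who want to reproduce this step, here is a minimal sketch of how the configuration above could be passed to the model, assuming the google-generativeai Python SDK (client setup, prompt assembly, and error handling are simplified; "n" maps to candidate_count):

import google.generativeai as genai

# Assumption: the API key is supplied via environment or configuration.
genai.configure(api_key="YOUR_API_KEY")

# Mirrors the generation configuration shown above ("n" maps to candidate_count).
generation_config = {
    "max_output_tokens": 8192,
    "temperature": 0.7,
    "top_p": 0.9,
    "candidate_count": 1,
}

model = genai.GenerativeModel("gemini-2.5-pro-preview-03-25")

def generate_case(prompt: str) -> str:
    """Send one IMNCI-grounded prompt and return the raw model output."""
    response = model.generate_content(prompt, generation_config=generation_config)
    return response.text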
The input to the model consists of the structured IMNCI data, including the complete set of classifications, conditions, and associated observations for each age group. Classification nodes are randomly sampled during generation to ensure diagnostic diversity and balanced representation across clinical categories.
Key Parameters for Diverse Case Generation
To create a realistic and varied dataset, we control 16 generation parameters per sample. These dimensions influence everything from tone and urgency to linguistic style and diagnostic complexity.
The following table summarizes the parameters used to vary case structure, style, and diagnostic content.
Parameter Overview: Controlling Case Diversity
| Category | Parameter | Description | Example Values |
|---|---|---|---|
| Patient Context | age_group | Age of the patient | Birth to 2 Months, 2 Months to 5 Years |
| | clinic_type | Resource level and clinical setting | rural_post, district_hospital, emergency_center |
| | physician_experience_level | Style and tone of documentation | junior (shorter notes, more uncertainty, casual phrasing), senior (structured notes, richer details, confident phrasing) |
| Note Complexity | entangled | Whether to combine multiple diagnoses into one note | True, False |
| | min_conditions | Minimum number of diagnoses | 1 |
| | max_conditions | Maximum number of diagnoses | 2, 3, or 4 (weighted random) |
| | force_category_diversity | Force selection from different diagnostic categories within one observation note (e.g., Dehydration, Jaundice, Birth Asphyxia, HIV, TB) | True, False |
| Language Variability | length_variability | Sentence/paragraph length diversity | tight (1–3 sentences), wide (1–8 sentences) |
| | style | Style and form of clinical language | gold (polished, structured, grammatically complete, precise), noisy (rushed, minimal grammar, typos, missing measurements), interrupted (sentences may trail off, sudden breaks allowed), sms (heavy use of abbreviations, text-style, lower-case), mixed (multiple styles mixed within one note) |
| | noise_injection_strength | Controls additional grammatical errors, word dropouts, wrong abbreviations, and interruptions | 0.0 (perfect) to 1.0 (very noisy) |
| Diagnostic Complexity | normal_visit_probability | Chance that the observation note describes a normal patient visit (e.g., healthy feeding, breathing well, no complications) | 0.0 – 1.0 |
| | rare_case_probability | Probability of a rare diagnosis appearing | 0.0 – 1.0 |
| | include_uncertainty | Introduce realistic clinical uncertainty (e.g., "mother unsure", "device unavailable") | True, False |
| | specificity | Level of measurement and phrasing precision | low ("breathing fast"), medium ("maybe 60 bpm"), high ("respiratory rate = 65 bpm") |
| Human-Centric Detail | caregiver_quotes | Whether caregiver quotes are included (e.g., "Mother says baby feeding poorly.") | True, False |
| | urgency_level | Tone of the case: routine or emergency | normal (routine documentation flow), high (life-threatening situations, rushed priorities, focus only on critical signs) |
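To make the sampling step concrete, the following sketch shows how a per-sample generation config might be drawn at random. The parameter names follow the table above; the specific weights and value ranges are illustrative assumptions, not the exact distributions used in our pipeline.

import random

def sample_generation_config(age_group: str) -> dict:
    """Draw one randomized generation config (weights and ranges are illustrative)."""
    return {
        "age_group": age_group,
        "clinic_type": random.choice(["rural_post", "district_hospital", "emergency_center"]),
        "physician_experience_level": random.choice(["junior", "senior"]),
        "entangled": random.random() < 0.5,
        "min_conditions": 1,
        # Weighted random choice for the maximum number of diagnoses.
        "max_conditions": random.choices([2, 3, 4], weights=[0.5, 0.3, 0.2])[0],
        "force_category_diversity": random.random() < 0.5,
        "length_variability": random.choice(["tight", "wide"]),
        "style": random.choice(["gold", "noisy", "interrupted", "sms", "mixed"]),
        "noise_injection_strength": round(random.uniform(0.0, 1.0), 2),
        "normal_visit_probability": round(random.uniform(0.0, 1.0), 2),
        "rare_case_probability": round(random.uniform(0.0, 0.2), 2),
        "include_uncertainty": random.random() < 0.5,
        "specificity": random.choice(["low", "medium", "high"]),
        "caregiver_quotes": random.random() < 0.5,
        "urgency_level": random.choice(["normal", "high"]),
    }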
Each data sample is returned in the following structured JSON format:
{
"observation_note": "The physician's observation note goes here. Should be natural, human clinical language, not bullet points or structured fields.",
"explanation": "A concise and detailed explanation of the key observations and reasoning behind the observation note.",
"ground_truth_classifications": [
"List containing the **exact medical classification name(s)** from the provided JSON that were used to generate the observation note."
],
"ground_truth_conditions": [
"List containing the **exact medical condition name(s)** from the provided JSON corresponding to ground_truth_classifications."
],
"ground_truth_observations": [
"List containing the **exact patient observation(s)** from the provided JSON corresponding to ground_truth_classifications."
]
}
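One convenient way to parse and type-check this output is a small pydantic model; the sketch below simply mirrors the schema above (using pydantic here is an illustration, not a statement about how the pipeline itself is implemented):

from pydantic import BaseModel

class SyntheticCase(BaseModel):
    """Typed view of one generated sample; fields mirror the JSON schema above."""
    observation_note: str
    explanation: str
    ground_truth_classifications: list[str]
    ground_truth_conditions: list[str]
    ground_truth_observations: list[str]

# Usage: raw_json holds the text returned by the generator.
# case = SyntheticCase.model_validate_json(raw_json)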
Examples of Generated Synthetic Cases
{"observation_note": "Neonate appears small for age, visually estimated wt around 2.1 kg. Mother states infrequent feeding, maybe 6 times yesterday. Attachment poor during brief observation.", "explanation": "The observation note reflects a senior physician's assessment in a rural post with limited resources. It combines two related issues (low birth weight and feeding problem) as per 'entangled=True'. The specificity is high regarding estimated weight and feed frequency, despite the lack of a scale ('rural_post'). The 'mixed' style and moderate 'noise_injection_strength' are represented by the abbreviation ('wt'), slightly informal phrasing ('maybe 6 times'), and concise structure. The length is 'tight' (2 sentences). No uncertainty is included, and it's not a normal visit or rare case based on the probabilities.", "ground_truth_classifications": ["LOW BIRTH WEIGHT AND/OR PRETERM", "FEEDING PROBLEM OR UNDERWEIGHT - BREASTFEEDING"], "ground_truth_conditions": ["Birth Weight and Gestational Age (<7 Days Old)", "Feeding Problem or Underweight - Breastfeeding"], "ground_truth_observations": ["Weight 1,500 - 2,500 gm", "Less than 8 breastfeeds in 24 hours", "Not well attached to breast"], "generation_config": {"age_group": "birth to 2 months", "caregiver_quotes": false, "clinic_type": "rural_post", "entangled": true, "force_category_diversity": true, "include_uncertainty": false, "length_variability": "tight", "max_conditions": 2, "min_conditions": 1, "noise_injection_strength": 0.48, "normal_visit_probability": 0.47, "physician_experience_level": "senior", "rare_case_probability": 0.07, "specificity": "high", "style": "mixed", "urgency_level": "normal"}}
{"observation_note": "Infant brought in, v irritable and crying nonstop. Mum reports lots of watery stool, maybe saw some blood but shes not sure. Skin pinch test seems a bit slow to go back?? looks quite unwell. Needs urgent attention.", "explanation": "The note describes a young infant presenting with symptoms suggestive of dysentery. Observations include significant irritability, reported diarrhea with possible blood (uncertainty included), and potentially slow skin turgor (medium specificity, uncertainty). The tone reflects high urgency and a junior physician's perspective in a district hospital setting, with mixed sentence structure and minor grammatical imperfections consistent with the 'mixed' style and noise level.", "ground_truth_classifications": ["DYSENTERY"], "ground_truth_conditions": ["Dysentery"], "ground_truth_observations": ["Restless, irritable", "Blood in stool", "Skin pinch goes back slowly"], "generation_config": {"age_group": "birth to 2 months", "caregiver_quotes": false, "clinic_type": "district_hospital", "entangled": false, "force_category_diversity": true, "include_uncertainty": true, "length_variability": "wide", "max_conditions": 3, "min_conditions": 1, "noise_injection_strength": 0.45, "normal_visit_probability": 0.46, "physician_experience_level": "junior", "rare_case_probability": 0.14, "specificity": "medium", "style": "mixed", "urgency_level": "high"}}
{"observation_note": "Baby looks a bit yellow... skin mostly maybe eyes too? Mother says baby is... maybe 10 days old? Hard to tell exactly. Feeding seems... not great. Attachment looks off, baby keeps slipping off the breast. And uhm... how many feeds in 24hrs? Mum not sure... maybe 5 or 6? seems low. Skin turgor ok though. No scale here so cant check weight.", "explanation": "The note describes a young infant presenting with yellowish skin and eyes, consistent with jaundice, estimated age around 10 days. It also notes feeding difficulties, specifically poor attachment to the breast and possibly infrequent feeding (mother uncertain, maybe 5-6 times/day), suggesting a feeding problem. The style is interrupted, reflecting a potentially rushed or distracted documentation process typical of a junior physician in a rural post with limited resources (no scale mentioned). Uncertainty is included regarding the exact age and feeding frequency. Specificity is medium ('a bit yellow', 'maybe 10 days', 'maybe 5 or 6 feeds'). Noise (0.35) is reflected in minor grammatical issues ('cant check weight'). Two conditions (Jaundice and Feeding Problem) are entangled.", "ground_truth_classifications": ["JAUNDICE", "FEEDING PROBLEM OR UNDERWEIGHT - BREASTFEEDING"], "ground_truth_conditions": ["Jaundice", "Feeding Problem or Underweight - Breastfeeding"], "ground_truth_observations": ["Only skin on the face or eyes yellow", "Infant aged 24 hrs-14 days old", "Not well attached to breast", "Less than 8 breastfeeds in 24 hours"], "generation_config": {"age_group": "birth to 2 months", "caregiver_quotes": false, "clinic_type": "rural_post", "entangled": true, "force_category_diversity": false, "include_uncertainty": true, "length_variability": "wide", "max_conditions": 2, "min_conditions": 1, "noise_injection_strength": 0.35, "normal_visit_probability": 0.45, "physician_experience_level": "junior", "rare_case_probability": 0.05, "specificity": "medium", "style": "interrupted", "urgency_level": "normal"}}
These examples demonstrate diversity in case severity, narrative style, physician voice, and patient background—while always grounding observations in IMNCI-defined classifications.
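Because each sample is a single JSON object, the dataset is convenient to store and load as JSON Lines; for example (the file name is illustrative):

import json

# One JSON object per line; the file name is illustrative.
with open("synthetic_cases.jsonl", encoding="utf-8") as f:
    cases = [json.loads(line) for line in f if line.strip()]

print(len(cases), cases[0]["ground_truth_classifications"])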
Verifying Synthetic Data Using Bayesian Truth Serum
To ensure quality and label fidelity, we apply a two-step validation process.
Step 1: Structural Ground-Truth Checks
During generation, each sample is immediately checked to ensure consistency between the observation note and the IMNCI-derived ground truth labels (classifications, conditions, and observations). Invalid mappings are rejected on the spot.
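At a minimum, this check reduces to set-membership tests against the IMNCI reference data. A sketch, assuming the IMNCI graph has been flattened into lookup sets of valid names:

def is_structurally_valid(case: dict, imnci: dict) -> bool:
    """Reject samples whose labels are not present in the IMNCI reference data.

    Assumes `imnci` holds flattened lookup sets, e.g.
    {"classifications": {...}, "conditions": {...}, "observations": {...}}.
    """
    return (
        all(c in imnci["classifications"] for c in case["ground_truth_classifications"])
        and all(c in imnci["conditions"] for c in case["ground_truth_conditions"])
        and all(o in imnci["observations"] for o in case["ground_truth_observations"])
    )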
Step 2: Multi-Agent Validation Using LLMs
We then run a second validation stage inspired by the Bayesian Truth Serum (BTS). BTS is a mechanism to elicit truthful subjective judgments—even when the "ground truth" is ambiguous—by rewarding responses that are more common than predicted by peers.
To operationalize BTS in an LLM setting, we prompt two verifier agents to evaluate each case—each under a different assumption about peer consensus. The twist: each verifier is told how its peers responded.
- Verifier A is told: "Only 1 out of 10 of your peers agreed with you. 9 disagreed and offered alternatives."
- Verifier B is told: "9 out of 10 peers agreed with your conclusion. 1 dissenting peer offered alternatives."
Each verifier is asked either to stick with its original classification or to revise it in light of the reported peer disagreement, and to explain its reasoning using the following JSON response format:
{
"bayesian_truth_serum_decision": "stick" | "revise",
"bayesian_truth_serum_reasoning": "Explain in 2–3 sentences why you chose to **stick to your guns** or why you decided to **revise your conclusions after considering the peer disagreement.**",
"issues_found": [
"**If you decided to revise**, list any issues you found in your own conclusions (e.g., missing symptoms, misclassification, missing conditions, wrong observations, etc.). Otherwise, set this key to an empty list []."
],
"alternative_suggestions": [
{
"classification_name": "An IMNCI medical classification that you believe better fits the patient based on the peer disagreement and after reviewing the provided IMNCI protocols.",
"condition_name": "The IMNCI medical condition name that corresponds to classification_name (note that this value must also come from the provided IMNCI protocols).",
},
...
]
}
We categorize samples based on verification outcomes:
- Both verifiers agree with the generator → High-confidence samples
- Both verifiers disagree with the generator → Alternative-path samples
- Verifiers disagree with each other → Ambiguous “hard” examples for model training
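A minimal sketch of this triage step, assuming each verifier's final conclusion has already been compared against the generator's labels and reduced to a boolean (the bucket names are illustrative):

def categorize_sample(verifier_a_agrees: bool, verifier_b_agrees: bool) -> str:
    """Bucket a sample by how the two verifiers relate to the generator's labels."""
    if verifier_a_agrees and verifier_b_agrees:
        return "high_confidence"
    if not verifier_a_agrees and not verifier_b_agrees:
        return "alternative_path"
    return "ambiguous_hard_example"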
This BTS-inspired process gives us not only high-quality synthetic cases, but also metadata about label certainty and diagnostic ambiguity—critical for downstream inference.
Augmenting IMNCI Data Using Verified Synthetic Data
The verified synthetic samples serve as a powerful extension of the original IMNCI dataset. Each one is grounded in clinical logic, structurally validated, and semantically evaluated. These samples enable us to:
- Enrich sparse or abstract IMNCI observations by grounding them in natural language keyphrases and note patterns extracted from the synthetic data
- Populate new nodes and relationships in our graph data model with plausible, complex, real-world patient cases
- Train retrieval or classification models with controlled uncertainty
- Simulate edge cases or decision boundaries that the original data omits
By incorporating this dataset into our knowledge graph and pattern recognition tools, we can model a more robust, data-rich picture of pediatric patient care.
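As a rough illustration of the graph side, a verified case could be merged into the graph along these lines, assuming a Neo4j store and a deliberately simplified node and relationship schema (both are assumptions for this sketch, not our exact data model):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical schema: SyntheticCase nodes linked to existing Classification nodes.
MERGE_CASE = """
MERGE (c:SyntheticCase {note: $note})
WITH c
UNWIND $classifications AS name
MATCH (cl:Classification {name: name})
MERGE (c)-[:EVIDENCE_FOR]->(cl)
"""

def add_verified_case(case: dict) -> None:
    """Attach one verified synthetic case to its IMNCI classification nodes."""
    with driver.session() as session:
        session.run(
            MERGE_CASE,
            note=case["observation_note"],
            classifications=case["ground_truth_classifications"],
        )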
Limitations on Synthetic Note Quality and Future Work
While this synthetic data pipeline offers strong alignment with IMNCI logic and rich control over case diversity, several limitations remain:
- Real-world fidelity: LLMs may occasionally over-generalize or introduce subtle inconsistencies not present in expert-authored notes.
- Bias amplification: If the IMNCI graph omits certain nuances or edge cases, these limitations may propagate through the generator.
- Verifier assumptions: The BTS-based LLM verification relies on simulated consensus; human validation remains ideal for deployment-grade datasets.
Future work will include benchmarking synthetic cases against real pediatric notes (where available), incorporating feedback from clinical experts, and expanding our graph to capture evolving medical logic.
What's Next for Patient Diagnosis
With a curated library of verified synthetic encounters in place, we’re now ready to explore Bayesian pattern recognition as the next building block for diagnostic reasoning. This will allow us to:
- Identify latent patterns across patient presentations
- Score likely diagnoses probabilistically
- Surface ambiguous or high-risk cases for extra scrutiny
In the next post, we’ll walk through how we apply Bayesian methods to learn structure from the synthetic dataset, and how this enables scalable, uncertainty-aware clinical reasoning in dynamic environments.
Thanks for reading! If you're working with real-world patient data or interested in generating or validating synthetic clinical content, we'd love to hear from you!
⬅️ Previous: Managing Agentic Flows with Pydantic Graph
➡️ Next up: Bayesian Pattern Recognition for Real World Applications