Enhancing Maternal Healthcare: Training Language Models to Identify Urgent Messages in Real-Time
We fine-tuned the Gemma-2 2-billion parameter instruction model on a custom dataset to detect whether user messages describe urgent or non-urgent maternal healthcare issues. Our model outperforms GPT-3.5-Turbo at distinguishing urgent from non-urgent messages. Both the dataset and the model are publicly available to support further research and development in this critical area.
Introduction
Maternal health support is critical for the well-being of mothers and the health of their children. Addressing maternal health issues reduces mortality rates and promotes equity, ensuring that all women have access to the care they need, regardless of their circumstances. Ensuring access to maternal healthcare also plays a pivotal role in economic stability by enabling mothers to effectively participate in the workforce and care for their families.
MomConnect, a South African government initiative, has been a pioneer in using mobile technology to connect pregnant women and new mothers with essential healthcare information and services. By leveraging mobile phones, the initiative has been able to reach millions of women across South Africa, particularly in remote and underserved areas. In particular, MomConnect’s help desk serves as a vital resource for pregnant women and mothers with infants by providing real-time support for maternal health inquiries. Due to the high volume of messages being sent to the help desk, it is essential to prioritize urgent messages to ensure that critical needs are met without delay.
Together, IDinsight, MomConnect, and Google.org are developing a custom model that can detect when a user is sending an urgent maternal healthcare-related message. By quickly identifying and prioritizing user inquiries, the model can aid the help desk by ensuring urgent messages are promptly forwarded to the appropriate healthcare professionals for immediate intervention. Non-urgent messages receive appropriate responses based on their nature, providing users with the necessary information and support while managing the flow of inquiries. This streamlined process not only improves the efficiency of the help desk but also ensures that all users receive timely and accurate information, particularly in situations where quick responses are essential.
Using Gemini to Simulate Maternal Health Messages
Due to the sensitive nature of maternal health messages, analyzing such data with third-party APIs such as OpenAI's is not always feasible or advisable. Additionally, obtaining maternal health messages at scale from developing countries presents significant challenges, particularly regarding data privacy, which complicates the training of custom models. To address these challenges, we developed a custom synthetic dataset of user messages related to maternal health.
Our approach began with an analysis of approximately 4,000 urgent and non-urgent messages collected by MomConnect. This analysis focused on creating a pool of user personas and identifying key message characteristics, such as average message length, the use of slang or emojis, and other linguistic patterns. We conducted this analysis by randomly sampling user messages from MomConnect and utilizing Gemini-1.5-Flash to generate user personas and define these message characteristics based on natural language usage. The personas and characteristics developed included linguistic and speech patterns, message conciseness or detail, and demographic information of the user base.
With these user personas and message characteristics established, we used Gemini-1.5-Flash to generate synthetic user messages aligned with randomly selected maternal health issues sourced from the Alliance For Innovation on Maternal Health (AIM). To ensure the accuracy and relevance of these synthetic messages, Gemini-1.5-Pro was subsequently employed to verify that each generated message accurately reflected the selected health issue. Messages that failed this verification process were excluded from the final dataset, ensuring that only high-quality, relevant messages were included. Figure 1 illustrates the overall data curation process.
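To make this pipeline concrete, below is a minimal Python sketch of the generate-then-verify loop using the google-generativeai client. The prompts, persona strings, issue strings, and loop size are illustrative placeholders, not our production prompts or values.

import random

import google.generativeai as genai

# Sketch of the generate-then-verify curation loop described above.
# The personas and issues below are illustrative placeholders.
genai.configure(api_key="YOUR_API_KEY")
generator = genai.GenerativeModel("gemini-1.5-flash")
verifier = genai.GenerativeModel("gemini-1.5-pro")

personas = [
    "First-time mother in her 20s who writes short messages with slang and emojis",
    "Experienced mother who writes detailed, formal messages",
]
aim_issues = ["severe persistent headache", "reduced fetal movement"]

def generate_message(persona: str, issue: str) -> str:
    prompt = (
        f"You are simulating a maternal health help-desk user: {persona}. "
        f"Write one short message asking for help with: {issue}."
    )
    return generator.generate_content(prompt).text.strip()

def is_faithful(message: str, issue: str) -> bool:
    prompt = (
        f"Does the following message clearly describe '{issue}'? Answer YES or NO.\n\n"
        f"Message: {message}"
    )
    return verifier.generate_content(prompt).text.strip().upper().startswith("YES")

dataset = []
for _ in range(20):  # scaled up to ~13k messages in practice
    persona = random.choice(personas)
    issue = random.choice(aim_issues)
    message = generate_message(persona, issue)
    if is_faithful(message, issue):  # discard messages that fail verification
        dataset.append({"message": message, "issue": issue})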
The resulting dataset consisted of approximately 13,000 messages—~12,000 were used for training our models and ~1,000 were reserved as validation examples during the training process. Table 1 shows some examples of the generated user messages and their matching urgency rules.
Model Selection
Our objective was to develop a language model capable of being deployed on cost-effective hardware while producing output in a structured JSON format to facilitate easy extraction of answers and seamless message transmission via APIs.
To achieve this, we selected the Gemma-2 instruction-following models for fine-tuning. Gemma is a family of lightweight, state-of-the-art open models developed by Google, leveraging the same foundational research and technology that underpins Google’s Gemini models. Gemma models are particularly well-suited for a variety of text generation tasks, including question answering, reasoning, and instruction-following.
Training Our Models
The input to our model is a user message and a list of urgency rules:
user_message = "If my newborn can't able to breathe what can i do"
urgency_rules = [
    "NOT URGENT",
    "Bleeding from the vagina",
    "Bad tummy pain",
    "Bad headache that won’t go away",
    "Changes to vision",
    "Trouble breathing",
    "Hot or very cold, and very weak",
    "Fits or uncontrolled shaking",
    "Baby moves less",
    "Fluid from the vagina",
    "Feeding problems",
    "Fits or uncontrolled shaking",
    "Fast, slow or difficult breathing",
    "Too hot or cold",
    "Baby’s colour changes",
    "Vomiting and watery poo",
    "Infected belly button",
    "Swollen or infected eyes",
    "Bulging or sunken soft spot",
]
Given these inputs, the Gemma-2 model produces JSON output with the following schema:
{
    "best_matching_rule": "The rule that best matches the user message.",
    "probability": "A probability between 0 and 1, in increments of 0.05, that any part of the user message matches one of the urgency rules.",
    "reason": "The reason for selecting the best matching rule and probability."
}
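As an illustration of how these inputs and outputs fit together at inference time, here is a hedged sketch using the public Gemma-2 2B instruct checkpoint via Hugging Face transformers. The instruction wording is a stand-in for our actual template, and the sketch assumes the model emits well-formed JSON.

import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID; substitute the fine-tuned checkpoint in practice.
model_id = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative instruction wording, not our exact fine-tuning template.
prompt = (
    "Match the user message against the urgency rules and respond with JSON "
    "containing best_matching_rule, probability, and reason.\n"
    f"Rules: {json.dumps(urgency_rules)}\n"
    f"Message: {user_message}"
)
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
completion = tokenizer.decode(
    output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True
)
result = json.loads(completion)  # assumes well-formed JSON output
print(result["best_matching_rule"], result["probability"])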
The Gemma-2 models were trained with the Hugging Face libraries on Google Cloud Platform compute instances equipped with NVIDIA A100 80GB GPUs, using parameter-efficient fine-tuning and 8-bit quantization. Each training run took 36 to 48 GPU hours. Our training hyperparameters are as follows (a configuration sketch follows the list):
- Per device batch size: 16
- Number of training epochs: 3
- Optimizer: Paged AdamW 8-bit
- Warmup ratio: 0.03
- LoRA r: 64
- LoRA alpha: 32
- LoRA dropout: 0.05
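The following is a minimal sketch of this setup using transformers, peft, and trl. The dataset files and their "text" column are placeholders, and SFTTrainer's exact signature varies across trl versions.

from datasets import load_dataset
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

model_id = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit quantization
    device_map="auto",
)

# Hypothetical dataset files with a preformatted "text" column
# (instruction + urgency rules + target JSON answer).
data = load_dataset(
    "json", data_files={"train": "train.jsonl", "validation": "val.jsonl"}
)

trainer = SFTTrainer(
    model=model,
    args=TrainingArguments(
        output_dir="gemma2-urgency",
        per_device_train_batch_size=16,
        num_train_epochs=3,
        optim="paged_adamw_8bit",  # Paged AdamW 8-bit
        warmup_ratio=0.03,
    ),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    peft_config=LoraConfig(r=64, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"),
    tokenizer=tokenizer,
    dataset_text_field="text",
)
trainer.train()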
Results
We benchmarked our fine-tuned model against OpenAI's GPT family of models. All models receive the same instruction format and inputs and are asked to produce the same JSON output. We evaluate the models on a held-out test set of 3,000 messages. Since most real user messages are non-urgent, the test set is imbalanced, so we report both accuracy and Area Under the Curve (AUC). The main results are shown in Table 2 below. As a baseline, we also include results from the base Gemma-2 2-billion instruction model (i.e., without any fine-tuning).
For accuracy, we extract the value of the best_matching_rule key in the JSON output and compare it against the ground-truth label. The ground-truth labels are binary Yes/No labels in this case. Thus, an answer is deemed accurate if the best matching rule is one of the provided urgency rules and the ground-truth label is Yes (regardless of which urgency rule it matches), or if the best matching rule is NOT URGENT and the ground-truth label is No. For AUC, we use the value of the probability key in the JSON output as a proxy for the model's confidence that the message is urgent. The Receiver Operating Characteristic (ROC) curves for the different models are shown in Fig. 2.
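This scoring logic reduces to mapping each JSON answer to a binary prediction and a confidence score. Here is a small sketch using scikit-learn, where records is a hypothetical list pairing each raw model completion with its Yes/No ground-truth label:

import json

from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical evaluation records: raw completions plus ground-truth labels.
records = [
    {"completion": '{"best_matching_rule": "Trouble breathing", "probability": 0.95, "reason": "..."}', "label": "Yes"},
    {"completion": '{"best_matching_rule": "NOT URGENT", "probability": 0.05, "reason": "..."}', "label": "No"},
]

def predicted_urgent(completion: str) -> int:
    # Any rule other than NOT URGENT counts as an urgent prediction,
    # regardless of which urgency rule was matched.
    return int(json.loads(completion)["best_matching_rule"] != "NOT URGENT")

y_true = [int(r["label"] == "Yes") for r in records]
y_pred = [predicted_urgent(r["completion"]) for r in records]
y_score = [json.loads(r["completion"])["probability"] for r in records]

print("accuracy:", accuracy_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))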
Detailed instructions on how to set up the model for inference are provided here.
What’s Next?
Training a multi-billion parameter model comes with many challenges. Data curation is cheap (we spent about $50 USD in total, including experiments with different curation procedures) but difficult to get right. Model training also requires expensive hardware and is time-consuming, making it tedious to optimize the various training hyperparameters. We can improve upon our results in three ways:
- Larger, more diverse training dataset for supervised fine-tuning
- Including ambiguous messages for preference optimization
- Fine-tuning the Gemma-2 9-billion variant
Teaching a large language model to reason effectively and follow instructions requires a large amount of diverse, high-quality training data. Our data curation process generated ~13k examples in total, of which only ~12k were used for training. Due to the relatively small training set, overfitting occurred after ~2.5 epochs. Although our 2-billion parameter model exceeds GPT-3.5-Turbo's performance, it still lags behind GPT-4o and GPT-4o-mini. We would like the Gemma-2 model to match the performance of GPT-4o-mini; a larger, more diverse training set can help in this case.
Our data curation process naturally biases toward messages that have a high probability of being urgent or non-urgent, since the language model that verifies the synthetically generated user messages is instructed to ensure that each message aligns with its selected urgency rule with high confidence (NOT URGENT is also one of the rules). Manual inspection of our model's incorrect predictions revealed several cases where the user message is ambiguous as to whether it constitutes an urgent or non-urgent maternal health issue. These are instances where a human operator typically makes a judgement call on the fly. For these cases, we can teach the model to prefer one output over another using training techniques such as Direct Preference Optimization (DPO). However, our synthetic dataset may lack sufficient diversity for DPO. Thus, we need to construct pairs of synthetic messages with high/low confidence for a given rule so that preference fine-tuning can properly follow supervised fine-tuning.
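For illustration, here is a minimal DPO sketch using trl. The ambiguous example message, the chosen/rejected answers, and the checkpoint path are all hypothetical, and DPOTrainer's exact signature varies across trl versions.

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholder path to the supervised fine-tuned checkpoint.
sft_path = "gemma2-urgency"
tokenizer = AutoTokenizer.from_pretrained(sft_path)
model = AutoModelForCausalLM.from_pretrained(sft_path, device_map="auto")

# Hypothetical preference pair for an ambiguous message: we prefer the
# cautious "urgent" reading over the dismissive one.
preference_data = Dataset.from_list([
    {
        "prompt": "Message: my baby has been sleeping much more than usual today",
        "chosen": '{"best_matching_rule": "Baby moves less", "probability": 0.6, "reason": "Unusual sleepiness can signal reduced movement."}',
        "rejected": '{"best_matching_rule": "NOT URGENT", "probability": 0.1, "reason": "Extra sleep is normal."}',
    }
])

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="gemma2-urgency-dpo", beta=0.1),
    train_dataset=preference_data,
    tokenizer=tokenizer,
)
trainer.train()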
Finally, a larger supervised fine-tuning dataset, coupled with a preference dataset, will allow us to experiment with the Gemma-2 9-billion instruction model. There is evidence that larger language models exhibit emergent capabilities that allow them to reason more effectively over complex instructions. Hosting a 9-billion parameter model for inference is challenging, but quantization and knowledge distillation techniques can help reduce model size and associated costs.
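As a rough illustration, loading the public 9B instruct checkpoint in 8-bit via bitsandbytes roughly halves its memory footprint relative to bfloat16:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# ~9B parameters: roughly 18 GB in bf16 vs. roughly 9 GB in 8-bit.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)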
Conclusion
By integrating advanced technology with community-based healthcare, the collaboration between IDinsight, MomConnect, and Google.org is setting a new standard for maternal health support, with the potential to be replicated in other regions facing similar challenges.
Through this initiative, we aim to improve the responsiveness and effectiveness of maternal healthcare services by leveraging AI to identify and prioritize urgent messages. Our fine-tuned models not only preserve data privacy but also provide a scalable solution to address the challenges of maternal health in developing regions.
We invite you to explore the publicly available dataset and model, and we look forward to further collaborations that can help advance maternal healthcare globally.
Acknowledgements
We are grateful to Google.org for funding this project through the AI for Globals Grant and for providing valuable insights throughout the process.