Measuring the reliability of LLM-based clinical data extraction
DryLabz has published its first scientific paper.
https://arxiv.org/abs/2606.05970
The work started from a simple but important question: if a large language model reads clinical notes and converts them into structured data, how stable are the results?
Hospitals hold large amounts of valuable information in discharge summaries, reports, PDFs, and narrative notes. Much of this information remains difficult to use for analytics, prediction, quality improvement, or clinical decision support because it is stored as free text.
Large language models can help convert this text into structured fields. But for clinical use, the harder question is whether these structured outputs remain stable when the extraction setup changes.
In our study, we tested this on clinical discharge summaries. We ran the same extraction task with different prompt wordings, different model sizes, and different schema choices, then measured how much the outputs changed.
One of the clearest findings was that model choice affected the primary admission category more than prompt wording did. Changing the model reassigned the primary category in close to half of the notes, while rewording the prompt changed it in roughly one in eight.
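A comparison of this kind can be reduced to a simple reassignment rate: the share of notes whose primary category differs between two runs. The sketch below is a minimal illustration, not the metric defined in the paper. The function name, note IDs, and category labels are hypothetical, and the toy values are invented for the example and are not study results.

```python
from typing import Mapping


def reassignment_rate(run_a: Mapping[str, str], run_b: Mapping[str, str]) -> float:
    """Fraction of shared notes whose primary category differs between two runs."""
    shared = run_a.keys() & run_b.keys()
    if not shared:
        raise ValueError("the two runs share no note IDs")
    changed = sum(1 for note_id in shared if run_a[note_id] != run_b[note_id])
    return changed / len(shared)


# Toy data only: note ID -> primary admission category (hypothetical labels).
baseline = {"n1": "cardiac", "n2": "renal", "n3": "infection", "n4": "cardiac"}
reworded_prompt = {"n1": "cardiac", "n2": "renal", "n3": "infection", "n4": "respiratory"}
other_model = {"n1": "respiratory", "n2": "renal", "n3": "cardiac", "n4": "cardiac"}

prompt_rate = reassignment_rate(baseline, reworded_prompt)
model_rate = reassignment_rate(baseline, other_model)
print(f"prompt change: {prompt_rate:.2f}, model change: {model_rate:.2f}")
```

Computing the same rate separately for each kind of change (prompt, model, schema) makes it possible to see which configuration choice moves the output most.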
We also observed a recurring issue around the difference between “no” and “not documented.” A note stating “no kidney injury” carries different information from a note that simply does not mention kidney injury. If an AI system treats both as equivalent, it can create false certainty from missing documentation.
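One way to keep the two cases apart is to give each extracted field three possible states instead of a boolean. The sketch below assumes a hypothetical `Finding` enum and a hypothetical set of raw string values; it is an illustration of the idea, not the schema used in the study.

```python
from enum import Enum
from typing import Optional


class Finding(Enum):
    PRESENT = "present"                # the note states the condition is present
    ABSENT = "absent"                  # the note explicitly rules it out
    NOT_DOCUMENTED = "not_documented"  # the note says nothing either way


def to_finding(raw: Optional[str]) -> Finding:
    """Map a raw extracted value to three states.

    Missing, empty, or unrecognized values stay NOT_DOCUMENTED and are
    never silently converted to ABSENT.
    """
    if raw is None:
        return Finding.NOT_DOCUMENTED
    value = raw.strip().lower()
    if value in {"yes", "present", "true"}:
        return Finding.PRESENT
    if value in {"no", "absent", "false"}:
        return Finding.ABSENT
    return Finding.NOT_DOCUMENTED


print(to_finding("no"))   # explicit negation in the note
print(to_finding(None))   # field missing from the extraction
```

With a boolean field, both calls above would collapse to `False`, which is exactly the false certainty described here.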
This matters for clinical data pipelines. The output may look clean and structured, while still depending on hidden configuration choices such as prompt version, model version, and schema design.
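A simple safeguard is to store the configuration alongside every extracted record, so that a change in prompt, model, or schema is visible in the data itself. The sketch below is one possible approach under stated assumptions: the `ExtractionConfig` class, its field names, the `_config` keys, and the version strings are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExtractionConfig:
    prompt_version: str
    model_version: str
    schema_version: str

    def fingerprint(self) -> str:
        """Short, deterministic hash of the configuration."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def tag_record(record: dict, config: ExtractionConfig) -> dict:
    """Return a copy of the record annotated with the configuration that produced it."""
    return {**record, "_config": asdict(config), "_config_id": config.fingerprint()}


# Hypothetical version labels, for illustration only.
config = ExtractionConfig(prompt_version="p3", model_version="model-a", schema_version="s1")
tagged = tag_record({"note_id": "n1", "primary_category": "cardiac"}, config)
print(tagged["_config_id"])
```

Records with different `_config_id` values should not be pooled without first checking how much the outputs differ between those configurations.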
Our paper proposes a method to measure how much the extracted values shift under such configuration changes before structured LLM outputs are used for analytics, prediction, or clinical workflows.