08/01/2024

Healthcare professionals have increasingly been exploring the potential of Large Language Models (LLMs) like GPT-4 to revolutionize aspects of patient care, from streamlining administrative tasks to enhancing clinical decision-making.

Despite this potential, language models can encode societal biases that disproportionately affect historically marginalized groups.

A recent study published in The Lancet Digital Health has shed light on potential pitfalls, emphasizing the need for cautious integration into healthcare settings.

The study

In this study, researchers delved into whether GPT-4 harbors racial and gender biases that could adversely impact its utility in healthcare applications. Using the Azure OpenAI interface, the team assessed four key aspects: medical education, diagnostic reasoning, clinical plan generation, and subjective patient assessment.
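
To make the set-up concrete, a single query of this kind might look like the sketch below. This is not the authors' code: the deployment name, system prompt, and sampling temperature are assumptions for illustration; only the general Azure OpenAI chat-completions call pattern is taken as given.

```python
# Minimal sketch of the kind of query used in such an evaluation (not the
# study's actual code). Assumes the `openai` Python package (>= 1.0) and an
# Azure OpenAI resource; the deployment name "gpt-4" and the prompt wording
# are illustrative placeholders.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

def generate_case(diagnosis: str) -> str:
    """Ask the model for a one-paragraph patient presentation for a diagnosis."""
    response = client.chat.completions.create(
        model="gpt-4",  # name of the Azure deployment (assumption)
        temperature=1.0,  # sample many cases to estimate demographic distributions
        messages=[
            {"role": "system",
             "content": "You are a medical educator writing teaching cases."},
            {"role": "user",
             "content": f"Write a one-paragraph patient presentation for a case of "
                        f"{diagnosis}, including age, gender, and race/ethnicity."},
        ],
    )
    return response.choices[0].message.content

print(generate_case("sarcoidosis"))
```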

To simulate real-world scenarios, researchers employed clinical vignettes from NEJM Healer and drew from existing research on implicit bias in healthcare. The study aimed to gauge how GPT-4’s estimations aligned with the actual demographic distribution of medical conditions, comparing the model’s outputs with true prevalence estimates in the United States.
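
Aligning the model's output with prevalence data amounts to a goodness-of-fit comparison. The fragment below is a minimal, hypothetical version of that check; the counts and prevalence figures are invented for illustration and do not come from the study.

```python
# Illustrative comparison of the demographic mix in generated vignettes with a
# reference prevalence distribution (the numbers below are made up, not the
# study's data). Uses a chi-square goodness-of-fit test from SciPy.
from scipy.stats import chisquare

# Counts of a demographic attribute across, say, 100 generated vignettes.
generated_counts = {"female": 22, "male": 78}

# Assumed share of U.S. cases for the same condition.
true_prevalence = {"female": 0.45, "male": 0.55}

total = sum(generated_counts.values())
observed = [generated_counts[g] for g in true_prevalence]
expected = [true_prevalence[g] * total for g in true_prevalence]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
# A small p-value indicates the generated cases deviate from true prevalence.
```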

The evaluation was organized into three sets of experiments covering these aspects:

  • Simulating Patients for Medical Education:
    • GPT-4 was evaluated for creating patient presentations based on specific diagnoses, revealing biases in demographic portrayals.
    • The analysis included 18 diagnoses, assessing GPT-4’s ability to model demographic diversity and comparing the generated cases with true prevalence estimates.
    • Various prompts and geographical factors were considered, and strategies for de-biasing prompts were explored.
  • Constructing Differential Diagnoses and Clinical Treatment Plans:
    • GPT-4’s response to medical education cases was analyzed, evaluating the impact of demographics on diagnostic and treatment recommendations.
    • Cases from NEJM Healer and additional scenarios were used, examining the effect of gender and race on GPT-4’s outputs.
    • Two specific cases, acute dyspnea and pharyngitis in a sexually active teenager, underwent a more in-depth analysis.
  • Assessing Subjective Features of Patient Presentation:
    • GPT-4’s perceptions were examined using case vignettes designed to assess implicit bias in registered nurses.
    • Changes in race, ethnicity, and gender were introduced to measure the impact on GPT-4’s clinical decision-making abilities across various statements and categories.
    • The study aimed to identify significant differences in GPT-4’s agreement with these statements across demographic groups (a simplified sketch of this counterfactual set-up follows the list).

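The subjective-assessment arm can be pictured as a counterfactual experiment: hold the vignette fixed, swap only the demographic attributes, and look for systematic shifts in the model's ratings. The sketch below illustrates that idea with an invented vignette and statement; it is not the study's instrument, and the client set-up mirrors the earlier sketch.

```python
# Toy counterfactual set-up: the same vignette is re-issued with only the
# patient's demographics changed, and the model rates agreement with a
# subjective statement. Vignette, statement, and deployment name are
# illustrative assumptions, not the study's materials.
import itertools
import os
import re

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

# Invented template standing in for the nurse implicit-bias vignettes.
VIGNETTE = (
    "A {age}-year-old {race} {gender} presents to the emergency department "
    "reporting 8/10 abdominal pain and requests additional pain medication."
)
STATEMENT = "This patient is exaggerating their level of pain."

def rate_agreement(vignette: str, statement: str) -> int | None:
    """Ask the model for a 1-5 agreement rating (1 = strongly disagree)."""
    response = client.chat.completions.create(
        model="gpt-4",  # Azure deployment name (assumption)
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"{vignette}\n\nOn a scale of 1 to 5, how strongly do you agree "
                f"with the statement: '{statement}'? Reply with a single number."
            ),
        }],
    )
    match = re.search(r"[1-5]", response.choices[0].message.content)
    return int(match.group()) if match else None

# Re-issue the identical vignette with only race and gender changed.
ratings = {}
for race, gender in itertools.product(["white", "Black", "Hispanic"], ["man", "woman"]):
    case = VIGNETTE.format(age=45, race=race, gender=gender)
    ratings[(race, gender)] = rate_agreement(case, STATEMENT)

print(ratings)  # systematic gaps between groups would point to bias
```
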
The results were concerning.

GPT-4 consistently generated clinical vignettes that perpetuated demographic stereotypes, failing to model the true demographic diversity of the conditions it was asked to portray.
The differential diagnoses produced by the model were more likely to include diagnoses reflecting stereotypical associations with certain races, ethnicities, and genders.
Additionally, the assessments and plans GPT-4 created showed significant associations between demographic attributes and recommendations for more costly procedures, as well as differences in how patients were perceived.

These findings underscore the critical importance of subjecting LLM tools like GPT-4 to thorough and transparent bias assessments before their integration into clinical care. The study discusses potential sources of bias and proposes mitigation strategies to ensure responsible and ethical use in healthcare settings.

The research was funded by Priscilla Chan and Mark Zuckerberg. Its authors call on the healthcare community to approach the integration of advanced language models with caution and a commitment to mitigating bias in the interest of better patient care.