Collective learnings are key as the healthcare industry explores how large language model-based tools can be safely designed and deployed

Large language models (LLMs) and generative AI tools like ChatGPT have made a big splash in recent months, and questions are now proliferating about how these technologies might be used in healthcare settings. With any new technology, we must look not only at its potential utility and power but also at the limitations we must address as we design and deploy solutions that use it.

Sponsored by Microsoft

While LLMs are fundamentally different from the types of Healthcare AI tools we already see deployed, LLMs present some of the same challenges to responsible design and deployment – including ensuring the tools are safe, efficacious, and reliable. We believe that many of the same practices – Responsible AI practices – that were used to bring the current generation of Healthcare AI tools to market will be useful as the industry considers LLM-based tools.

How does generative AI work?

GPT-4[i] is a powerful LLM designed to generate human-like text from input prompts, drawing on statistical patterns of language structure learned during training. OpenAI’s GPT models have been trained on vast amounts of textual data and excel at natural language tasks such as translation, summarisation and question-answering. ChatGPT[ii] is an application that puts a natural language chat interface around GPT models such as GPT-4, resulting in a powerful AI assistant. As powerful as they are, it is important to remember that while LLMs generate compelling, human-sounding output, they may sometimes generate inaccurate output (commonly referred to as “hallucinations”).[iii]


Flavia Rovis

John Doyle

Steve Mutkoski

What are the potential capabilities and limitations of LLMs in healthcare?

While LLMs like GPT-4 are general-purpose technologies not specifically designed for healthcare uses, in the future they will likely be utilised in solutions in the healthcare domain. Microsoft Research and OpenAI explored the potential uses of LLMs in healthcare settings, looking at both capabilities and limitations.[iv] Subsequently, researchers from Microsoft and Stanford’s Human-Centered Artificial Intelligence program conducted a quantitative assessment of GPT-4’s capacity to enhance healthcare professionals’ performance, examining the accuracy and limitations of outputs generated by LLMs like GPT-4.[v] Together, these two articles shed light on both the opportunities and the challenges that lie ahead in determining where and how LLMs can be utilised in the healthcare domain.

The New England Journal of Medicine article suggests potential applications of LLMs in medicine, including:

  • Presenting information to clinicians or patients by searching through data obtained from public sources
  • Providing background information about a patient or a summary of lab results
  • Reducing administrative burdens of clinicians by generating clinical notes or assisting with repetitive tasks
  • Supporting medical education and research with the ability to summarise research articles

In their quantitative evaluation of LLM performance in “curbside consultations”, the Microsoft and Stanford research team highlighted some important limitations of GPT-3.5/GPT-4. The research suggests that these LLMs may not meet clinicians’ real-world information needs “out of the box”, meaning developers will need to address model limitations in the solutions they build with the LLMs. The researchers highlighted that while GPT-4’s responses in curbside consultations were deemed “safe” 93 per cent of the time, only 41 per cent of GPT-4’s responses aligned with the known answer when assessed by human reviewers. One interpretation of these data points is that GPT-4’s output was “safe” but lacking in “efficacy”, another important measure for digital health tools. Lastly, the researchers found that GPT-4 provided different answers to the same prompt, suggesting “reliability” issues that need to be addressed. The team’s summary concludes with optimism that these limitations may be addressed with “advanced prompt engineering, harnessing methods for grounding generations on relevant literature, and fine-tuning on local data.”
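To make the three measures above concrete (safety, agreement with a known answer, and consistency across repeated prompts), here is a minimal sketch in Python of how such rates might be tallied from human reviewer labels. The `ReviewedResponse` record, its field names, and the toy data are illustrative assumptions only. They do not reproduce the researchers' data, definitions, or methodology.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewedResponse:
    """One model response with hypothetical human reviewer labels."""
    question_id: str  # identifies the prompt that was asked
    answer: str       # normalised model answer (illustrative)
    safe: bool        # reviewer judged the response safe
    agrees: bool      # reviewer judged it aligned with the known answer


def safety_rate(responses):
    """Fraction of responses that reviewers labelled safe."""
    if not responses:
        raise ValueError("no responses to evaluate")
    return sum(r.safe for r in responses) / len(responses)


def agreement_rate(responses):
    """Fraction of responses that aligned with the known answer."""
    if not responses:
        raise ValueError("no responses to evaluate")
    return sum(r.agrees for r in responses) / len(responses)


def consistency_rate(responses):
    """Fraction of prompts whose repeated runs all gave the same answer."""
    if not responses:
        raise ValueError("no responses to evaluate")
    answers_by_question = defaultdict(set)
    for r in responses:
        answers_by_question[r.question_id].add(r.answer)
    consistent = sum(len(a) == 1 for a in answers_by_question.values())
    return consistent / len(answers_by_question)


# Toy, made-up labels: three prompts, each asked twice.
reviews = [
    ReviewedResponse("q1", "A", safe=True, agrees=True),
    ReviewedResponse("q1", "A", safe=True, agrees=True),
    ReviewedResponse("q2", "B", safe=True, agrees=False),
    ReviewedResponse("q2", "C", safe=True, agrees=False),
    ReviewedResponse("q3", "D", safe=False, agrees=False),
    ReviewedResponse("q3", "D", safe=True, agrees=True),
]

print(f"safety:      {safety_rate(reviews):.0%}")
print(f"agreement:   {agreement_rate(reviews):.0%}")
print(f"consistency: {consistency_rate(reviews):.0%}")
```

Reporting the three rates separately keeps the distinction drawn above visible: a tool can score well on safety while still falling short on efficacy or reliability.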

Responsible AI and LLMs

Microsoft has six core principles that are the foundation for our development of AI systems: (a) fairness, (b) reliability and safety, (c) privacy and security, (d) inclusiveness, (e) transparency, and (f) accountability.[vi] We put these principles into practice through our Responsible AI Standard, with the intent of addressing the challenges and limitations identified as LLMs are applied in healthcare settings.


Principles similar to those listed above have been widely used by developers in bringing the existing generation of Healthcare AI to market. According to a list provided by one regulator, more than 500 AI/ML-enabled medical devices have received regulatory clearance and are currently in use in healthcare settings.[vii] We are optimistic that as we and others in the industry continue to advance Responsible AI practices, our collective learnings will be useful as the industry explores how LLM-based tools can be safely designed for healthcare settings.

Flavia Rovis is senior account technology strategist at Microsoft.

John Doyle is global chief technology officer for Microsoft Healthcare & Life Sciences.

Steve Mutkoski is legal and regulatory director for Microsoft’s Health and Life Sciences industry team.


[i] GPT-4

[ii] Introducing ChatGPT

[iii] ChatGPT and LLMs: what’s the risk - NCSC.GOV.UK

[iv] Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine | NEJM

[v] How Well Do Large Language Models Support Clinician Information Needs? (with manuscript)

[vi] Responsible AI principles from Microsoft

[vii] Artificial Intelligence and Machine Learning (AI/ML)-Enabled Medical Devices | FDA