*Title*: Leveraging large language models for heuristic usability assessment of medical software: Insights with the Radiation Planning Assistant.
Original Publication: Reston, VA: American College of Medical Physics, c2000-
Privitera MB, Evans M, Southee D. Human factors in the design of medical devices—Approaches to meeting international standards in the European Union and USA. Appl Ergon. 2017;59:251‐263. doi:10.1016/j.apergo.2016.08.034.
van der Peijl J, Klein J, Grass C, Freudenthal A. Design for risk control: the role of usability engineering in the management of use‐related risks. J Biomed Inform. 2012;45(4):795‐812. doi:10.1016/j.jbi.2012.03.006.
Tase A, Vadhwana B, Buckle P, Hanna GB. Usability challenges in the use of medical devices in the home environment: a systematic review of literature. Appl Ergon. 2022;103:103769. doi:10.1016/j.apergo.2022.103769.
Cardan RA, Covington EL, Popple RA. Code Wisely: risk assessment and mitigation for custom clinical software. J Appl Clin Med Phys. 2021;22(8):273‐279. doi:10.1002/acm2.13348.
Salomons GJ, Kelly D. A survey of Canadian medical physicists: software quality assurance of in‐house software. J Appl Clin Med Phys. 2015;16(1):336‐348. doi:10.1120/jacmp.v16i1.5115.
Cha E, Elguindi S, Onochie I, et al. Clinical implementation of deep learning contour autosegmentation for prostate radiotherapy. Radiother Oncol. 2021;159:1‐7. doi:10.1016/j.radonc.2021.02.040.
Zhang J, Johnson TR, Patel VL, Paige DL, Kubose T. Using usability heuristics to evaluate patient safety of medical devices. J Biomed Inform. 2003;36(1):23‐30. doi:10.1016/S1532‐0464(03)00060‐1.
Jiang M, Liu S, Gao J, Feng Q, Zhang Q. A usability study of 3 radiotherapy systems: a comparative evaluation based on expert evaluation and user experience. Med Sci Monit. 2019;25:578‐589. doi:10.12659/msm.913160.
Chan AJ, Islam MK, Rosewall T, Jaffray DA, Easty AC, Cafazzo JA. Applying usability heuristics to radiotherapy systems. Radiother Oncol. 2012;102(1):142‐147. doi:10.1016/j.radonc.2011.05.077.
Shier AP, Morita PP, Dickie C, Islam M, Burns CM, Cafazzo JA. Design and evaluation of a safety‐centered user interface for radiation therapy. Pract Radiat Oncol. 2018;8(5):e346‐e354. doi:10.1016/j.prro.2018.01.009.
Jiang M, Tu X, Xiao W, et al. Usability testing of radiotherapy systems as a medical device evaluation tool to inform hospital procurement decision‐making. Sci Prog. 2021;104(3):368504211036129. doi:10.1177/00368504211036129.
Gilmore D, Shier A. Usability engineering for a complex medical device: a case study of an MR‐Linac. 2019.
Yang W, Some L, Bain M, Kang B. A comprehensive survey on integrating large language models with knowledge‐based methods. Knowledge‐Based Systems. 2025;318:113503. doi:10.1016/j.knosys.2025.113503.
Maity S, Saikia MJ. Large language models in healthcare and medical applications: a review. Bioengineering (Basel). 2025;12(6). doi:10.3390/bioengineering12060631.
Jang BS, Alcorn SR, McNutt TR, Ehsan U. Hype or reality: utility of large language models in radiation oncology. Int J Radiat Oncol Biol Phys. 2024;120(2, Supplement):e629‐e630. doi:10.1016/j.ijrobp.2024.07.1386.
Zitu MM, Le TD, Duong T, et al. Large language models in cancer: potentials, risks, and safeguards. BJR Artif Intell. 2025;2(1):ubae019. doi:10.1093/bjrai/ubae019.
Court LE, Aggarwal A, Burger H, et al. Radiation planning assistant—a web‐based tool to support high‐quality radiotherapy in clinics with limited resources. J Vis Exp. 2023;(200):e65504. doi:10.3791/65504.
Nealon KA, Balter PA, Douglas RJ, et al. Using failure mode and effects analysis to evaluate risk in the clinical adoption of automated contouring and treatment planning tools. Pract Radiat Oncol. 2022;12(4):e344‐e353. doi:10.1016/j.prro.2022.01.003.
Nealon KA, Douglas RJ, Han EY, et al. Hazard testing to reduce risk in the development of automated planning tools. J Appl Clin Med Phys. 2023;24(8):e13995. doi:10.1002/acm2.13995.
Kisling K, Johnson JL, Simonds H, et al. A risk assessment of automated treatment planning and recommendations for clinical deployment. Med Phys. 2019;46(6):2567‐2574. doi:10.1002/mp.13552.
Court L, Aggarwal A, Burger H, et al. Addressing the global expertise gap in radiation oncology: the radiation planning assistant. JCO Glob Oncol. 2023;9:e2200431. doi:10.1200/go.22.00431.
Court LE. The radiation planning assistant: addressing the global gap in radiotherapy services. Lancet Oncol. 2024;25(3):277‐278. doi:10.1016/S1470‐2045(24)00084‐6.
Court LE, Aggarwal A, Jhingran A, et al. Artificial intelligence‐based radiotherapy contouring and planning to improve global access to cancer care. JCO Glob Oncol. 2024;10:e2300376. doi:10.1200/GO.23.00376.
*Further Information*
*Background: Usability engineering is essential for ensuring the safety and effectiveness of medical software, as design-related issues are a leading cause of use errors in clinical settings. Heuristic evaluation provides a practical approach to identifying usability problems, but its outcomes depend heavily on expert interpretation. Large Language Models (LLMs), such as ChatGPT, offer a potential means to augment heuristic evaluation by generating structured, context-aware usability feedback. This study explored the use of ChatGPT to support heuristic assessment of the Radiation Planning Assistant (RPA), a web-based radiotherapy planning tool designed to support clinical teams in low- and middle-income countries.
Methods: ChatGPT was provided with the RPA user and technical guides, training videos for each functional dashboard, and Zhang et al.'s 14 usability heuristics. The model was instructed to score each dashboard according to these heuristics, using Zhang's 0-4 severity scale, and to propose concrete interface improvements. The resulting feedback was reviewed and scored independently by the RPA developer team and by 13 users during a dedicated User Meeting. Comparative analysis was performed between ChatGPT, developer, and user ratings.
Results: ChatGPT identified 26 potential usability issues across six heuristic domains. The developer team considered nine of these actionable, though all were classified as minor (severity ≤ 2). User ratings showed wide variability, with nine suggestions achieving mean scores ≥ 1.5. Qualitative agreement between users and developers was limited, underscoring the importance of diverse perspectives in heuristic evaluation. Three suggestions (enhanced upload logs, reversible actions such as a "reopen request" option, and stronger error prevention) were rated as potentially high priority by a minority of users. ChatGPT's ratings were consistent across dashboards.
Conclusions: While ChatGPT did not reveal any critical usability failures, its heuristic assessment proved valuable in prompting discussion, identifying minor refinements, and enriching both developer and user engagement with the RPA's interface design. This study demonstrates that LLMs can serve as an effective, low-cost complement to conventional heuristic evaluation, supporting early-stage usability review and stakeholder dialogue in the development of medical software.
(© 2026 The Author(s). Journal of Applied Clinical Medical Physics published by Wiley Periodicals LLC on behalf of American Association of Physicists in Medicine.)*