*Result*: Evaluating the Efficacy of AI-Based Interactive Assessments Using Large Language Models for Depression Screening: Development and Usability Study.
*Further Information*
*Background: The evolution of language models, particularly large language models, has introduced transformative potential for psychological assessment, challenging traditional rating scale methods that have dominated clinical practice for over a century.
Objective: This study aimed to develop and validate an automated assessment paradigm that integrates natural language processing with conventional measurement tools to assess depressive symptoms, exploring its feasibility as a novel approach in psychological evaluation.
Methods: A cohort of 115 participants, including 28 (24.3%) individuals diagnosed with depression, completed the Beck Depression Inventory-Fast Screen (BDI-FS) via a custom ChatGPT interface (BDI-FS-GPT) and the Chinese version of the Patient Health Questionnaire-9 (PHQ-9). Statistical analyses included the Spearman correlation (PHQ-9 vs BDI-FS-GPT scores), the Cohen κ (diagnostic agreement), and area under the curve (AUC) evaluation.
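The analysis pipeline named in Methods can be sketched as below. All scores and labels are synthetic placeholders, not the study's data; the screening cutoffs (BDI-FS ≥3, PHQ-9 ≥5) are taken from the Results section, and the variable names are illustrative assumptions only:

```python
# Illustrative sketch of the three analyses: Spearman correlation,
# Cohen's kappa, and AUC. Data below are randomly generated, not real.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score, roc_auc_score

rng = np.random.default_rng(0)
n = 115  # cohort size reported in the study

# Hypothetical, loosely correlated instrument scores on each scale's range
latent = rng.normal(size=n)
phq9 = np.clip(np.round(10 + 5 * latent + rng.normal(scale=3, size=n)), 0, 27)
bdi_fs = np.clip(np.round(6 + 3 * latent + rng.normal(scale=2, size=n)), 0, 21)

# Spearman rank correlation between the two instruments' total scores
rho, p_value = spearmanr(phq9, bdi_fs)

# Dichotomize at the screening cutoffs reported in Results
phq9_pos = (phq9 >= 5).astype(int)
bdi_pos = (bdi_fs >= 3).astype(int)

# Cohen's kappa for diagnostic agreement between the two screens
kappa = cohen_kappa_score(phq9_pos, bdi_pos)

# AUC of the BDI-FS total score against a synthetic diagnosis label
diagnosis = (latent > 0.7).astype(int)
auc = roc_auc_score(diagnosis, bdi_fs)
```

In practice the dichotomized screen results would also be compared against the clinician's diagnosis (as in the reported κ of 0.72 and 0.55), following the same `cohen_kappa_score` pattern.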
Results: Spearman analysis revealed a moderate correlation between PHQ-9 and BDI-FS-GPT scores. The Cohen κ indicated moderate diagnostic agreement between the PHQ-9 and the BDI-FS-GPT (κ=0.43; 76.5% agreement), substantial agreement between the BDI-FS-GPT and the clinical diagnosis (κ=0.72; 88.7% agreement), and moderate agreement between the PHQ-9 and the clinical diagnosis (κ=0.55; 71.4% agreement). The BDI-FS-GPT demonstrated excellent diagnostic accuracy (AUC=0.953) at a cutoff of 3, detecting 89.3% of participants with depression with an 11.5% false-positive rate compared to the PHQ-9 (AUC=0.859) at a cutoff of 5 (sensitivity=71.4%; false-positive rate=13.8%). Participants also reported significantly higher satisfaction with the automated assessment compared to the traditional scale (P=.02).
Conclusions: The automated assessment paradigm combines the interactivity and personalization of natural language processing-powered tools with the psychometric rigor of traditional scales, suggesting preliminary feasibility as a model for future psychological assessment. Its ability to enhance engagement while maintaining reliability and validity provides encouraging evidence, warranting validation in larger and more diverse samples as large language model technology advances.
International Registered Report Identifier (irrid): RR2-10.1101/2024.07.19.24310543.
(©Zheng Jin, Jiaxing Hu, Dandan Bi, Kaibin Zhao, Huan Yu. Originally published in JMIR Formative Research (https://formative.jmir.org), 13.01.2026.)*