ChatGPT vs rheumatologists: cross-sectional study on accuracy and patient perception of AI-generated information for psoriatic arthritis / Forte, Giulio; Mauro, Daniele; Raimondi, Maura; Pantano, Ilenia; Gandolfo, Saviana; Cauli, Alberto; Guggino, Giuliana; Lubrano, Ennio; Guiducci, Serena; Chimenti, Maria Sole; Peluso, Giusy; D'Agostino, Maria Antonietta; Ramonda, Roberta; Caso, Francesco; Costa, Luisa; Ruscitti, Piero; Maioli, Gabriella; Lopalco, Giuseppe; Tirri, Enrico; Caporali, Roberto; Ciccia, Francesco. - In: ANNALS OF THE RHEUMATIC DISEASES. - ISSN 0003-4967. - Electronic. - (2025), pp. 0-0. [10.1016/j.ard.2025.11.012]

ChatGPT vs rheumatologists: cross-sectional study on accuracy and patient perception of AI-generated information for psoriatic arthritis

2025

Abstract

Objectives: Patients with rheumatic diseases frequently turn to online sources for medical information. Large language models, such as ChatGPT, may offer an accessible alternative to conventional patient-education resources; however, their reliability remains poorly explored. We conducted an exploratory, descriptive comparison to examine whether ChatGPT-4 might provide responses comparable to those of experts.

Methods: Seventy-six psoriatic arthritis (PsA) patients generated 32 questions (296 selections) grouped into 6 themes. Each question was answered by ChatGPT-4 and by 12 Italian PsA specialists, each of whom drafted 2-3 answers. Fourteen clinicians rated the accuracy (1-5 Likert scale) and completeness (1-3 scale) of the AI- and human-generated answers. Interrater reliability was calculated, and mixed-effects ordinal logistic models were used to compare sources. In a separate arm, 67 PsA patients reviewed 16 randomly selected answer pairs and indicated their preference. Readability was assessed. No formal sample size calculation was performed; P values were descriptive and interpreted alongside effect sizes and 95% CIs.

Results: Patients most frequently sought information on prognosis/comorbidities (54/76, 71.1%), therapy strategy (48/76, 63.2%), and treatment risks (38/76, 50.0%). Accuracy appeared comparable between ChatGPT and experts, but ChatGPT scored lower in completeness. Accuracy was lower in the pregnancy/fertility domain, with no clearly relevant differences in the other domains. ChatGPT answers were chosen 491/998 times (49.2%), clinician answers 343/998 times (34.4%), and no preference was expressed 164/998 times (16.4%; P < .001), with a relative preference for ChatGPT responses in the prognosis and therapy themes. ChatGPT responses were, on average, more readable across indices.

Conclusions: In this exploratory study, ChatGPT-4 appeared able to generate accurate and readable responses to PsA-related questions and was often preferred by patients.
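As an illustration of the descriptive figures above, the sketch below re-derives the reported preference percentages and P value and shows one plausible way the supporting statistics could be computed. It is a minimal reconstruction, not the authors' analysis code: the uniform-expectation chi-square test, the quadratic-weighted kappa, the textstat library, and all example inputs are assumptions, since the abstract does not name the exact tests, coefficients, or tools used.

# Minimal sketch (not the authors' code): re-derives the reported
# preference counts and illustrates, under stated assumptions, how the
# supporting statistics could be computed.
from scipy.stats import chisquare
from sklearn.metrics import cohen_kappa_score
import textstat

# Patient preferences over 998 answer-pair evaluations (reported counts).
counts = {"ChatGPT": 491, "clinician": 343, "no preference": 164}
total = sum(counts.values())  # 998
for source, n in counts.items():
    print(f"{source}: {n}/{total} ({100 * n / total:.1f}%)")

# Goodness-of-fit against equal preference for the three options;
# scipy's default expected frequencies are uniform. Yields P << .001,
# consistent with the abstract (exact test used there is unspecified).
chi2, p = chisquare(list(counts.values()))
print(f"chi2 = {chi2:.1f}, P = {p:.2e}")

# Interrater reliability between two raters on the 1-5 accuracy scale
# (hypothetical ratings; quadratic-weighted kappa is one common choice).
rater_a = [5, 4, 4, 3, 5, 2, 4, 5]
rater_b = [5, 4, 3, 3, 5, 3, 4, 4]
print("weighted kappa:", cohen_kappa_score(rater_a, rater_b, weights="quadratic"))

# Readability of an answer via common indices (hypothetical answer text).
answer = ("Psoriatic arthritis is a long-term condition that causes "
          "pain and swelling in the joints. Early treatment helps.")
print("Flesch Reading Ease:", textstat.flesch_reading_ease(answer))
print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(answer))
print("Gunning Fog index:", textstat.gunning_fog(answer))

Lower grade-level scores correspond to easier text, which is the direction in which the abstract reports ChatGPT's advantage across readability indices.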
Files in this item:
There are no files associated with this item.

Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/2158/1452177
Citations
  • PMC: 1
  • Scopus: ND
  • Web of Science (ISI): ND