Objective: To assess the accuracy and comprehensiveness of health information on urinary incontinence generated by different large language models (LLMs).
Methods: Using the website www.answerthepublic.com, we retrieved the most frequently searched questions related to urinary incontinence. After applying exclusion criteria, the selected questions, categorized into definition/diagnosis, causes, treatment, complications, and others, were input into three LLMs: GPT-3.5, GPT-4, and BARD. Outputs were rated for accuracy and comprehensiveness by two urologists using a Likert scale.
Results: Of the initial 630 questions, 38 were selected for analysis. GPT-4 demonstrated superior accuracy, with 73.68% of its responses achieving the maximum score, significantly outperforming GPT-3.5 (42.11%) and BARD (28.95%). GPT-4 also excelled in comprehensiveness, with 71.05% of its responses achieving the maximum score, compared with 36.84% for GPT-3.5 and 28.95% for BARD. For the 'causes' category, GPT-4 provided significantly more comprehensive responses.
Conclusion: While all three LLMs generated relevant health information on urinary incontinence, GPT-4 showed superior accuracy and comprehensiveness. However, the potential of these models to generate incorrect information warrants caution in their use.