Introduction: The increasing prevalence of dementia places a significant burden on caregivers, often family members, who require accurate, reliable, and understandable information to provide quality care. This study evaluates the quality, reliability, readability, and originality of AI-generated responses to common dementia caregiver queries using ChatGPT-4o and Google Gemini 2.0 Flash. Methodology: This cross-sectional study, conducted from February 15 to March 15, 2025, involved 10 questions about dementia caregiving. Responses were collected from ChatGPT-4o and Google Gemini 2.0 Flash. Two evaluators assessed the replies using the Modified DISCERN scale for reliability, the Global Quality Scale (GQS) for quality, and readability metrics (Flesch Reading Ease Score and Flesch-Kincaid Grade Level). Originality was assessed using Turnitin's similarity index. Statistical analysis included Cohen's kappa for inter-rater agreement and unpaired t-tests to compare scores. Results: Google Gemini outperformed ChatGPT in reliability but was less original. Both models had similar readability scores, with Flesch-Kincaid Grade Levels above 11; neither tool met the recommended sixth-grade readability level. GQS scores were slightly higher for ChatGPT, though the difference was not statistically significant. Conclusion: While both AI tools provide clinically relevant information, ChatGPT offers greater originality, whereas Google Gemini provides more reliable content. However, neither model achieves optimal readability for dementia caregivers, highlighting the need for improved readability, citation transparency, and hybrid AI models that balance reliability and adaptability for caregiver support.
The rising prevalence of dementia is a major worldwide health concern, imposing a considerable strain on families and caregivers [1]. Geriatric caregivers, often family members, are essential in delivering care and assistance to people living with dementia [2]. These caregivers face numerous challenges, including managing complex symptoms, navigating healthcare systems, and coping with the emotional and physical demands of caregiving [3]. To deliver quality care while preserving their own well-being, caregivers must have access to accurate, reliable, and easily understandable information [4].
In today's digital age, caregivers frequently turn to online resources for information and support. Although the internet provides an enormous amount of information, the quality and reliability of online health resources vary considerably [5]. The emergence of large language models (LLMs), such as ChatGPT and Google Gemini, has introduced a new avenue for accessing information. These AI-powered tools produce human-like text in response to user queries, offering quick and convenient access to information about dementia and caregiving strategies. However, the quality, reliability, and readability of LLM-generated information for this vulnerable population remain largely unexplored. Evaluating AI-generated responses is crucial to ensure that caregivers receive accurate and helpful information that supports, rather than hinders, their caregiving efforts.
This study aims to assess the reliability, quality, readability, and similarity of responses produced by ChatGPT-4o and Google Gemini 2.0 Flash to common geriatric caregiver queries about dementia.
This observational cross-sectional study was conducted over one month (February 15 to March 15, 2025).
Selection of Queries.
We initially formulated 14 questions based on frequent inquiries from caregivers in the geriatric outpatient department. To evaluate their relevance, three specialists (two geriatricians and one internal medicine physician) independently rated each question on a 4-point scale (1 = not relevant to 4 = highly relevant). The Item-Level Content Validity Index (I-CVI) for each question was calculated as the number of raters assigning a score of 3 or 4 divided by the total number of raters (n = 3). Following Lynn's criteria for three raters [6] (a worked sketch of this calculation follows the criteria below):
I-CVI = 1.00 (unanimous 3/3 ratings ≥3): Retained without revision.
I-CVI = 0.67 (2/3 ratings ≥3): Questions were revised based on feedback from the dissenting rater and re-evaluated.
I-CVI < 0.67 (≤1/3 ratings ≥3): Discarded due to inadequate relevance.
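As a minimal sketch of this calculation (the ratings below are hypothetical and for illustration only, not the study's actual rater data):

```python
# Minimal sketch of the I-CVI calculation described above.
# The ratings below are hypothetical and for illustration only.
ratings = {
    "Q1": [4, 4, 3],  # all three raters scored >=3 -> I-CVI = 1.00, retained
    "Q2": [4, 2, 3],  # two of three scored >=3   -> I-CVI = 0.67, revised
    "Q3": [2, 1, 3],  # one of three scored >=3   -> I-CVI = 0.33, discarded
}

for question, scores in ratings.items():
    relevant = sum(1 for s in scores if s >= 3)  # raters giving 3 or 4
    i_cvi = relevant / len(scores)               # divide by n = 3 raters
    print(f"{question}: I-CVI = {i_cvi:.2f}")
```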
After this process, 4 questions were eliminated, and questions with I-CVI = 0.67 were revised based on rater feedback, yielding a final set of 10 questions with acceptable validity (I-CVI ≥ 0.67). The questions were as follows:
Data Collection Procedure.
On a single day (February 25, 2025), the queries were submitted to two AI tools, ChatGPT-4o [7] and Google Gemini 2.0 Flash [8], using the Chrome browser with the cache cleared. Each query was entered into a freshly created chatbot account, ensuring that no prior responses were present in the conversation history, and each query was pasted into a separate "chat". Both tools were used in their default settings, without any fine-tuning or modification, and received identical queries to ensure consistency of inputs. The generated responses were recorded in two separate Microsoft Word documents. The responses were then blinded to their source and assessed separately by two evaluators (Evaluator 1 and Evaluator 2), both of whom had experience in geriatrics and evaluated independently to reduce bias.
Quality and reliability assessment.
Each evaluator assessed the blinded responses using the Modified DISCERN scale (Supplementary file 1) and the Global Quality Scale (GQS) (Supplementary file 2) to determine reliability and quality, respectively. The overall quality of the answers was judged using the GQS, a 5-point Likert scale on which 1 indicates the worst response quality and 5 the best [9]. DISCERN is a tool designed to help consumers of health information evaluate the quality of the content [10]. The responses' reliability was assessed using a modified version of the DISCERN scale, similar to the one used by Saji et al. [11], with higher scores indicating greater reliability. In this modified scale, each "yes" answer scores 1 and each "no" answer scores 0.
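A minimal sketch of this scoring scheme, using the five items listed in Supplementary file 1 with hypothetical yes/no answers:

```python
# Minimal sketch of the modified DISCERN scoring described above:
# each "yes" scores 1 and each "no" scores 0, so totals range from
# 0 (least reliable) to 5 (most reliable). The items are those listed
# in Supplementary file 1; the answers below are hypothetical.
response_checklist = {
    "Are the aims clear and achieved?": True,
    "Are reliable sources of information used?": False,
    "Is the information presented balanced and unbiased?": True,
    "Are additional sources of information listed for patient reference?": False,
    "Does it refer to areas of uncertainty?": True,
}
modified_discern = sum(1 for answered_yes in response_checklist.values() if answered_yes)
print(f"Modified DISCERN score: {modified_discern}/5")  # -> 3/5
```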
Readability assessment.
We evaluated the readability of the AI responses using two recognized metrics: the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade Level (FKGL) [12]. The FRES, which ranges from 0 to 100, indicates how easy a text is to read, while the FKGL, derived from the same inputs, estimates the years of American grade-school education needed to understand the text. Both metrics have been validated in the literature and are used to evaluate the readability of patient-centered medical texts in healthcare [13]. We determined the readability scores by pasting the AI responses into a free, open-access online readability calculator [14].
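The study obtained these scores from the online calculator [14]; for reference, the standard formulas behind both metrics are sketched below. The syllable counter is a crude vowel-group heuristic, so its output will differ slightly from dedicated tools:

```python
import re

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; real calculators use dictionaries."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # words per sentence
    spw = syllables / len(words)   # syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw  # Flesch Reading Ease Score
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    return fres, fkgl

fres, fkgl = readability("Dementia affects memory. Caregivers need clear, simple guidance.")
print(f"FRES = {fres:.1f}, FKGL = {fkgl:.1f}")
```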
Evaluation of similarity.
We examined the responses for plagiarism and its extent. We provided the matching software [15] with separate Word documents containing the individual responses generated by the two AI tools to all 10 questions. Turnitin generated a similarity report that included an overall similarity index (OSI): the percentage of text, excluding quoted material and references, that matched existing sources. The goal was to evaluate whether the AI tools were generating original content or merely repackaging previously published material.
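As a toy illustration of what the OSI represents (Turnitin's actual matching algorithm is proprietary, and the word counts below are hypothetical):

```python
# Toy illustration of what an overall similarity index (OSI) represents.
# Turnitin's actual matching algorithm is proprietary; this only shows
# the percentage interpretation used in its report.
def overall_similarity_index(matched_words: int, total_words: int) -> float:
    """Share of the document's words that matched existing sources,
    after excluding quoted text and the reference list."""
    return 100.0 * matched_words / total_words

# Hypothetical counts: 71 matched words in a 500-word response.
print(f"OSI = {overall_similarity_index(71, 500):.1f}%")  # -> 14.2%
```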
Statistical analysis.
The study data were arranged methodically in a Microsoft Excel spreadsheet, and statistical analysis was performed in SPSS 27. We used Cohen's kappa to assess inter-rater agreement for the modified DISCERN and GQS scores. After confirming inter-rater agreement, we compared the mean GQS score, modified DISCERN score, ease score, grade level, and similarity percentage between ChatGPT and Gemini using unpaired t-tests. A p-value of less than 0.05 was considered statistically significant.
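A minimal sketch of these two tests in Python rather than SPSS (the per-question GQS averages are taken from Supplementary file 3, while the per-rater arrays are hypothetical, since individual rater scores are not reproduced here):

```python
# Minimal sketch of the statistical analysis described above.
from scipy.stats import ttest_ind
from sklearn.metrics import cohen_kappa_score

# Inter-rater agreement (unweighted Cohen's kappa on categorical scores).
rater1 = [4, 4, 3, 4, 4, 4, 4, 4, 4, 3]  # hypothetical rater scores
rater2 = [4, 4, 4, 4, 4, 4, 4, 4, 4, 3]  # hypothetical rater scores
kappa = cohen_kappa_score(rater1, rater2)

# Unpaired (independent-samples) t-test comparing mean GQS scores;
# values are the per-question averages from Supplementary file 3.
chatgpt_gqs = [4, 4, 4, 4, 4, 4, 4, 4, 4, 3]
gemini_gqs = [3, 3, 2.5, 4, 3, 4, 4, 4, 4, 3]
t_stat, p_value = ttest_ind(chatgpt_gqs, gemini_gqs)  # equal variances assumed

print(f"kappa = {kappa:.3f}, t = {t_stat:.2f}, p = {p_value:.3f}")
```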
Ethical Considerations.
Because this research used only data produced by ChatGPT and Google Gemini and involved no human participants, ethical approval was not required.
There was moderate agreement between the two evaluators' modified DISCERN scores for the ChatGPT- and Google Gemini-generated responses (κ = 0.476, p = 0.009) and near-perfect agreement between their GQS scores (κ = 0.885, p < 0.001). (Individual scores and raw data are provided in Supplementary file 3.)
Ten approved caregiver questions about dementia were submitted to both ChatGPT and Google Gemini, and the answers were assessed on five metrics: readability (Ease Score and Grade Level), originality (similarity percentage), quality (GQS), and reliability (Modified DISCERN). An independent-samples t-test was conducted to compare the mean scores of the two AI tools. Table 1 displays the characteristics of the responses of ChatGPT and Google Gemini to caregiver queries.
Quality and reliability.
ChatGPT demonstrated a slightly higher Global Quality Scale (GQS) score (3.90) than Gemini (3.45), although this difference did not reach statistical significance (p = 0.055). For reliability, assessed using the Modified DISCERN scale, ChatGPT scored significantly lower (3.10) than Gemini (4.00); this difference was statistically significant (p < 0.001), indicating that Gemini provided more reliable information in its responses.
Readability.
The Ease Score of ChatGPT (25.74) was lower than that of Google Gemini (31.10); however, this difference was not statistically significant (p = 0.151). Similarly, the Flesch-Kincaid Grade Level was slightly lower for ChatGPT (11.43) compared to Gemini (12.16), indicating slightly simpler readability, though again the distinction was not statistically significant (p = 0.192).
Originality.
The similarity percentage was significantly lower for ChatGPT (14.20%) compared to Google Gemini (33.60%), suggesting that ChatGPT’s responses were more original. This difference was statistically significant (p = 0.023).
Table 1. Characteristics of responses generated by ChatGPT and Google Gemini

| Parameters | ChatGPT Mean (SD) | Google Gemini Mean (SD) | P value* |
| --- | --- | --- | --- |
| Ease Score | 25.74 (10.36) | 31.10 (3.69) | 0.151 |
| Grade Level | 11.43 (1.53) | 12.16 (0.69) | 0.192 |
| Similarity % | 14.20 (12.85) | 33.60 (20.54) | 0.023 |
| GQS | 3.90 (0.32) | 3.45 (0.60) | 0.055 |
| Modified DISCERN | 3.10 (0.32) | 4.00 (0.41) | <0.001 |
* P-values <0.05 are considered statistically significant.
Supplementary file 1. Modified DISCERN score.
| Item | Question |
| --- | --- |
| 1 | Are the aims clear and achieved? |
| 2 | Are reliable sources of information used? (i.e., publication cited, the responses are from valid studies/sources) |
| 3 | Is the information presented balanced and unbiased? |
| 4 | Are additional sources of information listed for patient reference? |
| 5 | Does it refer to areas of uncertainty? |
Supplementary file 2. Global quality score.
| Score | Global score description |
| --- | --- |
| 1 | Poor quality, poor flow of the site, most information missing, not at all useful for patients |
| 2 | Generally poor quality and poor flow, some information listed but many important topics missing, of very limited use to patients |
| 3 | Moderate quality, suboptimal flow, some important information is adequately discussed but others poorly discussed, somewhat useful for patients |
| 4 | Good quality and generally good flow, most of the relevant information is listed, but some topics not covered, useful for patients |
| 5 | Excellent quality and excellent flow, very useful for patients |
Supplementary file 3. Individual scores and raw data
| AI | Question No. | FRES | FKGL | Similarity % | GQS | Modified DISCERN |
| --- | --- | --- | --- | --- | --- | --- |
| ChatGPT | 1 | 29.2 | 11.2 | 41 | 4 | 3 |
| ChatGPT | 2 | 22.5 | 11.8 | 6 | 4 | 3 |
| ChatGPT | 3 | 0 | 15.3 | 27 | 4 | 3.5 |
| ChatGPT | 4 | 30.8 | 10.6 | 8 | 4 | 3 |
| ChatGPT | 5 | 21.3 | 12.2 | 23 | 4 | 3 |
| ChatGPT | 6 | 38.3 | 9.8 | 18 | 4 | 3.5 |
| ChatGPT | 7 | 30.9 | 10.6 | 7 | 4 | 3.5 |
| ChatGPT | 8 | 23.4 | 11.4 | 0 | 4 | 3 |
| ChatGPT | 9 | 31 | 10.5 | 3 | 4 | 3 |
| ChatGPT | 10 | 30 | 10.9 | 9 | 3 | 2.5 |
| Gemini | 1 | 26 | 12.5 | 69 | 3 | 4 |
| Gemini | 2 | 32.3 | 12.1 | 29 | 3 | 5 |
| Gemini | 3 | 31 | 12.6 | 52 | 2.5 | 3.5 |
| Gemini | 4 | 35.1 | 11 | 26 | 4 | 4 |
| Gemini | 5 | 23.2 | 13.5 | 42 | 3 | 3.5 |
| Gemini | 6 | 30.9 | 12.7 | 52 | 4 | 4 |
| Gemini | 7 | 33.1 | 11.8 | 8 | 4 | 4 |
| Gemini | 8 | 33 | 11.9 | 2 | 4 | 4 |
| Gemini | 9 | 33.3 | 11.7 | 29 | 4 | 4 |
| Gemini | 10 | 33.1 | 11.8 | 27 | 3 | 4 |
This study demonstrates that while both ChatGPT and Google Gemini generate clinically relevant responses to dementia caregiver queries, significant differences exist in reliability and originality. Google Gemini provided more reliable information (higher modified DISCERN scores) but exhibited greater text similarity, while ChatGPT showed superior originality at the expense of reliability. The lower reliability of ChatGPT's responses was mainly due to the absence of references. These findings contrast with those of Saji et al. [11], who reported similar reliability for ChatGPT and Google Gemini; however, that study used Google Gemini 1.5 rather than the 2.0 Flash model evaluated here, suggesting that the newer Gemini model is more reliable. The reliability advantage of Gemini mirrors findings in palliative care myth-debunking studies [16], while ChatGPT's originality aligns with its performance in generating contextually adaptive content [17].
Plagiarism in medical responses undermines trust and can lead to harmful misinformation; providing original, well-researched content and properly citing sources is essential to maintaining the integrity of medical knowledge. ChatGPT produced significantly less similar content (14.20%, p = 0.023), remaining below the commonly accepted 15-20% similarity threshold for good academic writing [18], whereas Google Gemini failed to meet this standard. The inverse relationship between originality and reliability suggests a trade-off: systems prioritizing citation accuracy may sacrifice linguistic novelty. This dichotomy underscores the need for hybrid models combining evidence-based rigor with adaptive communication strategies [19].
Notably, this study extends prior work [16,20] by quantifying readability barriers, revealing that even "high-quality" AI responses may exceed the health literacy levels of many caregivers. Readability scores did not differ significantly between the platforms, with both requiring at least an 11th-grade reading level. The American Medical Association advises that health information for patients be written at a sixth-grade reading level or below [21]; neither AI achieved this.
The findings of this study carry important implications for clinicians, caregivers, and health technology developers. Clinicians can consider leveraging AI models like Google Gemini for delivering fact-based, reliable information. For developers and policymakers, the results underscore the necessity of embedding readability optimization and sourcing transparency into AI tools targeted at caregiver support.
This study provides a comprehensive evaluation of AI-generated responses to dementia caregiver queries, comparing the performance of ChatGPT-4o and Google Gemini 2.0 Flash across readability, originality, quality, and reliability. The findings highlight distinct strengths and weaknesses of each tool: while ChatGPT demonstrated superior originality, Google Gemini offered higher reliability, albeit with less original content. Both tools, however, failed to meet the recommended sixth-grade readability level for health information, potentially limiting their accessibility to caregivers with lower health literacy.
The trade-off between originality and reliability underscores the need for AI developers to prioritize both evidence-based accuracy and adaptive communication strategies in future models. Clinicians and caregivers should be mindful of these limitations when leveraging AI tools for information, ensuring supplemental verification of critical details. Policymakers and developers must focus on enhancing readability, transparency, and cultural adaptability to make AI-generated health information more inclusive and effective.
In summary, while AI tools like ChatGPT and Google Gemini hold promise for caregiver support, their current iterations require refinement to ensure they deliver accurate, understandable, and actionable information tailored to the diverse needs of dementia caregivers.
Future research should examine how AI-generated guidance affects caregivers' decisions and well-being over time, and how well these tools serve people with different reading skills and different types of dementia. Mixed-methods studies and randomized controlled trials comparing AI-only guidance with human-AI collaboration are needed to understand the effectiveness and limits of these tools in real-life caregiving situations. Furthermore, developing hybrid models that combine evidence-based databases with adaptive communication could help resolve the trade-off between originality and reliability observed in this study.
Limitations.
This study examined only two AI tools, ChatGPT-4o and Google Gemini 2.0 Flash, omitting other models that may perform better. Furthermore, the analysis was limited to English-language content, overlooking significant barriers to generalizing the findings, including translation accuracy, non-English medical terminology, and cultural adaptation. The analysis is also constrained by its narrow focus on 10 queries and the absence of a real-world impact assessment: unlike longitudinal designs that evaluate caregiver outcomes, this cross-sectional analysis cannot determine whether AI responses lead to improved care practices or reduced caregiver burden, and no patient perspectives or real-world validations were included. Finally, we evaluated readability scores but did not investigate the gap between simplified language and true patient understanding.