Introduction: The increasing prevalence of dementia places a significant burden on caregivers, often family members, who require accurate, reliable, and understandable information to provide quality care. This study evaluates the quality, reliability, readability, and originality of AI-generated responses to common dementia caregiver queries using ChatGPT-4o and Google Gemini 2.0 Flash. Methodology: This cross-sectional study, conducted from February 15 to March 15, 2025, involved 10 questions about dementia caregiving. Responses were collected from ChatGPT-4o and Google Gemini 2.0 Flash. Two evaluators assessed the replies using the Modified DISCERN scale for reliability, the Global Quality Scale (GQS) for quality, and readability metrics (Flesch Reading Ease Score and Flesch-Kincaid Grade Level). Originality was assessed using Turnitin's similarity index. Statistical analysis included Cohen's kappa for inter-rater agreement and unpaired t-tests to compare scores. Results: Google Gemini outperformed ChatGPT in reliability but was less original. Both models had similar readability scores, with Flesch-Kincaid Grade Levels above 11; neither tool met the recommended sixth-grade readability level. GQS scores were slightly higher for ChatGPT, though the difference was not statistically significant. Conclusion: While both AI tools provide clinically relevant information, ChatGPT offers greater originality, whereas Google Gemini provides more reliable content. However, neither model achieves optimal readability for dementia caregivers, highlighting the need for improved readability, citation transparency, and hybrid AI models that balance reliability and adaptability for caregiver support.
The rising prevalence of dementia is a major worldwide health concern, imposing a considerable strain on families and caregivers [1]. Geriatric caregivers, often family members, are essential in delivering care and assistance to people living with dementia [2]. These caregivers face numerous challenges, including managing complex symptoms, navigating healthcare systems, and coping with the emotional and physical demands of caregiving [3]. To deliver quality care while preserving their own well-being, caregivers must have access to accurate, reliable, and easily understandable information [4].
In today's digital age, caregivers frequently turn to online resources for information and support. Although the internet provides an enormous amount of information, the quality and reliability of online health resources vary considerably [5]. The emergence of large language models (LLMs), such as ChatGPT and Google Gemini, has introduced a new avenue for accessing information. These AI-powered tools produce human-like text in response to user queries, offering quick and convenient access to information about dementia and caregiving strategies. However, the quality, reliability, and readability of LLM-generated information for this vulnerable population remain largely unexplored. Evaluating AI-generated responses is crucial to ensure that caregivers receive accurate and helpful information that supports, rather than hinders, their caregiving efforts.
This study aims to assess the reliability, quality, readability, and similarity of responses produced by ChatGPT-4o and Google Gemini 2.0 Flash to common geriatric caregiver queries about dementia.
This observational cross-sectional study was conducted over one month (February 15 to March 15, 2025).
Selection of Queries.
We initially formulated 14 questions based on frequent inquiries from caregivers in the geriatric outpatient department. To evaluate their relevance, three specialists (two geriatricians and one internal medicine physician) independently rated each question on a 4-point scale (1 = not relevant to 4 = highly relevant). The Item-Level Content Validity Index (I-CVI) for each question was calculated as the number of raters assigning a score of 3 or 4 divided by the total number of raters (n = 3). Following Lynn's criteria for three raters [6] (a worked sketch of this calculation follows the criteria below):
I-CVI = 1.00 (unanimous 3/3 ratings ≥3): Retained without revision.
I-CVI = 0.67 (2/3 ratings ≥3): Questions were revised based on feedback from the dissenting rater and re-evaluated.
I-CVI < 0.67 (≤1/3 ratings ≥3): Discarded due to inadequate relevance.
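As a minimal sketch of this calculation (the ratings below are hypothetical and for illustration only, not the study's actual rater data):

```python
# Minimal sketch of the I-CVI calculation described above.
# The ratings below are hypothetical and for illustration only.
ratings = {
    "Q1": [4, 4, 3],  # all three raters scored >=3 -> I-CVI = 1.00, retained
    "Q2": [4, 2, 3],  # two of three scored >=3   -> I-CVI = 0.67, revised
    "Q3": [2, 1, 3],  # one of three scored >=3   -> I-CVI = 0.33, discarded
}

for question, scores in ratings.items():
    relevant = sum(1 for s in scores if s >= 3)  # raters giving 3 or 4
    i_cvi = relevant / len(scores)               # divide by n = 3 raters
    print(f"{question}: I-CVI = {i_cvi:.2f}")
```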
After this process, 4 questions were eliminated, and questions with I-CVI = 0.67 were revised based on rater feedback, yielding a final set of 10 questions with acceptable validity (I-CVI ≥ 0.67). The questions were as follows:
Data Collection Procedure.
On a single day (February 25, 2025), the queries were submitted to two AI tools, ChatGPT-4o [7] and Google Gemini 2.0 Flash [8], using the Chrome browser with the cache cleared. Each query was entered into a freshly created chatbot account, ensuring that no prior responses were present in the conversation history, and each query was pasted into a separate "chat". Both tools were used in their default settings, without any fine-tuning or modification, and received identical queries to ensure consistency of inputs. The generated responses were recorded in two separate Microsoft Word documents. The responses were then blinded to their source and assessed separately by two evaluators (Evaluator 1 and Evaluator 2), both of whom had experience in geriatrics and evaluated independently to reduce bias.
Quality and reliability assessment.
Each evaluator assessed the blinded responses using the Modified DISCERN scale (Supplementary file 1) and the Global Quality Scale (GQS) (Supplementary file 2) to determine reliability and quality, respectively. The overall quality of the answers was judged using the GQS, a 5-point Likert scale on which 1 indicates the worst response quality and 5 the best [9]. DISCERN is a tool designed to help consumers of health information evaluate the quality of the content [10]. The responses' reliability was assessed using a modified version of the DISCERN scale, similar to the one used by Saji et al. [11], with higher scores indicating greater reliability. In this modified scale, each "yes" answer scores 1 and each "no" answer scores 0.
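A minimal sketch of this scoring scheme, using the five items listed in Supplementary file 1 with hypothetical yes/no answers:

```python
# Minimal sketch of the modified DISCERN scoring described above:
# each "yes" scores 1 and each "no" scores 0, so totals range from
# 0 (least reliable) to 5 (most reliable). The items are those listed
# in Supplementary file 1; the answers below are hypothetical.
response_checklist = {
    "Are the aims clear and achieved?": True,
    "Are reliable sources of information used?": False,
    "Is the information presented balanced and unbiased?": True,
    "Are additional sources of information listed for patient reference?": False,
    "Does it refer to areas of uncertainty?": True,
}
modified_discern = sum(1 for answered_yes in response_checklist.values() if answered_yes)
print(f"Modified DISCERN score: {modified_discern}/5")  # -> 3/5
```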
Readability assessment.
We evaluated the readability of the AI responses using two recognized metrics: the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade Level (FKGL) [12]. The FRES, which ranges from 0 to 100, indicates how easy a text is to read, while the FKGL, derived from the same inputs, estimates the years of American grade-school education needed to understand the text. Both metrics have been validated in the literature and are used to evaluate the readability of patient-centered medical texts in healthcare [13]. We determined the readability scores by pasting the AI responses into a free, open-access online readability calculator [14].
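The study obtained these scores from the online calculator [14]; for reference, the standard formulas behind both metrics are sketched below. The syllable counter is a crude vowel-group heuristic, so its output will differ slightly from dedicated tools:

```python
import re

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; real calculators use dictionaries."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # words per sentence
    spw = syllables / len(words)   # syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw  # Flesch Reading Ease Score
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    return fres, fkgl

fres, fkgl = readability("Dementia affects memory. Caregivers need clear, simple guidance.")
print(f"FRES = {fres:.1f}, FKGL = {fkgl:.1f}")
```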
Evaluation of similarity.
We examined the responses for plagiarism and its extent. We provided the matching software [15] with separate Word documents containing the individual responses generated by the two AI tools to all 10 questions. Turnitin generated a similarity report that included an overall similarity index (OSI): the percentage of text, excluding quoted material and references, that matched existing sources. The goal was to evaluate whether the AI tools were generating original content or merely repackaging previously published material.
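As a toy illustration of what the OSI represents (Turnitin's actual matching algorithm is proprietary, and the word counts below are hypothetical):

```python
# Toy illustration of what an overall similarity index (OSI) represents.
# Turnitin's actual matching algorithm is proprietary; this only shows
# the percentage interpretation used in its report.
def overall_similarity_index(matched_words: int, total_words: int) -> float:
    """Share of the document's words that matched existing sources,
    after excluding quoted text and the reference list."""
    return 100.0 * matched_words / total_words

# Hypothetical counts: 71 matched words in a 500-word response.
print(f"OSI = {overall_similarity_index(71, 500):.1f}%")  # -> 14.2%
```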
Statistical analysis.
The study data were arranged methodically in a Microsoft Excel spreadsheet, and statistical analysis was performed in SPSS 27. We used Cohen's kappa to assess inter-rater agreement for the modified DISCERN and GQS scores. After confirming inter-rater agreement, we compared the mean GQS score, modified DISCERN score, ease score, grade level, and similarity percentage between ChatGPT and Gemini using unpaired t-tests. A p-value of less than 0.05 was considered statistically significant.
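A minimal sketch of these two tests in Python rather than SPSS (the per-question GQS averages are taken from Supplementary file 3, while the per-rater arrays are hypothetical, since individual rater scores are not reproduced here):

```python
# Minimal sketch of the statistical analysis described above.
from scipy.stats import ttest_ind
from sklearn.metrics import cohen_kappa_score

# Inter-rater agreement (unweighted Cohen's kappa on categorical scores).
rater1 = [4, 4, 3, 4, 4, 4, 4, 4, 4, 3]  # hypothetical rater scores
rater2 = [4, 4, 4, 4, 4, 4, 4, 4, 4, 3]  # hypothetical rater scores
kappa = cohen_kappa_score(rater1, rater2)

# Unpaired (independent-samples) t-test comparing mean GQS scores;
# values are the per-question averages from Supplementary file 3.
chatgpt_gqs = [4, 4, 4, 4, 4, 4, 4, 4, 4, 3]
gemini_gqs = [3, 3, 2.5, 4, 3, 4, 4, 4, 4, 3]
t_stat, p_value = ttest_ind(chatgpt_gqs, gemini_gqs)  # equal variances assumed

print(f"kappa = {kappa:.3f}, t = {t_stat:.2f}, p = {p_value:.3f}")
```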
Ethical Considerations.
Because this research used only data produced by ChatGPT and Google Gemini and involved no human participants, ethical approval was not required.
There was moderate agreement between the two evaluators' modified DISCERN scores for the ChatGPT- and Google Gemini-generated responses (κ = 0.476, p = 0.009) and near-perfect agreement between their GQS scores (κ = 0.885, p < 0.001). (Individual scores and raw data are provided in Supplementary file 3.)
Ten approved caregiver questions about dementia were submitted to both ChatGPT and Google Gemini, and the answers were assessed on five metrics: readability (Ease Score and Grade Level), originality (similarity percentage), quality (GQS), and reliability (Modified DISCERN). An independent-samples t-test was conducted to compare the mean scores of the two AI tools. Table 1 displays the characteristics of the responses of ChatGPT and Google Gemini to caregiver queries.
Quality and reliability.
ChatGPT demonstrated a slightly higher Global Quality Scale (GQS) score (3.90) than Gemini (3.45), although this difference did not reach statistical significance (p = 0.055). For reliability, assessed using the Modified DISCERN scale, ChatGPT scored significantly lower (3.10) than Gemini (4.00); this difference was statistically significant (p < 0.001), indicating that Gemini provided more reliable information in its responses.
Readability.
The Ease Score of ChatGPT (25.74) was lower than that of Google Gemini (31.10); however, this difference was not statistically significant (p = 0.151). Similarly, the Flesch-Kincaid Grade Level was slightly lower for ChatGPT (11.43) compared to Gemini (12.16), indicating slightly simpler readability, though again the distinction was not statistically significant (p = 0.192).
Originality.
The similarity percentage was significantly lower for ChatGPT (14.20%) compared to Google Gemini (33.60%), suggesting that ChatGPT’s responses were more original. This difference was statistically significant (p = 0.023).
Table 1. Characteristics of responses generated by ChatGPT and Google Gemini

| Parameters | ChatGPT Mean (SD) | Google Gemini Mean (SD) | P value* |
| --- | --- | --- | --- |
| Ease Score | 25.74 (10.36) | 31.10 (3.69) | 0.151 |
| Grade Level | 11.43 (1.53) | 12.16 (0.69) | 0.192 |
| Similarity % | 14.20 (12.85) | 33.60 (20.54) | 0.023 |
| GQS | 3.90 (0.32) | 3.45 (0.60) | 0.055 |
| Modified DISCERN | 3.10 (0.32) | 4.00 (0.41) | <0.001 |
* P-values <0.05 are considered statistically significant.
Supplementary file 1. Modified DISCERN score.
| Item | Question |
| --- | --- |
| 1 | Are the aims clear and achieved? |
| 2 | Are reliable sources of information used? (i.e., publication cited, the responses are from valid studies/sources) |
| 3 | Is the information presented balanced and unbiased? |
| 4 | Are additional sources of information listed for patient reference? |
| 5 | Does it refer to areas of uncertainty? |
Supplementary file 2. Global quality score.
| Score | Global score description |
| --- | --- |
| 1 | Poor quality, poor flow of the site, most information missing, not at all useful for patients |
| 2 | Generally poor quality and poor flow, some information listed but many important topics missing, of very limited use to patients |
| 3 | Moderate quality, suboptimal flow, some important information is adequately discussed but others poorly discussed, somewhat useful for patients |
| 4 | Good quality and generally good flow, most of the relevant information is listed, but some topics not covered, useful for patients |
| 5 | Excellent quality and excellent flow, very useful for patients |
Supplementary file 3. Individual scores and raw data
| AI | Question No. | FRES | FKGL | Similarity % | GQS | Modified DISCERN |
| --- | --- | --- | --- | --- | --- | --- |
| ChatGPT | 1 | 29.2 | 11.2 | 41 | 4 | 3 |
| ChatGPT | 2 | 22.5 | 11.8 | 6 | 4 | 3 |
| ChatGPT | 3 | 0 | 15.3 | 27 | 4 | 3.5 |
| ChatGPT | 4 | 30.8 | 10.6 | 8 | 4 | 3 |
| ChatGPT | 5 | 21.3 | 12.2 | 23 | 4 | 3 |
| ChatGPT | 6 | 38.3 | 9.8 | 18 | 4 | 3.5 |
| ChatGPT | 7 | 30.9 | 10.6 | 7 | 4 | 3.5 |
| ChatGPT | 8 | 23.4 | 11.4 | 0 | 4 | 3 |
| ChatGPT | 9 | 31 | 10.5 | 3 | 4 | 3 |
| ChatGPT | 10 | 30 | 10.9 | 9 | 3 | 2.5 |
| Gemini | 1 | 26 | 12.5 | 69 | 3 | 4 |
| Gemini | 2 | 32.3 | 12.1 | 29 | 3 | 5 |
| Gemini | 3 | 31 | 12.6 | 52 | 2.5 | 3.5 |
| Gemini | 4 | 35.1 | 11 | 26 | 4 | 4 |
| Gemini | 5 | 23.2 | 13.5 | 42 | 3 | 3.5 |
| Gemini | 6 | 30.9 | 12.7 | 52 | 4 | 4 |
| Gemini | 7 | 33.1 | 11.8 | 8 | 4 | 4 |
| Gemini | 8 | 33 | 11.9 | 2 | 4 | 4 |
| Gemini | 9 | 33.3 | 11.7 | 29 | 4 | 4 |
| Gemini | 10 | 33.1 | 11.8 | 27 | 3 | 4 |
This study demonstrates that while both ChatGPT and Google Gemini generate clinically relevant responses to dementia caregiver queries, significant differences exist in reliability and originality. Google Gemini provided more reliable information (higher modified DISCERN scores) but exhibited greater text similarity, while ChatGPT showed superior originality at the expense of reliability. The lower reliability of ChatGPT's responses was mainly due to the absence of references. These findings contrast with those of Saji et al. [11], who reported similar reliability for ChatGPT and Google Gemini; however, that study used Google Gemini 1.5 rather than the 2.0 Flash model evaluated here, suggesting that the newer Gemini model is more reliable. The reliability advantage of Gemini mirrors findings in palliative care myth-debunking studies [16], while ChatGPT's originality aligns with its performance in generating contextually adaptive content [17].
Plagiarism in medical responses undermines trust and can lead to harmful misinformation; providing original, well-researched content and properly citing sources is essential to maintaining the integrity of medical knowledge. ChatGPT produced significantly less similar content (14.20%, p = 0.023), remaining below the commonly accepted 15-20% similarity threshold for good academic writing [18], whereas Google Gemini failed to meet this standard. The inverse relationship between originality and reliability suggests a trade-off: systems prioritizing citation accuracy may sacrifice linguistic novelty. This dichotomy underscores the need for hybrid models combining evidence-based rigor with adaptive communication strategies [19].
Notably, this study extends prior work [16,20] by quantifying readability barriers, revealing that even "high-quality" AI responses may exceed the health literacy levels of many caregivers. Readability scores did not differ significantly between the platforms, with both requiring at least an 11th-grade reading level. The American Medical Association advises that health information for patients be written at a sixth-grade reading level or below [21]; neither AI achieved this.
The findings of this study carry important implications for clinicians, caregivers, and health technology developers. Clinicians can consider leveraging AI models like Google Gemini for delivering fact-based, reliable information. For developers and policymakers, the results underscore the necessity of embedding readability optimization and sourcing transparency into AI tools targeted at caregiver support.
This study provides a comprehensive evaluation of AI-generated responses to dementia caregiver queries, comparing the performance of ChatGPT-4o and Google Gemini 2.0 Flash across readability, originality, quality, and reliability. The findings highlight distinct strengths and weaknesses of each tool: while ChatGPT demonstrated superior originality, Google Gemini offered higher reliability, albeit with less original content. Both tools, however, failed to meet the recommended sixth-grade readability level for health information, potentially limiting their accessibility to caregivers with lower health literacy.
The trade-off between originality and reliability underscores the need for AI developers to prioritize both evidence-based accuracy and adaptive communication strategies in future models. Clinicians and caregivers should be mindful of these limitations when leveraging AI tools for information, ensuring supplemental verification of critical details. Policymakers and developers must focus on enhancing readability, transparency, and cultural adaptability to make AI-generated health information more inclusive and effective.
In summary, while AI tools like ChatGPT and Google Gemini hold promise for caregiver support, their current iterations require refinement to ensure they deliver accurate, understandable, and actionable information tailored to the diverse needs of dementia caregivers.
Future research should examine how AI-generated guidance affects caregivers' decisions and well-being over time, and how well these tools serve people with different reading skills and different types of dementia. Mixed-methods studies and randomized controlled trials comparing AI-only guidance with human-AI collaboration are needed to understand the effectiveness and limits of these tools in real-life caregiving situations. Furthermore, developing hybrid models that combine evidence-based databases with adaptive communication could help resolve the trade-off between originality and reliability observed in this study.
Limitations.
This study examined only two AI tools, ChatGPT-4o and Google Gemini 2.0 Flash, omitting other models that may perform better. Furthermore, the analysis was limited to English-language content, overlooking significant barriers to generalizing the findings, including translation accuracy, non-English medical terminology, and cultural adaptation. The analysis is also constrained by its narrow focus on 10 queries and the absence of a real-world impact assessment: unlike longitudinal designs that evaluate caregiver outcomes, this cross-sectional analysis cannot determine whether AI responses lead to improved care practices or reduced caregiver burden, and no patient perspectives or real-world validations were included. Finally, we evaluated readability scores but did not investigate the gap between simplified language and true patient understanding.