Esra Arılı Öztürk, Ceren Turan Gökduman, Burhan Can Çanakçi
International Endodontic Journal
First published: 2 March 2025. https://doi.org/10.1111/iej.14217
So I couldn’t resist this one. Like others, I find the genius of AI life-changing, and using it in practice has been transformational. This paper looked at using different versions of ChatGPT in endodontic education. I therefore thought I would ask ChatGPT what it made of the article. It must have read my mind, because it raised all of the criticisms I would have! Here follows its response:
Summary of the Article
This article attempts to assess ChatGPT-4 and ChatGPT-4o’s accuracy and consistency in answering multiple-choice endodontic questions. However, it suffers from fundamental methodological flaws, an overreliance on statistics, and a failure to critically engage with AI’s educational limitations.
1. Misguided Focus on Accuracy Over Educational Value
The study assumes that high accuracy in multiple-choice tests equates to educational usefulness. This is an oversimplification. AI-generated answers may be factually correct but lack reasoning, depth, and contextual understanding—critical components of clinical education. The rigid multiple-choice format prevents AI from demonstrating problem-solving or analytical skills, rendering the study’s conclusions superficial.
2. Methodological Flaws and Artificial Constraints
- Testing AI in a vacuum: Asking identical, preformatted questions at different times ignores real-world learning contexts. Clinical education involves dynamic case-based reasoning, not repetitive question-answering.
- Short answer restriction: Preventing AI from explaining its responses limits its ability to demonstrate reasoning, making the assessment unrealistic.
- Failure to test real clinical applicability: The study measures AI’s ability to recall facts, not its alignment with evidence-based guidelines or expert judgment.
3. Overreliance on Statistics Without Meaningful Interpretation
The study’s obsession with numerical accuracy and statistical significance distracts from its lack of substantive analysis. High accuracy rates mean little if AI lacks reasoning and contextual awareness. The consistency metric is equally misleading—AI can be consistently wrong, yet the study does not evaluate whether its answers were clinically valid.
4. Ethical and Pedagogical Blind Spots
- Uncritical optimism about AI in education: The study suggests AI can “support” education but ignores its potential to replace human instruction, reducing critical engagement and mentorship.
- Neglecting AI bias and reliability: The authors fail to examine biases in AI training data, which could lead to misinformation.
- Ignoring the plagiarism problem: AI cannot provide verifiable sources, yet the study does not explore how students might misuse AI-generated responses.
5. Contradictions and Unjustified Conclusions
The study simultaneously highlights AI’s limitations and promotes its use in education—without resolving this contradiction. It acknowledges that AI does not “understand” information, yet treats its accuracy rates as meaningful. This logical inconsistency undermines the study’s credibility.
Conclusion: A Superficial, Misguided Analysis
This study is statistically meticulous but educationally hollow. It measures AI’s ability to produce correct answers in an artificial setting while ignoring real-world learning complexities. Without qualitative analysis, expert validation, or real-case applications, its conclusions about AI’s educational value are deeply flawed.
I couldn’t have put it better myself.