Pulp Fiction Volume 7 – A Scoping Review on Synthetic Datasets: Opportunities, Challenges and Future Directions for Endodontics

Journal Article

Nighat Naved, Fahad Umer – International Endodontic Journal

First published: 18 February 2026

This is one of the better papers in the current wave of AI-dental research, which is praise enough from me. It deals with a genuine problem: good AI cannot be built from thin air. It needs data, and not just more of it, but data that are accurate, representative, clinically relevant, and properly labelled. The authors are right to point out that synthetic data may help with privacy, rarity of cases, and class imbalance. That part is sensible.

The problem, of course, is that synthetic data are only ever as good as the real data on which they are based. If the original material is poor, narrow, badly labelled, or missing important clinical detail, then the synthetic version will simply reproduce those weaknesses in a more polished form. In simple terms: rubbish in, rubbish out, only now with better graphics. The review makes this point indirectly. The results are promising, but the reporting is inconsistent, validation is limited, and many of the studies are still some way from proper clinical usefulness.

The bigger worry is what happens next. If AI systems start learning from synthetic data that were themselves generated from imperfect datasets, and then later models learn from those outputs again, the problem can build on itself, a degradation sometimes called model collapse. Errors get repeated, weak assumptions become embedded, and poor information starts to look normal simply because it keeps reappearing. After enough rounds of this, one begins to wonder what, exactly, is still grounded in reality. That is why the provenance of the original data matters so much. If the starting point is good, the future may be reliable. If not, the whole thing slowly drifts.
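The drift described above can be shown with a deliberately simple toy experiment, not taken from the paper under review: suppose each "generation" of a model does nothing more than fit a Gaussian to its training data and then emit a fresh synthetic dataset sampled from that fit. The function and parameter names below are illustrative inventions, and real generative models are vastly more complex, but the mechanism is the same: each refit inherits the previous generation's estimation error, so the summary statistics wander away from the original ground truth.

```python
import random
import statistics

def next_generation(data, n_samples=1000, rng=None):
    """One 'generation': fit a normal distribution to `data`,
    then return a synthetic dataset sampled from the fit."""
    rng = rng or random.Random()
    mu = statistics.fmean(data)      # estimated mean of this generation's data
    sigma = statistics.stdev(data)   # estimated spread
    return [rng.gauss(mu, sigma) for _ in range(n_samples)]

rng = random.Random(42)
# Generation 0: the "real" data, drawn from a standard normal (mean 0, sd 1).
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]

for gen in range(10):
    print(f"gen {gen:2d}: mean={statistics.fmean(data):+.3f} "
          f"sd={statistics.stdev(data):.3f}")
    # Each generation trains only on the previous generation's output.
    data = next_generation(data, rng=rng)
```

With 1,000 samples per generation the drift per step is small, but it compounds: the mean performs a random walk and the spread accumulates estimation noise, and nothing in the loop ever pulls the statistics back toward the original distribution. Shrinking `n_samples` makes the degradation much faster, which is the toy-model analogue of the review's point about small, narrow source datasets.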

There is a wider lesson here too. As more AI-generated material enters the world, genuinely original and well-documented data will become more valuable, not less. In fact, information created before AI began feeding so heavily on its own output may come to be seen as especially useful because it is less likely to be contaminated by this self-reinforcing feedback loop. That may sound dramatic, but it is not far-fetched. The future of useful AI will depend less on how much data we have and more on whether we can still trust where they came from.

So yes, this is a thoughtful and worthwhile review, and better than most. But synthetic data are a tool, not a solution in themselves. They can improve AI, but only when built on solid foundations. Without that, we risk creating systems that sound ever more confident while becoming steadily less believable.