A new study by the European Broadcasting Union (EBU) and the BBC found that leading AI assistants, including ChatGPT, Google’s Gemini, Microsoft’s Copilot, and Perplexity, gave faulty responses to nearly half of the news questions they were asked.
Researchers tested more than 2,700 answers in 18 countries and 14 languages. The results were sobering: 45 percent of all answers had at least one “significant issue.”
The biggest problem? Bad sourcing, found in 31 percent of responses, including incorrect citations, fabricated references, and unverifiable claims. Another 20 percent of answers were simply inaccurate, while 14 percent lacked necessary context.
Among the models tested, Gemini performed the worst, with 76 percent of its responses showing major sourcing flaws.
Researchers documented glaring factual errors across all platforms. One system, Perplexity, incorrectly claimed that surrogacy is illegal in the Czech Republic. Another, ChatGPT, listed Pope Francis as the sitting pontiff months after his death.
Tech companies did not immediately respond to requests for comment.
In the study’s foreword, Jean Philip De Tender, the EBU’s deputy director-general, and Pete Archer, the BBC’s head of AI, said tech firms had failed to treat accuracy as a core issue:
“They have not prioritised this issue and must do so now. They also need to be transparent by regularly publishing their results by language and market.”
Experts warn that the stakes go far beyond embarrassing errors.