AI chatbots can’t be trusted, study proves – but Apple made a good choice


If there’s one piece of advice that bears repeating about AI chatbots, it’s this: “Don’t use them to seek factual information – they absolutely cannot be trusted to be right.”

A new study demonstrated the extent of the problem – but also showed that Apple made a good choice in partnering with OpenAI’s ChatGPT for queries Siri can’t answer …

There are two well-known problems with trying to use LLMs like ChatGPT, Gemini, and Grok as a substitute for web searches:

  • They are very often wrong
  • They are very often very confident about their incorrect information

A study cited by the Columbia Journalism Review found that, even when you prompt a chatbot with an exact quote from a piece of journalism and ask for more details, most of them are wrong most of the time.

The Tow Center for Digital Journalism tested eight AI chatbots that claim to carry out live web searches to get their facts:

  • ChatGPT
  • Perplexity
  • Perplexity Pro
  • DeepSeek
  • Microsoft’s Copilot
  • Grok-2
  • Grok-3
  • Gemini

The simple task given to the chatbots

The study presented each system with a quote from an article and asked it to carry out a simple task: find that article online and provide a link to it, together with the headline, original publisher, and publication date.

To ensure the task was achievable, the study’s authors deliberately chose excerpts that could easily be found via a Google search, with the original source appearing in the first three results.

The chatbots were rated by whether they were completely correct, correct but with some of the requested information missing, partly incorrect, completely incorrect, or could not answer.

They also noted how confidently the chatbots presented their results. For example, did they just present their answers as fact, or did they use qualifying phrases like “it appears” or include an admission that they couldn’t find an exact match for the quote?

The results were not good

First, most of the chatbots were partly or wholly incorrect most of the time!

On average, the AI systems were correct less than 40% of the time. The most accurate was Perplexity, at 63%, and the worst was X’s Grok-3, at just 6%.

Other key findings were:

  • Chatbots were generally bad at declining to answer questions they couldn’t answer accurately, offering incorrect or speculative answers instead. 
  • Premium chatbots provided more confidently incorrect answers than their free counterparts.
  • Multiple chatbots seemed to bypass Robots Exclusion Protocol preferences.
  • Generative search tools fabricated links and cited syndicated and copied versions of articles. 
  • Content licensing deals with news sources provided no guarantee of accurate citation in chatbot responses.

But Apple made a good choice

While Perplexity’s performance was the best, this appears to be because it cheats. Web publishers can use a robots.txt file on their sites to tell AI chatbots whether or not they may access the site. National Geographic is one publisher that tells them not to search its site, yet the report says Perplexity correctly identified all 10 quotes, despite the fact that the articles were paywalled and the company had no licensing deal in place.
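For context, the Robots Exclusion Protocol works through a plain-text robots.txt file placed at the root of a site. Here’s a minimal sketch of the kind of directives involved – GPTBot and PerplexityBot are the crawler names OpenAI and Perplexity publicly document, though any given publisher’s actual file will differ:

    # Ask OpenAI's crawler to stay away from the entire site
    User-agent: GPTBot
    Disallow: /

    # Ask Perplexity's crawler to do the same
    User-agent: PerplexityBot
    Disallow: /

Compliance is entirely voluntary – the protocol is an honor system with no technical enforcement – which is why a chatbot that ignores it can still retrieve content its publisher has opted out of.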

Of the rest, ChatGPT delivered the best results – or, more accurately, the least-worst ones.

All the same, the study certainly demonstrates what we already knew: use chatbots for inspiration and ideas, but never to get answers to factual questions.
