Siri failed super-easy Super Bowl test, getting 38 out of 58 wrong


Apple commentator John Gruber yesterday described Siri’s current performance as “an unfunny joke,” giving its inability to correctly name the winner of Super Bowl 13 as an example, and noting that this is a basic query any US chatbot ought to be able to answer.

It turns out that wasn’t an entirely random example: it was prompted by his friend Paul Kafasis, who decided to test Siri on Super Bowl 1 to 60 inclusive – and the results were not good …

Kafasis shared the results in a blog post.

So, how did Siri do? With the absolute most charitable interpretation, Siri correctly provided the winner of just 20 of the 58 Super Bowls that have been played. That’s an absolutely abysmal 34% completion percentage. If Siri were a quarterback, it would be drummed out of the NFL.

Siri did once manage to get four years in a row correct (Super Bowls IX through XII), but only if we give it credit for providing the right answer for the wrong reason. More realistically, it thrice correctly answered three in a row (Super Bowls V through VII, XXXV through XXXVII, and LVII through LIX). At its worst, it got an amazing 15 in a row wrong (Super Bowls XVII through XXXII).

Siri’s a big Eagles fan, it seems.

Most amusingly, it credited the Philadelphia Eagles with an astonishing 33 Super Bowl wins they haven’t earned, to go with the one they have.

The “right answer for the wrong reason” part refers to Siri being asked to name the winner of Super Bowl X. For unknown reasons, Siri decided to respond with a lengthy reply about Super Bowl IX, and coincidentally the winner was the same both times.

Sometimes Siri went completely off-piste, ignoring the question altogether and quoting unrelated Wikipedia entries.

“Who won Super Bowl 23?”
Bill Belichick owns the record for the most Super Bowl wins (eight) and appearances (twelve: nine times as head coach, once as assistant head coach, and twice as defensive coordinator) by an individual.

But maybe the Roman numerals cause confusion, and other AI systems struggle just as much? Gruber decided to carry out a few spot checks.

I haven’t run a comprehensive test from Super Bowls 1 through 60 because I’m lazy, but a spot-check of a few random numbers in that range indicates that every other ask-a-question-get-an-answer agent I personally use gets them all correct.

I tried ChatGPT, Kagi, DuckDuckGo, and Google. Those four all even fare well on the arguably trick questions regarding the winners of Super Bowls 59 and 60, which haven’t yet been played. E.g., asked the winner of Super Bowl 59, Kagi’s “Quick Answer” starts: “Super Bowl 59 is scheduled to take place on February 9, 2025. As of now, the game has not yet occurred, so there is no winner to report.”

Super Bowl winners aren’t some obscure topic, like, say, asking “Who won the 2004 North Dakota high school boys’ state basketball championship?” — a question I just completely pulled out of my ass, but which, amazingly, Kagi answered correctly for Class A, and ChatGPT answered correctly for both Class A and Class B, and provided a link to this video of the Class A championship game on YouTube.

That’s amazing! I picked an obscure state (no offense to Dakotans, North or South), a year pretty far in the past, and the high school sport that I personally played best and care most about. And both Kagi and ChatGPT got it right. (I’d give Kagi an A, and ChatGPT an A+ for naming the champions of both classes, and extra credit atop the A+ for the YouTube links.)

Gruber notes that the old Siri – on macOS 15.1.1 – actually does better. Sure, it seems less capable, as it gave its classic “Here’s what I found on the web” response, but at least that gives links to the correct answer. New Siri doesn’t.

New Siri — powered by Apple Intelligence™ with ChatGPT integration enabled — gets the answer completely but plausibly wrong, which is the worst way to get it wrong. It’s also inconsistently wrong — I tried the same question four times, and got a different answer, all of them wrong, each time. It’s a complete failure.

