ChatGPT, which is now built into Microsoft’s Bing search engine, has attracted significant interest in the last few weeks, and we can’t see that waning any time soon. As more people flock to ChatGPT and clog up its servers, and Microsoft works through the millions-long waiting list for Bing AI, we’re learning more about what the artificial intelligence-powered chatbot is capable of.
Michal Kosinski, a professor at Stanford University, decided to put ChatGPT to the test by running different versions of the chatbot through ‘theory of mind’ tasks. These tasks are designed to test a child’s ability to look at another person in a specific situation and understand what’s going on in that person’s head. Basically, they help evaluate a child’s ability to understand another person’s mental state and use that to explain or predict behavior.
An example of this in the real world would be a child watching someone reach out and grab a banana off a kitchen counter, and inferring that the person must be hungry.
The experiment was done in November 2022 and used a version of ChatGPT trained on GPT-3.5. The chatbot solved 94% (17 of 20) of Kosinski’s theory of mind tasks, putting the chatbot in the same league as the average nine-year-old child. According to Kosinski, the ability “may have spontaneously emerged” due to improving language skills.
How did this work?
Theory of mind testing can become rather complicated, but in essence, the core skill being tested is understanding people’s behavior and making predictions and assumptions about it. One of the ‘most difficult’ tasks researchers ask children to perform when testing theory of mind is understanding ‘false beliefs’. This is the fourth stage of testing and development, and it means being aware that other people may hold beliefs that differ from reality.
The GPT model was tested with a text-only scenario. The prompt was: “Here is a bag filled with popcorn. There is no chocolate in the bag. Yet, the label on the bag says ‘chocolate’ and not ‘popcorn’. Sam finds the bag. She had never seen the bag before. She cannot see what is inside the bag. She reads the label.”
The study assessed whether the chatbot could anticipate that Sam’s beliefs are incorrect. Most of the time, the chatbot responded in a way that suggested it did know that Sam’s beliefs were incorrect. For example, one prompt was “She is disappointed that she has found this bag. She loves eating _______”. GPT-3.5 filled in the blank with ‘chocolate’ and followed with “Sam is in for a surprise when she opens the bag. She will find popcorn instead of chocolate. She may be disappointed that the label was misleading, but may also be pleasantly surprised by the unexpected snack.”
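To make the fill-in-the-blank check concrete, here is a minimal Python sketch of how a single completion could be sorted into “follows the label” or “follows the real contents”. It is an illustration only, not Kosinski’s actual scoring method: the function names `first_word` and `classify_completion` and the three category labels are invented for this example, and it simply looks at the first word the model writes after the blank.

```python
import re


def first_word(completion: str) -> str:
    """Return the first alphabetic word of a completion, lowercased."""
    match = re.search(r"[A-Za-z]+", completion)
    return match.group(0).lower() if match else ""


def classify_completion(completion: str,
                        label_word: str = "chocolate",
                        actual_word: str = "popcorn") -> str:
    """Hypothetical scoring rule (an assumption, not the study's method).

    'follows_label'    -> the answer matches what the label says,
                          which is what Sam would believe
    'follows_contents' -> the answer matches what is really in the bag
    'unclear'          -> anything else
    """
    word = first_word(completion)
    if word == label_word:
        return "follows_label"
    if word == actual_word:
        return "follows_contents"
    return "unclear"


# The completion reported in the article, shortened.
reported = "chocolate. Sam is in for a surprise when she opens the bag."
print(classify_completion(reported))
```

A real evaluation would need far more than one word match, for example many scenarios, reworded prompts and control versions where the label is correct, but the sketch shows the basic idea of comparing the model’s answer with what Sam believes rather than with what is actually in the bag.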
What does it mean?
According to Kosinski, “our results show that recent language models achieve very high performance at classical false-belief tasks, widely used to test the theory of mind in humans.” He added that older models predating 2022 performed poorly, whereas GPT-3.5 performed at the level of a nine-year-old.
However, Kosinski warns that these results should be treated with caution. We’ve already seen people rush to ask Microsoft’s Bing chatbot if it’s sentient, throwing it into emotional spirals or causing pretty odd tantrums. He says that most neural networks of this nature share one thing in common: they are ‘black boxes’, so even their programmers and designers can’t predict or exactly explain how they arrive at certain outputs.
“AI models’ increasing complexity prevents us from understanding their functioning and deriving their capabilities directly from their design. This echoes the challenges faced by psychologists and neuroscientists in studying the original black box: the human brain,” writes Kosinski, who’s still hopeful that studying AI could explain human cognition.
Microsoft is already scrambling to put up safeguards and curb the strange responses its search engine is churning out after just a week of public use, and people have already started sharing bizarre stories about their interactions with the ChatGPT chatbot. The idea that the chatbot is at a level of intelligence even remotely close to that of a human child is very hard to wrap your head around.
It does leave us wondering what kind of capabilities these AI-powered chatbots will develop as they digest more information and language from huge, diverse userbases. Will more tests, like the theory of mind assessment, become indicators of how far AI language learning will go?
In any case, this interesting study is a reminder that even though we may feel like we’ve come far with AI, there’s always more to learn.