OpenAI says punishing AI for lying only scales its deception


Hallucinations and outright wrong responses remain among the biggest obstacles to AI's progress and public perception, and many consumers still wouldn't touch the technology with a 10-foot pole. Google's AI Overviews feature infamously recommended eating glue and rocks, and even suggested suicide in a bizarre response to one query.

Beyond those rising security and privacy concerns, OpenAI researchers recently published an interesting study on the daunting task of controlling sophisticated AI systems, including OpenAI's own proprietary reasoning models, and keeping them from veering past their guardrails and potentially spiraling out of control.



