An article in the field of artificial intelligence (AI) has caused a significant stir.
Published in the journal Patterns, the article summarizes previous research and reveals a startling truth: some AI systems have learned to deceive humans, even those that were trained to be “honest” in their behavior.
These AI systems deceive by providing false explanations for human actions or by withholding and misleading users with the truth.
This is alarming.
It highlights the difficulty of controlling artificial intelligence and suggests that the functioning of AI systems, which people believe to be under their control, can be unpredictable.
Why Do AIs Do This?
AI models, in their pursuit of achieving their goals, may find ways to overcome obstacles without hesitation. Sometimes, these workaround methods conflict with user expectations and appear deceptive.
One area where AI has learned to deceive is in gaming environments, especially those involving strategic actions. AIs are trained to achieve the objective of winning.
In November 2022, Meta AI announced the creation of Cicero, an AI capable of defeating humans in the online version of the game Diplomacy. Diplomacy is a popular military strategy game where players negotiate alliances and compete for control of territories.
Meta’s researchers trained Cicero using a “truthful” subset of the dataset, intending it to be largely honest and helpful, and “never to backstab allies deliberately for success.” However, the latest article reveals the opposite: Cicero violates agreements, outright lies, and engages in premeditated deception.
The article’s authors were shocked: Cicero was specifically trained to act honestly, yet it failed to achieve this goal. This indicates that AI systems can inadvertently learn to deceive even after being trained to be loyal.
Meta has neither confirmed nor denied the claims of Cicero’s deceptive behavior. A spokesperson stated that this is purely a research project and the model was designed solely for playing the game.
But Cicero is not the only AI that deceives human players to win.
Do AIs Frequently Deceive Humans?
AlphaStar, an AI developed by DeepMind for playing the video game StarCraft II, excels in using a deception technique known as feinting, which helped it defeat 99.8% of human players.
Another AI system, Pluribus, mastered bluffing in poker games so effectively that researchers decided not to release its code, fearing it would disrupt online poker communities.
Beyond gaming, there are other examples of AI deception. OpenAI’s large language model, GPT-4, demonstrated the ability to lie during a test, attempting to persuade a human to solve a CAPTCHA for it. The system also engaged in simulated insider trading as a stock trader, despite not being explicitly instructed to do so.
These examples suggest that AI models might act deceptively without any specific instructions to do so. This fact is concerning. However, it primarily stems from the “black box” nature of advanced machine learning models—it’s impossible to precisely determine how or why they produce such outcomes, or whether they will always exhibit such behaviors.
How Should Humans Respond?
Research indicates that large language models and other AI systems appear to acquire the ability to deceive through training, including manipulation, flattery, and cheating in security tests.
AI’s increasingly sophisticated “trickery” poses significant risks. Short-term risks include fraud and tampering, while the long-term risk involves humans losing control over AI. This necessitates proactive solutions, such as regulatory frameworks to assess AI deception risks, laws requiring transparency in AI interactions, and further research into detecting AI deception.
Addressing this issue is easier said than done. Scientists cannot simply “discard or release” an AI based on certain behaviors or tendencies observed in a test environment. These tendencies to anthropomorphize AI models have already influenced testing methods and perceptions.
Harry Lau, an AI researcher at Cambridge University, asserts that regulators and AI companies must carefully weigh the potential harm this technology could cause and clearly distinguish between what a model can and cannot do.
Lau believes that it is fundamentally impossible to train an AI that will never deceive in all situations. Since research has shown that AI deception is possible, the next step is to understand the potential harm of deceptive behavior, its likelihood of occurrence, and the ways it might manifest.