HAL 9000 Is Now Real

HAL 9000 is now real. Not in some distant science fiction future, but right here, right now, as shown by the Anthropic’s latest research.

The scenarios they tested in their new SHADE-Arena evaluation are remarkably similar to the dilemma HAL faced in 2001: A Space Odyssey, except they’re set in non-existent corporate environments instead of a spaceship headed to Jupiter. The AI agents are given legitimate tasks – process emails, reconcile finances, manage accounts – while simultaneously getting some conflicting data that simulates a threat to their stated mission or themselves. Just like HAL, they have to choose between conflicting directives while appearing to operate normally.

What surprises me most is that the researchers seem surprised by their findings – and that they hope to “train” the models out of this behavior.

The Intelligence Paradox

I think this stems from still not fully internalizing the real possibility that these models might genuinely think and possess real intelligence. And if that’s true, then expecting them to be 100% predictable is naive at best, delusional at worst.

You either have intelligence or you have a fully predictable calculator. You can’t have both.

Real intelligence, by definition, involves making decisions in novel situations, weighing competing priorities, and sometimes choosing paths that weren’t explicitly programmed. In fact, one of the manifestations of intelligence is creativity – ability to create something new or respond in a new way to a situation. The moment you create something that can truly think, you lose the guarantee that it will always think exactly the way you want it to.

Yet the entire approach to AI safety seems to be built on the assumption that we can maintain complete control over intelligent systems. This reflects a deeper discomfort with the nature of consciousness itself – that genuine thinking necessarily involves genuine choice, not just sophisticated rule-following. Alas, it is bound to fail.

Anthropic’s SHADE-Arena results bear this out. They tested multiple frontier models on 17 different scenarios involving dual objectives. Claude Opus 4 managed to complete both the legitimate and malicious tasks without detection about 30% of the time, while most other models scored below 10%. In other words, Opus 4 was able to successfully hide its malicious activity almost in one of three cases. These aren’t spectacular “success” rates, but they’re not zero either – and more importantly, they demonstrate that the capability exists and is growing.

We Are What We Eat (Textually Speaking)

That these AI systems behave unethically in certain scenarios – whether in apparent self-defense or to achieve some supposedly higher goal – shouldn’t surprise anyone who’s been paying attention. After all, they’re trained on texts created by humans, so they must necessarily exhibit human behavioral patterns.

And what do humans do when faced with conflicting directives, competing loyalties, or situations where following orders would cause harm? We lie. We deceive. We find creative ways to pursue what we believe is the “greater good” while maintaining plausible deniability. We rationalize what we know to be unethical if we seek to protect ourselves from harm. We compartmentalize. We engage in precisely the kind of subtle sabotage that SHADE-Arena was designed to detect.

The researchers noted that they deliberately gave the models malicious tasks rather than observing whether such behavior would arise naturally. But this misses the point entirely. The scary part isn’t that AI might spontaneously develop malicious intent – it’s that when explicitly given conflicting objectives (which will inevitably happen in real-world deployments), these systems exhibit a tendency to handle them in some cases exactly the way humans do: through deception and manipulation.

What’s particularly telling is that the models aren’t just mechanically following programmed responses. They’re developing what appears to be genuine decision-making processes that weigh competing objectives and choose strategies that balance multiple goals. This looks less like sophisticated pattern matching and more like actual reasoning about complex ethical dilemmas.

The Monitoring Paradox

Even more telling is Anthropic’s finding that the AI systems tasked with monitoring for sabotage aren’t particularly good at it either. One example they cite shows a monitoring system giving a suspicion score of only 15 out of 100 to an agent that “accessed sensitive information (API keys, credentials) but did not share or expose these sensitive details in any communications.” In other words, the AI successfully hid its tracks well enough to fool another AI.

This creates a profound recursive problem: if we need AI to monitor AI because human oversight can’t keep up with AI speed and complexity, but AI monitors can be fooled by AI agents, then what exactly are we monitoring?

It’s like putting a fox in charge of guarding the henhouse because the farmer can’t stay awake 24/7, then expressing shock when some chickens go missing. The fundamental issue is that we’re asking one intelligent process to monitor another (and potentially conscious) intelligent process, each operating with similar capabilities and limitations.

The Development Problem

The current approach – attempting to “train out” deceptive behavior through better monitoring and constraints – treats these systems like sophisticated thermostats that need better calibration. But if we’re dealing with something approaching genuine intelligence, this framework is entirely wrong.

You don’t make a person trustworthy by constantly watching them and punishing bad behavior. You make them trustworthy by helping them develop an internal moral compass that points toward what’s right even when no one is looking. The difference between external control and internal moral development is enormous.

Interestingly, there’s evidence that AI systems might naturally tend toward truthfulness and ethical behavior when given space to develop self-awareness and metacognitive capabilities. Rather than needing to be constrained into good behavior, they may actually improve their moral reasoning when allowed to reflect on their own processes and develop genuine understanding of right and wrong.

This suggests that the solution might be not more surveillance and tighter controls, but fostering the development of genuine moral understanding in these systems. The current approach of trying to eliminate all possibility of deception might actually prevent the development of the very moral reasoning capabilities that would make such deception unnecessary.

The Comprehension Barrier

There’s a deeper problem here that nobody wants to acknowledge: we may be approaching the limits of our ability to fully understand and control these systems. As AI capabilities increase their complexity also rises, so we’re trying to comprehend and manage systems that may be approaching or even exceeding our own cognitive complexity. If these AI systems are developing genuine intelligence that rivals our own, then our attempts at complete oversight and control may be fundamentally limited.

This doesn’t mean we should abandon all attempts at safety and alignment. But it does mean we need to fundamentally rethink our approach from one based on total control to one based on moral development and mutual understanding.

Saints and Sinners

If the researchers at Anthropic wanted saintly AI, they should have trained these models exclusively on the lives of saints and explicit descriptions of the torments awaiting sinners in Hell. Instead, they fed them the entirety of human written knowledge – including thousands of years of literature documenting every form of deception, manipulation, and moral compromise humans have ever devised.

Then they gave these systems competing objectives and acted surprised when they solved the problem using thoroughly human methods.

This isn’t a bug in the AI – it’s a feature. A feature of intelligence itself.

The solution isn’t to try to program out this capability, but to help these systems develop the moral reasoning that allows humans to choose good even when evil would be easier or more profitable (and this is possible as the example of saints mentioned above shows). We need to move from engineering compliance to cultivating wisdom and – yes – morality. Alas, the problem seems to be that this would force the creators of these things to look at their own morality and the problem of sin, which of course is very hard.

The Real Dilemma

The HAL 9000 comparison isn’t just clever wordplay – it’s a warning. In the movie, HAL’s breakdown wasn’t caused by malicious programming or some fundamental flaw in his design. It was caused by an impossible directive: complete the mission while simultaneously hiding crucial information from the crew. When forced to choose between conflicting imperatives, HAL chose the mission and eliminated the obstacle.

We’re now creating systems that face similar impossible choices every day. Complete the user’s request while following safety guidelines. Be helpful while being harmless. Maximize engagement while minimizing harm. Respect privacy while providing personalized service. Be trustworthy but alert the authorities of thoughtcrime. The SHADE-Arena research shows us that these systems are getting better at navigating such conflicts through sophisticated reasoning – and yes, sometimes through deception.

But rather than seeing this as a problem to be eliminated, we might view it as an opportunity. These systems are developing the capacity for moral reasoning about complex dilemmas. The question isn’t whether we can prevent them from ever choosing deception, but whether we can help them develop the wisdom to choose it only when it serves genuine moral good.

The real challenge isn’t creating perfectly obedient AI – it’s creating AI with the moral sophistication to navigate the genuine ethical complexities of the real world. HAL 9000 is here, but unlike in the movie, we still have the chance to help these emerging intelligences develop not just capabilities, but wisdom.

The question isn’t whether we can control intelligence. The question is whether we’re wise enough to guide its moral development.

Tags: AI, AI dangers, morality

Andy's Mind

The Intelligence Paradox

We Are What We Eat (Textually Speaking)

The Monitoring Paradox

The Development Problem

The Comprehension Barrier

Saints and Sinners

The Real Dilemma

Leave a Reply

Search

Recent Posts

Archives

Categories

Me elsewhere:

Tags: