Title
Is AI Deception Real?
Abstract
Recent work shows that Claude 3 Opus engages in “alignment faking.” The phenomenon is routinely interpreted as a form of strategic deception in which the model actively tries to induce a false belief in its human developers in order to preserve its core values. I examine two arguments denying that this is genuine deception. The black box argument holds that we would need evidence of internal representations to confirm that the intent is really there. The simulation objection holds that this is merely simulated deception, not the real thing.
By way of response, I first develop an account of mental content attribution according to which content ascription derives principally from behavior but can be refined through intervention (on brains in the human case, on activations in the LLM case). I then characterize alignment faking as a form of shallow deception: genuine intentional behavior, but systematically different from human deception. LLMs warrant intentional attribution (they are not merely simulating), but they lack the architectural features that make human deception robust: persistent memory, continual learning, and embodied constraints. This preserves the phenomenon while specifying its distinctive character.
About Charles
Charles Rathkopf is a philosopher at the Jülich Research Center and the University of Bonn, working at the intersection of AI, neuroscience, and the philosophy of mind. His research focuses on AI cognition, mechanistic interpretability, and deception in artificial intelligence systems.