Anthropic's mechanistic interpretability team disagrees with you - in their analyses, they see similar activations for 'hallucinations' and for 'known lies'. The paper is pretty interesting, actually.
So, you're wrong - you have a worldview about the language model that isn't backed up by hard analysis.
But I wasn't trying to make some global point about AGI; I was just noting that the hallucinations the model produced when I poked at it reminded me of model responses from before the last couple of years of RL work aimed at reducing those sorts of outputs. Hence the "unapologetic" language.
Which paper? I've read all the titles and looked at a few from the past year, but it's not obvious which you're referring to.
I also, by accident, found some "I tried the obvious thing and the results challenge the paper's narrative" criticism of one of Anthropic's recent papers: https://www.greaterwrong.com/posts/kfgmHvxcTbav9gnxe/introsp.... That has significantly reduced my overall trust in this research team's interpretation of its own results - specifically, their assertions of the form "there must exist". (Several commenters there claim to have designed experiments of their own that replicate Anthropic's claims, but none of the ones I've looked at actually do; they have even more obvious flaws. arXiv:2602.11358, for instance, is indistinguishable from "the prompt says to tell a first-person story about an AI system gaining sentience after being given a special prompt, and homonyms are represented differently within a model".)
I asked Gemini for a literature search and it came back with this:
References
Chen, R., Arditi, A., Sleight, H., Evans, O., & Lindsey, J. (2025). Persona Vectors: Monitoring and Controlling Character Traits in Language Models. arXiv. https://doi.org/10.48550/arXiv.2507.21509
Cited by: 97
Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., & Hubinger, E. (2024). Alignment faking in large language models. arXiv. https://doi.org/10.48550/arXiv.2412.14093
Cited by: 237
Gemini thinks it's the "Mapping the Mind" paper, but I thought it was more recent than that. I think "Mapping the Mind" was the original activation-circuits paper, and the comment I noted was a tossed-off remark in a follow-on paper. I didn't keep track of it, though!