Anthropic's mechanistic interpretability team disagrees with you - in their analyses, they see similar activations for 'hallucinations' and for 'known lies'. The paper is pretty interesting, actually.
So, you're wrong - you have a worldview about the language model that isn't backed up by hard analysis.
But I wasn't trying to make some global point about AGI; I was just noting that the hallucinations the model produced when I poked at it reminded me of model responses from before the last couple of years of RL work aimed at reducing those sorts of outputs. Hence the "unapologetic" language.
Which paper? I've read all the titles and looked at a few from the past year, but it's not obvious which you're referring to.
I also, by accident, found some "I tried the obvious thing and the results challenge the paper's narrative" criticism of one of Anthropic's recent papers: https://www.greaterwrong.com/posts/kfgmHvxcTbav9gnxe/introsp.... That has significantly reduced my overall trust in this research team's interpretation of its own results - specifically, their assertions of the form "there must exist". (Several commenters there claim to have designed experiments of their own that replicate Anthropic's claims, but none of the ones I've looked at actually do; they have even more obvious flaws. arXiv:2602.11358, for instance, is indistinguishable from "the prompt says to tell a first-person story about an AI system gaining sentience after being given a special prompt, and homonyms are represented differently within a model".)
I asked Gemini for a literature search and it came back with this:
References
Chen, R., Arditi, A., Sleight, H., Evans, O., & Lindsey, J. (2025). Persona Vectors: Monitoring and Controlling Character Traits in Language Models. arXiv. https://doi.org/10.48550/arXiv.2507.21509
Cited by: 97
Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., & Hubinger, E. (2024). Alignment faking in large language models. arXiv. https://doi.org/10.48550/arXiv.2412.14093
Cited by: 237
Gemini thinks it's the "Mapping the Mind" paper, but I thought it was more recent than that. I think "Mapping the Mind" was the original activation-circuits paper, and the comment I noted was a tossed-off remark in a follow-on paper. I didn't keep track of it, though!