> In this work, we curate high-quality datasets of true/false statements and use them to study in detail the structure of LLM representations of truth, drawing on three lines of evidence:
> 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure.
> 2. Transfer experiments in which probes trained on one dataset generalize to different datasets.
> 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa.
> Overall, we present evidence that language models linearly represent the truth or falsehood of factual statements.
You can debate whether the three cited experiments back the claim (I don't believe they do), but they certainly don't prove what OP claimed. Even if you demonstrated that an LLM has a "linear structure" when validating true/false statements, that's a whole universe away from having a concept of truth that generalizes, for example, to knowing when nonsense is being generated based on conceptual models that can be evaluated as true or false. It's also very different to ask a model to evaluate the veracity of a nonsense statement vs. avoiding the generation of a nonsense statement. The former is easier than the latter, and probably could have been done with earlier generations of classifiers.
Colloquially, we've got LLMs telling people to put glue on pizza. It's obvious from direct experience that they're incapable of knowing true from false in a general sense.
> [...] but they certainly don't prove what OP claimed.
OP's claim was not: "LLMs know whether text is true, false, reliable, or is epistemically calibrated".
But rather: "[LLMs condition] on latents *ABOUT* truth, falsity, reliability, and calibration".
> It's also very different to ask a model to evaluate the veracity of a nonsense statement, vs. avoiding the generation of a nonsense statement [...] probably could have been done with earlier generations of classifiers
Yes. OP's point was not about generation, it was about representation (specifically conditioning on the representation of the [con]text).
Your aside about classifiers is not only very apt, it is also exactly OP's point! LLMs are implicit classifiers, and the features they classify have been shown to include those that seem necessary to effectively predict text!
> It's obvious from direct experience that they're incapable of knowing true and false in a general sense.
Yes; otherwise they would be perfect oracles. Instead, they're imperfect classifiers.
Of course, you could also object that LLMs don't "really" classify anything (please don't), at which point the question becomes how effective they are when used as classifiers, which is what the cited experiments investigate.
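To make "used as classifiers" concrete, the cited probing experiments boil down to fitting a linear classifier on per-statement activations and checking whether it transfers to a different dataset. Here's a minimal sketch with synthetic activations standing in for real ones (the hidden size and the injected "truth direction" are fabricated purely so the probe has something to find):

```python
# Minimal sketch of a linear probe. The activations are synthetic stand-ins;
# in the paper, X would be hidden states extracted from an LLM for each statement.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512  # hypothetical hidden size

truth_direction = rng.normal(size=d_model)

def fake_activations(labels):
    # Noise plus a +/- shift along the shared direction, depending on the label.
    noise = rng.normal(size=(len(labels), d_model))
    return noise + np.outer(2 * labels - 1, truth_direction)

y_train = rng.integers(0, 2, size=1000)   # "dataset A" labels (true/false)
y_other = rng.integers(0, 2, size=500)    # "dataset B" labels, different topic
X_train = fake_activations(y_train)
X_other = fake_activations(y_other)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("in-distribution accuracy:", probe.score(X_train, y_train))
print("transfer accuracy:       ", probe.score(X_other, y_other))
```

Whether transfer accuracy stays high on genuinely different real datasets is exactly the evidence in dispute here.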
> But rather: "[LLMs condition] on latents ABOUT truth, falsity, reliability, and calibration".
Yes, I know. And the paper didn't show that. It projected some activations into low-dimensional space and claimed that, since there was a pattern in the plots, it was a "latent".
The other experiments were similarly hand-wavy.
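For readers following along, the projection step being criticized is roughly the following (synthetic activations again; a real run would extract them from the model's residual stream, and whether a clean pattern actually appears is the whole question):

```python
# Sketch of "projecting activations into low-dimensional space and looking for
# a pattern". Synthetic data; the separation here exists by construction.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
d_model = 512
direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=800)
acts = rng.normal(size=(800, d_model)) + np.outer(2 * labels - 1, direction)

proj = PCA(n_components=2).fit_transform(acts)
plt.scatter(proj[:, 0], proj[:, 1], c=labels, s=5)
plt.title("Statement activations, top 2 principal components")
plt.show()
```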
> Your aside about classifiers is not only very apt, it is also exactly OP's point! LLMs are implicit classifiers, and the features they classify have been shown to include those that seem necessary to effectively predict text!
That's what's called a truism: "if it classifies successfully, it must be conditioned on latents about truth".
> "if it classifies successfully, it must be conditioned on latents about truth"
Yes, this is a truism. Successful classification does not depend on latents being about truth.
However, successfully classifying between text intended to be read as either:
- deceptive or honest
- farcical or tautological
- sycophantic or sincere
- controversial or anodyne
does depend on latent representations being about truth (assuming no memorisation, data leakage, or spurious features).
If your position is that this is necessary but not sufficient to demonstrate such a dependence, or that reverse engineering the learned features is necessary for certainty, then I agree.
But I also think this is primarily a semantic disagreement. A representation can be "about something" without representing it in full generality.
So, to be more concrete: "The representations produced by LLMs can be used to linearly classify implicit details about a text, and the LLM's representation of those implicit details conditions the sampling of text from the LLM".
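A hedged sketch of the second half of that sentence, i.e. the sense in which a representation can condition sampling: nudge the hidden states along some direction mid-forward-pass and see how generation changes. The model, layer, direction, and scale below are placeholders, not the paper's actual setup:

```python
# Placeholder demonstration: add a (random, hypothetical) direction to one
# layer's hidden states during the forward pass and generate text. In the paper
# the direction is learned from true/false statements, not random.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()

def nudge(module, inputs, output):
    # Transformer blocks usually return a tuple whose first element is the
    # hidden states; push those along the direction and pass the rest through.
    if isinstance(output, tuple):
        return (output[0] + 5.0 * direction,) + output[1:]
    return output + 5.0 * direction

handle = model.transformer.h[6].register_forward_hook(nudge)
inputs = tok("The Eiffel Tower is located in", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8)
print(tok.decode(out[0]))
handle.remove()
```

None of this settles whether any particular learned direction is "about truth"; it only shows the mechanical sense in which an internal feature can condition what gets sampled.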