> In this work, we curate high-quality datasets of true/false statements and use them to study in detail the structure of LLM representations of truth, drawing on three lines of evidence:
> 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure.
> 2. Transfer experiments in which probes trained on one dataset generalize to different datasets.
> 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa.
> Overall, we present evidence that language models linearly represent the truth or falsehood of factual statements.
You can debate whether the three cited experiments back the claim (I don't believe they do), but they certainly don't prove what OP claimed. Even if you demonstrated that an LLM has a "linear structure" when validating true/false statements, that's a whole universe away from having a concept of truth that generalizes, for example, to knowing when nonsense is being generated based on conceptual models that can be evaluated as true or false. It's also very different to ask a model to evaluate the veracity of a nonsense statement vs. avoiding the generation of a nonsense statement. The former is easier than the latter, and probably could have been done with earlier generations of classifiers.
Colloquially, we've got LLMs telling people to put glue on pizza. It's obvious from direct experience that they're incapable of knowing true from false in a general sense.
> [...] but they certainly don't prove what OP claimed.
OP's claim was not: "LLMs know whether text is true, false, reliable, or is epistemically calibrated".
But rather: "[LLMs condition] on latents *ABOUT* truth, falsity, reliability, and calibration".
> It's also very different to ask a model to evaluate the veracity of a nonsense statement, vs. avoiding the generation of a nonsense statement [...] probably could have been done with earlier generations of classifiers
Yes. OP's point was not about generation, it was about representation (specifically conditioning on the representation of the [con]text).
Your aside about classifiers is not only very apt, it is also exactly OP's point! LLMs are implicit classifiers, and the features they classify have been shown to include those that seem necessary to effectively predict text!
> It's obvious from direct experience that they're incapable of knowing true and false in a general sense.
Yes; otherwise they would be perfect oracles. Instead, they're imperfect classifiers.
Of course, you could also object that LLMs don't "really" classify anything (please don't), at which point the question becomes how effective they are when used as classifiers, which is what the cited experiments investigate.
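To make "used as classifiers" concrete, the cited probing experiments boil down to fitting a linear classifier on per-statement activations and checking whether it transfers to a different dataset. Here's a minimal sketch with synthetic activations standing in for real ones (the hidden size and the injected "truth direction" are fabricated purely so the probe has something to find):

```python
# Minimal sketch of a linear probe. The activations are synthetic stand-ins;
# in the paper, X would be hidden states extracted from an LLM for each statement.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512  # hypothetical hidden size

truth_direction = rng.normal(size=d_model)

def fake_activations(labels):
    # Noise plus a +/- shift along the shared direction, depending on the label.
    noise = rng.normal(size=(len(labels), d_model))
    return noise + np.outer(2 * labels - 1, truth_direction)

y_train = rng.integers(0, 2, size=1000)   # "dataset A" labels (true/false)
y_other = rng.integers(0, 2, size=500)    # "dataset B" labels, different topic
X_train = fake_activations(y_train)
X_other = fake_activations(y_other)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("in-distribution accuracy:", probe.score(X_train, y_train))
print("transfer accuracy:       ", probe.score(X_other, y_other))
```

Whether transfer accuracy stays high on genuinely different real datasets is exactly the evidence in dispute here.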
> But rather: "[LLMs condition] on latents ABOUT truth, falsity, reliability, and calibration".
Yes, I know. And the paper didn't show that. It projected some activations into low-dimensional space and claimed that, since there was a pattern in the plots, it was a "latent".
The other experiments were similarly hand-wavy.
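For readers following along, the projection step being criticized is roughly the following (synthetic activations again; a real run would extract them from the model's residual stream, and whether a clean pattern actually appears is the whole question):

```python
# Sketch of "projecting activations into low-dimensional space and looking for
# a pattern". Synthetic data; the separation here exists by construction.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
d_model = 512
direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=800)
acts = rng.normal(size=(800, d_model)) + np.outer(2 * labels - 1, direction)

proj = PCA(n_components=2).fit_transform(acts)
plt.scatter(proj[:, 0], proj[:, 1], c=labels, s=5)
plt.title("Statement activations, top 2 principal components")
plt.show()
```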
> Your aside about classifiers is not only very apt, it is also exactly OP's point! LLMs are implicit classifiers, and the features they classify have been shown to include those that seem necessary to effectively predict text!
That's what's called a truism: "if it classifies successfully, it must be conditioned on latents about truth".
> "if it classifies successfully, it must be conditioned on latents about truth"
Yes, this is a truism. Successful classification does not depend on latents being about truth.
However, successfully classifying between text intended to be read as either:
- deceptive or honest
- farcical or tautological
- sycophantic or sincere
- controversial or anodyne
does depend on latent representations being about truth (assuming no memorisation, data leakage, or spurious features).
If your position is that this is necessary but not sufficient to demonstrate such a dependence, or that reverse engineering the learned features is necessary for certainty, then I agree.
But I also think this is primarily a semantic disagreement. A representation can be "about something" without representing it in full generality.
So, to be more concrete: "The representations produced by LLMs can be used to linearly classify implicit details about a text, and the LLM's representation of those implicit details conditions the sampling of text from the LLM".
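A hedged sketch of the second half of that sentence, i.e. the sense in which a representation can condition sampling: nudge the hidden states along some direction mid-forward-pass and see how generation changes. The model, layer, direction, and scale below are placeholders, not the paper's actual setup:

```python
# Placeholder demonstration: add a (random, hypothetical) direction to one
# layer's hidden states during the forward pass and generate text. In the paper
# the direction is learned from true/false statements, not random.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()

def nudge(module, inputs, output):
    # Transformer blocks usually return a tuple whose first element is the
    # hidden states; push those along the direction and pass the rest through.
    if isinstance(output, tuple):
        return (output[0] + 5.0 * direction,) + output[1:]
    return output + 5.0 * direction

handle = model.transformer.h[6].register_forward_hook(nudge)
inputs = tok("The Eiffel Tower is located in", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8)
print(tok.decode(out[0]))
handle.remove()
```

None of this settles whether any particular learned direction is "about truth"; it only shows the mechanical sense in which an internal feature can condition what gets sampled.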