
The setup Turing describes isn't the "both must sum to 100%" setup you're presenting. He has two different games being played, one with two humans and one with human-vs-machine, and suggests comparing the results. E.g. if a man successfully imitates a woman only 25% of the time, then we'd ask whether the machine can pass as human equally often.

But much more importantly, as I said, Turing is clearly not describing a specific experimental methodology! That's not what the paper is about, and in fact it would be somewhat absurd to run the test precisely as he describes it (since detecting a man imitating a woman is quite a different task from detecting a machine). His point is that we should approach the question of machine intelligence with actual experiments rather than asking unanswerable questions, but he only sketches the general premise of what such an experiment could look like.

So I understand that you find a particular test setup better or more elegant than others, and that's fine. But you shouldn't claim that Turing's paper demands your preferred setup, or that other setups are at odds with his paper.



> The setup Turing describes isn't the "both must sum to 100%" setup you're presenting. He has two different games being played [...]

Turing introduces the game as a man imitating a woman, then modifies it to a machine imitating a human. In both of those games, the interrogator makes a binary choice between two witnesses, one real and one imitating. P(imitator-judged-real) and P(real-judged-real) therefore sum to 100% in both games, and in both a score of 50% means the imitator and the real witness are indistinguishable.
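
Here's a tiny simulation of that binary-choice structure, just to make the arithmetic concrete (the "skill" parameter is invented for illustration):

    import random

    # Binary-choice game: one real witness, one imitator; the
    # interrogator must pick exactly one as real. "skill" is the
    # (made-up) probability of picking correctly; skill = 0.5 models
    # a perfectly indistinguishable imitator.
    def play_trials(skill, n=100_000):
        imitator_judged_real = real_judged_real = 0
        for _ in range(n):
            if random.random() < skill:
                real_judged_real += 1      # correct pick
            else:
                imitator_judged_real += 1  # fooled by the imitator
        return imitator_judged_real / n, real_judged_real / n

    p_imit, p_real = play_trials(skill=0.5)
    print(p_imit + p_real)  # always 1.0: the outcomes are complementary
    print(p_imit)           # ~0.5: the imitator passes at chance level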

I believe that's the reason why a score of 50% is treated as significant. The "GPT-4 passes Turing test" paper uses that number as a pass threshold, and the linked site repeats it.

I'm complaining that the paper changed the game so it's no longer a binary choice, but continued to treat the 50% threshold as significant. Do you see why that's wrong? Or do you disagree?

I'm not saying that any change to Turing's formulation would be bad, just that the paper's variation is specifically bad. It would be bad in isolation too, but I think the reason for their confusion is easiest to see with reference to Turing's original formulation.

If you haven't read that paper then we're probably talking past each other. The link to it on the site looks broken, but it's in the page source:

https://arxiv.org/pdf/2405.08007


> I'm not saying that any change to Turing's formulation would be bad, just that the paper's variation is specifically bad.

I understand that, and I'm saying there is no "Turing's formulation". His paper argues for a certain sort of test, and the study you're talking about is the sort of test he advocated. It's not a departure or a bastardization, it's a Turing test.

As for your argument against the study, to be honest I don't see it? AFAICT the participants' goal was just to judge the humanness of a single witness, not to maximize their long-term likelihood of judging correctly over many trials, or some such thing. Even if they'd known the prior chances of speaking to an LLM, there's no obvious reason why that prior should hugely affect their conclusion after a five minute conversation - which also seems immaterial since they didn't know.

Plus, the authors give a pretty straightforward rationale for their 50% threshold, and it has no connection to Turing's 3-player setup or to whether the imitator and witness are indistinguishable. If they had wanted "indistinguishable" as a threshold, then obviously their pass criterion would have been for the machine and human pass rates to be equal within an error bar, right? So it's pretty implausible to imagine a connection between that and their 50% threshold.


> AFAICT the participants' goal was just to judge the humanness of a single witness, not to maximize their long-term likelihood of judging correctly over many trials

Why do you think this matters? Even in a single trial, I would judge very differently if I knew the population to be 99% human vs. 1% human. Wouldn't you? If you were judging whether a single mushroom was poisonous or not, then would you not care whether it was found in a forest (mostly poisonous) or a supermarket (mostly not)?
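
To put rough numbers on it, here's a back-of-the-envelope Bayes calculation (every figure invented for illustration): the same observation about a single mushroom supports opposite conclusions under different priors.

    # posterior P(poisonous | evidence) from a prior and a likelihood
    # ratio lr = P(evidence | poisonous) / P(evidence | safe)
    def posterior_poisonous(prior, lr):
        odds = prior / (1 - prior) * lr
        return odds / (1 + odds)

    lr = 3.0  # the evidence mildly suggests "poisonous" (assumed)
    print(posterior_poisonous(prior=0.80, lr=lr))  # forest: ~0.92
    print(posterior_poisonous(prior=0.01, lr=lr))  # supermarket: ~0.03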

The question of whether probabilities are meaningful for non-repeated events was controversial in the eighteenth century, but I thought it was pretty settled by now. Bookmakers manage to estimate a probability that a given team will win the Super Bowl, with no requirement for the same pair of teams to play multiple times.

> If they had wanted "indistinguishable" as a threshold, then obviously their pass criteria would have been for the machine and human pass rates to be equal within an error bar, right?

The title of the paper is literally "People cannot distinguish GPT-4 from a human in a Turing test". They're very clear that they think that's because 50% means indistinguishable:

> A baseline of 50% is better justified since it indicates that interrogators are not better than chance at identifying machines [French, 2000].

That statement is true for a Turing test with a binary choice, but false for theirs. I agree that "for the machine and human pass rates to be equal within an error bar" would be closer to a correct criterion, and by that criterion they weren't equal:

> humans’ pass rate was significantly higher than GPT-4’s (z = 2.42, p = 0.017)
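
For reference, that kind of statistic is a two-proportion z-test. A sketch with invented counts (not the paper's actual data):

    from math import sqrt
    from statistics import NormalDist

    # Two-proportion z-test: did group 1 pass at a higher rate than
    # group 2? Counts below are made up for illustration.
    def two_proportion_z(x1, n1, x2, n2):
        p1, p2 = x1 / n1, x2 / n2
        pooled = (x1 + x2) / (n1 + n2)
        se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        z = (p1 - p2) / se
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
        return z, p_value

    # e.g. humans judged human 134/200 times vs. the machine 108/200
    print(two_proportion_z(134, 200, 108, 200))  # z ~ 2.66, p ~ 0.008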

So do you think their paper is correctly titled?


> Why do you think this matters?

I said that in passing, maybe I should have omitted it - the main point with the priors is that the respondents didn't know them. It's normal for a study to compare N things to a control by testing N+1 similarly-sized groups, because subjects are not biased by priors they don't know about.

> So do you think their paper is correctly titled?

I didn't say anything about that, and have no strong opinion. (I'm not here to defend every aspect of the paper!)


Can you clarify what you think the paper says that's correct? If you're unwilling to provide an opinion on the paper's literal headline claim, then I don't know what we're discussing here.

> because subjects are not biased by priors they don't know about

It feels like you want the subjects to be "unbiased", "without any prior"? That concept doesn't exist, though. If no prior is supplied, then the subjects will make their best guess based on general past experience; but that's still a prior, just an idiosyncratic personal one. Very few people would put numbers to the forest vs. supermarket mushroom, but it's the same general thought process.

If that prior matches the actual distribution, then good. If the actual distribution contains more machines than expected, then the interrogator is more likely to misjudge a machine as a human. By analogy, if I think mistakenly that a mushroom came from a supermarket but it actually came from a forest, then I'm more likely to misjudge it as non-poisonous.
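
A toy simulation of that effect (all parameters invented): each witness emits a noisy "machine-ness" score, N(1, 1) for machines and N(0, 1) for humans, and the judge applies Bayes' rule under an assumed prior, answering "machine" when the posterior exceeds 0.5.

    import random
    from statistics import NormalDist

    def miss_rate(true_prior, assumed_prior, n=200_000):
        nd = NormalDist()
        prior_odds = assumed_prior / (1 - assumed_prior)
        missed = machines = 0
        for _ in range(n):
            is_machine = random.random() < true_prior
            x = random.gauss(1.0 if is_machine else 0.0, 1.0)
            lr = nd.pdf(x - 1.0) / nd.pdf(x)  # machine:human likelihood
            says_machine = prior_odds * lr > 1.0
            if is_machine:
                machines += 1
                if not says_machine:
                    missed += 1
        return missed / machines  # P(machine judged human)

    print(miss_rate(true_prior=0.5, assumed_prior=0.5))  # ~0.31
    print(miss_rate(true_prior=0.5, assumed_prior=0.2))  # ~0.81

Underestimating how many machines are in the pool pushes the judge's decision threshold up, so far more machines get waved through as human.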

The binary choice version makes it obvious that the prior is 50%, and forces the interrogator to respect that. The paper's version has sent us into this epistemological tarpit, which seems strictly worse to me.



