From https://lastexam.ai/: "The dataset consists of 2,500 challenging questions ...

panarky · 2025-11-18T18:59:30 1763492370

The jump in ARC-AGI and MathArena suggests Google has solved the data scarcity problem for reasoning, maybe with synthetic data self-play??

This was the primary bottleneck preventing models from tackling novel scientific problems they haven't seen before.

If Gemini 3 Pro has transcended "reading the internet" (knowledge saturation), and made huge progress in "thinking about the internet" (reasoning scaling), then this is a really big deal.

largbae · 2025-11-18T20:14:59 1763496899

How do they hold back questions in practice though? These are hosted models. To ask the question is to reveal it to the model team.

Bombthecat · 2025-11-18T20:20:10 1763497210

They pinky swear not to store and use the prompts and data lol

UltraSane · 2025-11-18T20:34:40 1763498080

A legally binding pinky swear LOL

riku_iki · 2025-11-19T00:02:06 1763510526

With fine print somewhere on page 67 saying there are exceptions.

ashdksnndck · 2025-11-19T07:10:13 1763536213

Who needs fine print when there is an SRE with access to the servers who is friends with a research director who gets paid more if the score goes up?

UltraSane · 2025-11-18T20:34:24 1763498064

You have to trust that the LLM provider isn't copying the questions when Humanity's Last Exam runs the test.

mapt · 2025-11-19T12:22:31 1763554951

There are only eleventy trillion dollars shifting around based on the results, so nobody has any reason to lie.

rvnx · 2025-11-18T19:15:18 1763493318

Seems difficult to believe, considering the number of people who prepare this dataset who also work(ed) at or hold shares in Google or OpenAI, etc.

menaerus · 2025-11-20T07:09:51 1763622591

So everybody is cheating in your mind? We can't trust anything? How about a more balanced take: there's certainly some progress, and while the benchmark results most likely don't reflect real-world performance, the progress is continuous.