From https://lastexam.ai/: "The dataset consists of 2,500 challenging questions across over a hundred subjects. We publicly release these questions, while maintaining a *private test set of held out questions* to assess model overfitting." [emphasis mine]
While the private questions don't seem to be included in the published performance results, HLE can presumably flag any LLM that appears to have gamed its scores, based on the differential between its performance on the public and private questions. Since they haven't flagged anyone yet, I think the scores are relatively trustworthy.
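For concreteness, here's roughly what such a differential check could look like (a minimal Python sketch; the function name, inputs, and z-threshold are my own assumptions, not anything HLE has published):

    from math import sqrt

    def overfit_flag(public_correct: list[int], private_correct: list[int],
                     z_threshold: float = 3.0) -> bool:
        # Two-proportion z-test: flag if accuracy on the released questions
        # exceeds accuracy on the held-out questions by more than chance
        # would explain. Inputs are per-question correctness (1/0) lists.
        n1, n2 = len(public_correct), len(private_correct)
        p1 = sum(public_correct) / n1   # accuracy on public split
        p2 = sum(private_correct) / n2  # accuracy on private split
        p = (sum(public_correct) + sum(private_correct)) / (n1 + n2)
        se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
        if se == 0:
            return False  # degenerate case: every answer identical
        return (p1 - p2) / se > z_threshold

A gap that clears a threshold like this on a 2,500-question public set would be hard to explain by anything other than the released questions leaking into training.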
The jump in ARC-AGI and MathArena suggests Google has solved the data-scarcity problem for reasoning, maybe with synthetic-data self-play??
Data scarcity was the primary bottleneck preventing models from tackling novel scientific problems absent from their training data.
If Gemini 3 Pro has transcended "reading the internet" (knowledge saturation) and made huge progress at "thinking about the internet" (scaling reasoning), then this is a really big deal.
So everybody is cheating, in your mind? We can't trust anything? How about a more balanced take: there's certainly some progress, and while the benchmark results most likely don't fully reflect real-world performance, the progress is continuous.