> what romanian football player won the premier league
> The only Romanian football player to have won the English Premier League (as of 2025) is Florin Andone, but wait — actually, that’s incorrect; he never won the league.
> ...
> No Romanian footballer has ever won the Premier League (as of 2025).
Yes, this is what we needed, more "conversational" ChatGPT... Let alone the fact that the answer is wrong.
My worry is that they're training it on Q&A from the general public now, and that this tone, and more specifically, how obsequious it can be, is exactly what the general public want.
Most of the time, I suspect, people are using it like Wikipedia, but with a shortcut straight to the real question they want answered. Unfortunately they don't know if the answer is right or wrong; they just want to be told how bright they were for asking, and here is the answer.
OpenAI then get caught in a revenue maximising hell-hole of garbage.
LLMs only really make sense for tasks where verifying the solution (which you have to do!) is significantly easier than solving the problem: translation where you know the target and source languages, agentic coding with automated tests, some forms of drafting or copy editing, etc.
General search is not one of those! Sure, the machine can give you its sources but it won't tell you about sources it ignored. And verifying the sources requires reading them, so you don't save any time.
I agree a lot with the first part. The only time I actually feel productive with them is when I can have a short feedback cycle with 100% proof of whether the output is correct or not; as soon as "manual human verification" is needed, things spiral out of control quickly.
> Sure, the machine can give you its sources but it won't tell you about sources it ignored.
You can prompt for that though. Include something like "Include all the sources you came across, and explain why you think they were irrelevant" and, unsurprisingly, it'll include those. I've also added a "verify_claim" tool which it is instructed to use for any claim before sharing a final response; it checks each claim inside a brand-new context, one call per claim. So far it works great for me with GPT-OSS-120b as a local agent with access to search tools.
> You can prompt for that though. Include something like "Include all the sources you came across, and explain why you think they were irrelevant" and, unsurprisingly, it'll include those. I've also added a "verify_claim" tool which it is instructed to use for any claim before sharing a final response; it checks each claim inside a brand-new context, one call per claim. So far it works great for me with GPT-OSS-120b as a local agent with access to search tools.
Not everyone uses LLMs the same way, which is made extra clear by the announcement this submission is about. I don't want conversational LLMs, but it seems that perspective isn't shared by absolutely everyone, and that makes sense; it's a subjective thing how you like to be talked/written to.
> Explain your setup in more detail please?
I don't know what else to tell you that I haven't said already :P Not trying to be obtuse, just don't know what sort of details you're looking for. I guess in more specific terms: I'm using llama.cpp (llama-server) as the "runner", and then I have a Rust program that acts as the CLI for my "queries" and makes HTTP requests to llama-server. The requests to llama-server include "tools", where one of those is a "web_search" tool hooked up to a local YaCy instance, and another is "verify_claim", which basically restarts a new, separate conversation inside the same process with access to a subset of the tools. Is that helpful at all?
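To make it a bit more concrete, here's a rough sketch of the shape of the request the CLI builds. This is simplified and illustrative, not my actual code: the tool schemas, model name, and port are just assumptions, and it assumes reqwest (blocking + json features) and serde_json as dependencies.

```rust
// Illustrative sketch only: tool definitions passed to llama-server's
// OpenAI-compatible chat endpoint. Names, schemas, model and port are
// assumptions, not the exact setup described above.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = json!({
        "model": "gpt-oss-120b",
        "messages": [
            { "role": "user", "content": "what romanian football player won the premier league" }
        ],
        "tools": [
            { "type": "function", "function": {
                "name": "web_search",
                "description": "Search a local YaCy instance and return result snippets.",
                "parameters": { "type": "object",
                    "properties": { "query": { "type": "string" } },
                    "required": ["query"] }
            }},
            { "type": "function", "function": {
                "name": "verify_claim",
                "description": "Verify a single claim in a fresh context before it goes into the final answer.",
                "parameters": { "type": "object",
                    "properties": { "claim": { "type": "string" } },
                    "required": ["claim"] }
            }}
        ]
    });

    // When the model responds with a tool call, the CLI executes it
    // (a web search, or a brand-new verification conversation for one claim),
    // appends the result as a "tool" message, and asks for the next completion.
    let resp = reqwest::blocking::Client::new()
        .post("http://localhost:8080/v1/chat/completions")
        .json(&body)
        .send()?
        .text()?;
    println!("{resp}");
    Ok(())
}
```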
"one call per claim" I wonder how long it takes for it to be common knowledge how important this is. Starting to think never. Great idea by the way, I should try this.
I've been trying to figure out ways of highlighting why it's important and how it actually works; maybe some heatmap of the attention over previous tokens, so people can see visually how messed up things become once even just two concepts get mixed in the same context.
One of the dangers of automated tests is that if you use an LLM to generate tests, it can easily start testing implemented rather than desired behavior. Tell it to loop until tests pass, and it will do exactly that if unsupervised.
And you can’t even treat implementation as a black box, even using different LLMs, when all the frontier models are trained to have similar biases towards confidence and obsequiousness in making assumptions about the spec!
Verifying the solution in agentic coding is not nearly as easy as it sounds.
Not only can it easily do this, I've found that Claude models do this as a matter of course. My strategy now has been to either write the test or write the implementation and use Claude for the other one. That keeps it a lot more honest.
I've often found it helpful in search. Specifically, when the topic is well-documented and you can provide a clear description but lack the right words or terminology, it can help in finding the right question to ask, if not answering it. Remember when we used to laugh at people typing literal questions into the Google search bar? Those are exactly the kinds of queries the LLM is equipped to answer. As for the "improvements" in GPT 5.1, it seems to me like another case of pushing Clippy on people who want Anton.
https://www.latent.space/p/clippy-v-anton
That's a major use case, especially if the definition is broad enough to include "take my expertise, my knowledge, and perhaps a written document, and transmute it into other forms": slides, illustrations, flash cards, quizzes, podcasts, scripts for an inbound call center.
But there seem to be uses where a verified solution is irrelevant. Creativity, generally: an image, a poem, the description of an NPC in a roleplaying game, the visuals for a music video never have to be "true", just evocative. I suppose persuasive rhetoric doesn't have to be true either, just plausible or engaging.
As for general search, I don't know that "classic search" can be meaningfully said to tell you about the sources it ignored, either. I will agree that using OpenAI or Perplexity for search is kind of meh, but Google's AI Mode does a reasonable job of informing you about the links it provides, and you can easily tab over to a classic search if you want. It's almost like having a depth of expertise in doing search helps in building a search product that incorporates an LLM...
But, yeah, if one really isn't interested in looking at sources, just chatting with a typical LLM seems a rather dubious way to get an accurate or reasonably comprehensive answer.
With search engine results you can easily see and judge the quality of the sources. With LLMs, even if they link to sources, you can’t be sure they are accurately representing the content. And once your own mind has been primed with the incorrect summary, it’s harder to pull reality out of the sources, even if they’re good (or even relevant — I find LLMs often pick bad/invalid sources to build the summary result).
Exactly. I've gotten much more interested in LLMs now that I've accepted I can just look at the final result (code) without having to read any of the justification wall of text, which is generally convincing bullshit.
It's like working with a very cheap, extremely fast, dishonest and lazy employee. You can still get them to help you but you have to check them all the time.
The ass licking is dangerous to our already too-tight information bubbles; that part is clear. But that aside, I think I prefer a conversational/buddy-like interaction to an encyclopedic tone.
Intuitively I think it is easier to make the connection that this random buddy might be wrong, rather than thinking the encyclopedia is wrong. Casualness might serve to reduce the tendency to think of the output as actual truth.
It's very frustrating that it can't be relied upon. I was asking Gemini this morning whether Uncharted 1, 2, and 3 had remastered versions for the PS5. It said no. Then 5 minutes later I was on the PSN store and there were the three remastered versions for sale.
People have been using "It's what the [insert Blazing Saddles clip here] want!" for years to describe platform changes that dumb down features and make it harder to use tools productively. As always, it's a lie; the real reason is "The new way makes us more money," usually by way of a dark pattern.
Stop giving them the benefit of the doubt. Be overly suspicious and let them walk you back to trust (that's their job).
> My worry is that they're training it on Q&A from the general public now, and that this tone, and more specifically, how obsequious it can be, is exactly what the general public want.
That tracks; it's what's expected of human customer service, too. Call a large company for support and you'll get the same sort of tone.
I just asked ChatGPT 5.1 auto (not instant) on a Teams account, and its first response was...
I could not find a Romanian football player who has won the Premier League title.
If you like, I can check deeper records to verify whether any Romanian has been part of a title-winning squad (even if as a non-regular player) and report back.
Then I followed up with an 'ok', and it found the right player.
Just to rule out a random error, I asked the same question two more times in separate chats to gpt 5.1 auto, below are responses...
#2:
One Romanian footballer who did not win the Premier League but played in it is Dan Petrescu.
If you meant actually won the Premier League title (as opposed to just playing), I couldn’t find a Romanian player who is a verified Premier League champion.
Would you like me to check more deeply (perhaps look at medal-winners lists) to see if there is a Romanian player who earned a title medal?
#3:
The Romanian football player who won the Premier League is Costel Pantilimon.
He was part of Manchester City when they won the Premier League in 2011-12 and again in 2013-14.
(Sources: Wikipedia +1)
The Romanian football player who won the Premier League is Gheorghe Hagi. He played for Galatasaray in Turkey but had a brief spell in the Premier League with Wimbledon in the 1990s, although he didn't win the Premier League with them.
However, Marius Lăcătuș won the Premier League with Arsenal in the late 1990s, being a key member of their squad.
Same here, but with the default 5.1 auto and no extra settings. Every time someone posts one of these I just imagine they must have misunderstood the UI settings or cluttered their context somehow.
Why is this the top comment... this isn't a question you ask an LLM. But I know, that's how people are using them, and it's the narrative that's being sold to us...
You see people (often business people who are enthusiastic about tech) claiming that these bots are the new Google and Wikipedia, and that you're behind the times if you do what amounts to looking up information yourself.
We’re preaching to the choir by being insistent here that you prompt these things to get a “vibe” about a topic rather than accurate information, but it bears repeating.
They are only the new Google when they are told to process and summarize web searches. When using trained knowledge they're about as reliable as a smart but stubborn uncle.
Pretty much only search-specific modes (perplexity, deep research toggles) do that right now...
Out of curiosity, is this a question you think Google is well-suited to answer^? How many Wikipedia pages will you need to open to determine the answer?
When folks get frustrated at seeing a bizarre, extreme-outlier question touted as "the model still can't do _", part of it is because the goalposts have been set far beyond what traditional Google search or Wikipedia are useful for.
^ I spent about five minutes looking for the answer via Google, and the only way I got the answer was their ai summary. Thus, I would still need to confirm the fact.
Unlike the friendly bot, if I can’t find credible enough sources I’ll stay with an honest “I don’t know”, instead of praising the genius of whoever asked and then making something up.
Sure, but this is a false dichotomy. If I get an unsourced answer from ChatGPT, my response will be "eh you can't trust this, but ChatGPT thinks x"
And then you can use that to quickly look - does that player have championships mentioned on their wiki?
It's important to flag that there are some categories that are easy for LLMs (facts that haven't changed on Wikipedia for ten years), but inference-only LLMs (no tools) are extremely limited and you should always treat them as a person saying "I seem to recall x".
Is the UX/marketing deeply flawed? Yes, of course. I also wish an inference-only response appropriately stated its uncertainty (like a human would, e.g. "without googling, my guess is x"). But among technical folks it feels disingenuous to cite "models still can't answer this obscure question" as a reason why they're stupid or useless.
It's not how I use LLMs. I have a family member who often feels the need to ask ChatGPT almost any question that comes up in a group conversation (even ones like this that could easily be searched without needing an LLM) though, and I imagine he's not the only one who does this. When you give someone a hammer, sometimes they'll try to have a conversation with it.
I'll respond to this bait in the hope that it clicks for someone how _not_ to use an LLM...
Asking "them"... your perspective is already warped. It's not your fault; all the text we've ever seen before was associated with a human being.
Language models are mathematical, statistical beasts. The beast generally doesn't do well with open ended questions (known as "zero-shot"). It shines when you give it something to work off of ("one-shot").
Some may complain of the preciseness of my use of zero and one shot here, but I use it merely to contrast between open ended questions versus providing some context and work to be done.
Some examples...
- summarize the following
- given this code, break down each part
- give alternatives to this code and their trade-offs
- given this error, how to fix or begin troubleshooting
I mainly use them for technical things I can then verify myself.
While extremely useful, I consider them extremely dangerous. They provide a false sense of "knowing things"/"learning"/"productivity". It's too easy to begin to rely on them as a crutch.
When learning new programming languages, I go back to writing by hand and compiling in my head. I need that mechanical muscle memory, same as trying to learn calculus or physics, chemistry, etc.
> Language models are mathematical, statistical beasts. The beast generally doesn't do well with open ended questions (known as "zero-shot"). It shines when you give it something to work off of ("one-shot").
That is the usage that is advertised to the general public, so I think it's fair to critique it by way of this usage.
Yeah, the "you're using it wrong" argument falls flat on its face when the technology is presented as an all-in-one magic answer box. Why give these companies the benefit of the doubt instead of holding them accountable for what they claim this tech to be? https://www.youtube.com/watch?v=9bBfYX8X5aU
I like to ask these chatbots to generate 25 trivia questions and answers from "golden age" Simpsons. It fabricates complete BS for a noticeable number of them. If I can't rely on it for something as low-stakes as TV trivia, it seems absurd to rely on it for anything else.
Whenever I read something like this I do definitely think "you're using it wrong". This question would've certainly tripped up earlier models but new ones have absolutely no issue making this with sources for each question. Example:
(the 7 minutes thinking is because ChatGPT is unusually slow right now for any question)
These days I'd trust it to accurately give 100 questions only about Homer. LLMs really are quite a lot better than they used to be by a large margin if you use them right.
Fwiw, if you can use a thinking model, you can get them to do useful things. Find specific webpages (menus, online government forms - visa applications or addresses, etc).
The best thing about the latter is that searching for those surfaces extremely unfriendly ads that might charge you 2x the actual fee, so using Google is a good way to get scammed.
If I'm walking somewhere (common in NYC) I often don't mind issuing a query (what's the salt and straw menu in location today) and then checking back in a minute. (Or.... Who is playing at x concert right now if I overhear music. It will sometimes require extra encouragement - "keep trying" to get the right one)
You either give them the option to search the web for facts or you ask them things where the utility/validity of the answer is defined by you (e.g. 'summarize the following text...') instead of the external world.
I really only use LLMs for coding and IT-related questions. I've had Claude self-correct itself several times about how something might be the more idiomatic way to do something after starting to give me the answer. For example, I'll ask how to set something up in a startup script, and I've had it start by giving me strict POSIX syntax and then self-correct once it "realizes" that I am using zsh.
I find it amusing, but also I wonder what causes the LLM to behave this way.
Some people are guilty of writing stuff as they go along as well. You could maybe even say they're "thinking out loud", forming the idea and the conclusion as they go rather than knowing them from the beginning. Then later, when they have some realization, like "thinking out loud isn't entirely accurate, but...", they keep the entire comment as-is rather than continuously iterating on it like a diffusion model would. So the post becomes a chronological archive of what the author thought and/or did, rather than just the conclusion.
Because, one way or another, we will need to do that for LLMs to be useful. Whether the facts are in the training data or in the context (RAG-provided) is irrelevant. And besides, we are supposed to trust that these things have "world knowledge" and "emergent capabilities" precisely because their training data contain, well, facts.
Non-thinking/non-agentic models must 1-shot the answer, so every token they output is part of the response, even if it's wrong.
This is why people get different results with thinking models -- with a non-thinking model, it's as if you could be asked ANY question and had to give the correct answer all at once, in full stream-of-consciousness.
Yes there are perverse incentives, but I wonder why these sorts of models are available at all tbh.