You can ask a model to provide an analysis of its answer, including a probability that it is correct, as part of the prompt; it helps a lot with double-checking.
The ratings are consistent from the model's point of view, particularly if you ask the model to rationalize its rating. You will get plenty of hallucinated answers that the model can recognize as hallucinations, and give a low rating to, in the same response.
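A minimal sketch of what that kind of prompt can look like. The call_llm() helper is a placeholder for whatever model/API you're actually calling, and the JSON output format is just one convenient way to get the answer, rationale, and self-rating back in a single response:

```python
import json


def call_llm(prompt: str) -> str:
    """Hypothetical helper: send the prompt to whatever model/API you use
    and return the raw text of its reply."""
    raise NotImplementedError


def answer_with_confidence(question: str) -> dict:
    # Ask for the answer, a brief rationale, and a self-rated probability of
    # correctness all in one response, so the model has to review what it
    # just said before rating it.
    prompt = (
        f"Question: {question}\n\n"
        "Answer the question. Then give:\n"
        "- rationale: a brief review of your own answer\n"
        "- p_correct: your estimated probability (0.0-1.0) that the answer is correct\n"
        "Reply only with JSON using the keys: answer, rationale, p_correct."
    )
    return json.loads(call_llm(prompt))
```

Parsing will obviously fail if the model doesn't return clean JSON, so in practice you'd want some retry or fallback handling around it.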
Models can get caught by what they start to say early. If the model goes down a path that seems like a likely answer early on, and that turns out to be a false lead or dead end, it will end up making up something plausible-sounding to try to finish that line of thought, even if it's wrong. This is why chain of thought and other "pre-answer" techniques improve results.
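A rough illustration of the difference, reusing the hypothetical call_llm() helper from the sketch above. The second prompt makes the model lay out its reasoning before committing to an answer, instead of locking onto whatever it started saying first:

```python
def answer_direct(question: str) -> str:
    # One-shot: the model commits to an answer immediately, with no room
    # to back out of an early wrong turn.
    return call_llm(f"{question}\nAnswer concisely.")


def answer_with_reasoning(question: str) -> str:
    # Chain-of-thought style: reasoning comes first, the final answer last,
    # so the answer can condition on the worked-through steps.
    prompt = (
        f"{question}\n"
        "Think through the problem step by step first. "
        "Only after your reasoning, write 'FINAL ANSWER:' followed by the answer."
    )
    return call_llm(prompt)
```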
Because of the way transformers work, they have very good hindsight, so they can realize that they've just said things that are incorrect much more often than they can avoid saying incorrect things.
Does that extra information come from a process separate from the LLM network? If not, then, given that the same output is not guaranteed from the same input as usual, all bets are off, correct?
Sorry for the late reply, but if you read this: there is research showing that prompting an LLM to take a variety of perspectives on a problem (IIRC it was demonstrated with code) and then finding the answer with the most common ground improved benchmark scores significantly. So, for example, if you ask it to provide a brief review and likelihood of the answer, and repeat that process from several different perspectives, you can get some very solid data; a sketch of that loop follows below.
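A sketch of that kind of multi-perspective loop, again assuming the hypothetical call_llm() helper from earlier. The specific personas and the "last line is the final answer" convention are my placeholders, not from the research mentioned:

```python
from collections import Counter

# Placeholder perspectives; swap in whatever roles make sense for the task.
PERSPECTIVES = [
    "a careful code reviewer",
    "a tester looking for edge cases",
    "a domain expert double-checking facts",
]


def consensus_answer(question: str) -> str:
    answers = []
    for persona in PERSPECTIVES:
        prompt = (
            f"From the perspective of {persona}, answer the following.\n"
            f"{question}\n"
            "Give a brief review and the likelihood that your answer is correct, "
            "then on the last line write only the final answer."
        )
        reply = call_llm(prompt)
        # Keep just the final-answer line for the vote.
        answers.append(reply.strip().splitlines()[-1])
    # Return the answer the perspectives most agree on.
    return Counter(answers).most_common(1)[0][0]
```

Exact string matching is a crude way to find common ground; for code or free-form answers you'd want something fuzzier (normalizing whitespace, running tests, or asking the model itself to judge which answers agree).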