The first one is the only one that can accurately describe where the user intended for the message to land in the conversation. All the others are subject to network delays, so if you order messages by anything other than the first one, you are going to get a bad experience if you try to participate in a busy conversation with a poor connection.
The first one is entirely un-trustable, because the user-submitted timestamp can be at any arbitrary point in time, from t=0 to t=infinity. So the receiving system can use that timestamp as an important bit of signal, but it can't really treat it as strictly authoritative, at least not if it expects to maintain a coherent set of events overall.
Exactly, that's why I'm saying it's a more challenging problem than it appears and there's no one solution that always gives the best experience in every case. I personally think a hybrid approach of allowing some limited discrepancy between user and server timestamps is probably the best you can do.