> It baffles me that any maintainer would merge code like the one highlighted in the issue, without knowing what it does.
I don't know if it is relevant to any specific case being discussed here, but if the exploit route is gaining access to the accounts of previously trusted submitters (or otherwise being able to impersonate them), it could be a case of a team with a pile of PRs to review (many of which are the sloppy unverified LLM output that is causing problems for some popular projects) letting through an update from a trusted source that has been compromised.
It could correctly be argued that this is a problem caused by laziness and corner-cutting, but it is still understandable: projects essentially run by a volunteer workforce have limited time to spare.
There is quite a difference between fixing grammar and the fuller rewording that is often applied, especially by LLM-based writing tools. The distinction is much more of a grey area when you're not talking about a language you are fluent in, because you can't tell the difference between idiomatic equivalences and full-on rewording that will change your perceived tone⁰ - the tool being used could be doing more than you think, and not in a good way.
And if you are using the tool, "AI" or not, to translate, it is even worse: you often only need one cycle of [your primary language] -> [something else] -> [your primary language] to see what a mess that can make.
I'm attempting to learn Spanish¹ and when I'm writing something, or practising something that I might say, I'll write it entirely away from tech (I even have a proper chunky paper dictionary and grammar guide to help with that!) other than the text editor I'm typing in, and then I'll sometimes give it to a tool to look over. If that tool suggests what looks like more than just "that's the wrong tense, you should have an accent there, etc." I'll research the change rather than accepting it as-is.
--------
[0] or even, potentially, perceived meaning
[1] I like the place and want to spend more time down there when I can, I even like the idea of living there fairly permanently when I no longer have certain responsibilities tying me to the UK², and I'd hate to be ThatGuy™ who rocks up and expects everyone else to speak his language.
[2] and the shithole it has the potential to become over the next decade - to the Reform supporters and their ilk who say, without any hint of irony, “if you don't like it why don't you go somewhere else” I reply “I'm working on that”.
True AI doing away with recruiters is something that I could live with. There are some groups of people I'd rather not interact with, and they are on that list! But dealing with glorified predictive text is worse than dealing with real recruiters.
Phrasing it like that makes you sound like one of those "I didn't get help when I was younger, so why should anyone else get help now?" types, who highlight their own entitlement and luck by trying to frame others as entitled.
You might not be, but it sounds that way to me.
And if you think this knee-jerk reaction is unfair, let that be a lesson to you! :)
It's like the Google¹ advert: "if your phone can answer your friend's text, shouldn't it, instead of them waiting for you?". No, it 'king shouldn't. And if I find they are using automation to talk to me, I'll talk to someone else. Or I'll bot up myself and have my people talk to their people…
--------
[1] I think it was one of theirs, could have been one of the Android phone makers that has gone all-in on nagging me to give their bot something to do with itself.
> zero desire to keep any copy of work code or other data on any personal device
Same. I won't even have Teams or Authenticator on my phone, unlike most others here (though wrt Teams, that is at least as much about not wanting work to bother me as it is about the danger of data seepage). I need the authenticator to do the job, but I keep an old factory-reset phone that has that (and, just in case, Teams) on it.
> But when I was younger? I could totally imagine getting a big juicy dataset like that and wanting a copy for myself.
I'm pretty sure I never would have done. I've always resisted knowing credentials and personal information that aren't mine (so if anything untoward happens with/using that information there is no way it can be my fault/doing, as well as for less selfish reasons), despite people falling over themselves to do things like tell me their passwords & such when they wanted some form of tech support.
But I think there is a different attitude to data risk in that age group today. They've grown up in a world where very little is really private, and every app and its dog has wanted their contact details and other information (and all too often information about their friends & family), so the idea that data is a free-for-all is dangerously normalised in their heads.
I find older people are similarly very lax with their own data, in fact often being rather too trusting of others generally, but not so much with other people's. There are a lot more people who are appropriately careful (or even paranoid) in their 30s/40s/50s (I'm late 40s myself) - I think we are lucky to be in the middle, exposed to information dangers enough not to have that "naivety of age", but not desensitised by having lax information security pushed at us from an early age.
> But I think there is a different attitude to data risk in that age group today. They've grown up in a world where very little is really private, and every app and its dog has wanted their contact details and other information
Counterpoint from a UK/EU perspective…
Anybody new being onboarded is given (company-compulsory) GDPR training if their role involves any handling or processing of personal data whatsoever. Data security and privacy are being treated quite seriously here; though unfortunately not seriously enough IMO.
Counterpoint also from a UK perspective: unfortunately a lot of people give no more than lip service to that training, and there are a great many people who have been in that sort of role who have avoided taking part in it at all. It sometimes worries me how seriously some people don't take the matter, and how many see that sort of regulation as a pointless, "innovation"-preventing inconvenience. Heck, I know one fool who gave "the overreach manifest in GDPR" as one of his reasons for voting for Brexit.
My DayJob company, and most of the people working here, do have the right attitude, as do most of our clients (if only because of the potential punishments, both in terms of fines and a slapping from the court of public opinion, if something done wrong has significant repercussions), but I do worry about how many people and companies seem not to care at all.
To be fair, it is apparent the tide is turning and awareness of data privacy is growing; even if this is unfortunately due to the increasing damage data breaches are causing.
In situations where you have spare CPU power but not spare GPU power, because your GPU(s) & VRAM are allocated to other tasks, you might prefer to use what you have rather than paying for an upgrade (even if that means the task will run more slowly).
If you want to run this on a server to pipe the generated speech to a remote user (live, or generating it to send at some other appropriate moment) and your servers don't have GPUs, then you either have to change your infrastructure, use CPU, or not bother.
Renting GPU access on cloud systems can be more expensive than CPU, especially if you only need GPU processing for specific, occasionally run tasks. Spinning up a VM to serve a request then pulling it down is rarely as quick as cloud providers like to suggest in their advertising, so you end up keeping things alive longer than absolutely needed, meaning the quoted spot-pricing rates are lower than what you actually end up paying.
> I'm not sure why anyone would choose varchar for a column in 2026
The same string takes roughly half the storage space, meaning more rows per page, and therefore a smaller working set needed in memory for the same queries, and less IO. Any indexes on those columns will be similarly smaller. So if you are storing things that you know won't break out of the standard ASCII set⁰, stick with [VAR]CHAR¹; otherwise use N[VAR]CHAR.
Of course, if you can guarantee that your stuff will only be used on SQL Server versions recent enough to be configured with UTF-8 collations, then default to that instead, unless you expect data in a character set where UTF-8 might increase the data size over UTF-16. You'll get the same size benefit for pure ASCII without losing wider character-set support.
Furthermore, if you are using row or page compression it doesn't really matter: your wide-character strings will effectively be UTF-8 encoded anyway. But be aware that there is a CPU hit for processing compressed rows and pages on every access, because they remain compressed in memory as well as on disk.
--------
[0] Codes with fixed ranges, etc.
[1] Some would put that the other way around: "use NVARCHAR if you think there might be any non-ASCII characters". But defaulting to NVARCHAR and moving to VARCHAR only if you are confident is the safer approach IMO.
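To put rough numbers on the size difference, here's a quick sketch (Python used purely as a byte-counting calculator; Latin-1 stands in for a single-byte VARCHAR collation, UTF-16-LE for NVARCHAR's internal encoding, and UTF-8 for a SQL Server 2019+ UTF-8 collation):

```python
# Byte counts for the same 20-character ASCII string under the three
# storage encodings discussed above.
s = "customer_reference_1"  # 20 ASCII characters (a hypothetical column value)

varchar_bytes = len(s.encode("latin-1"))     # 1 byte per char -> 20
nvarchar_bytes = len(s.encode("utf-16-le"))  # 2 bytes per char -> 40
utf8_bytes = len(s.encode("utf-8"))          # ASCII stays 1 byte -> 20

print(varchar_bytes, nvarchar_bytes, utf8_bytes)  # 20 40 20
```

That factor of two is what halves rows-per-page and doubles index size when NVARCHAR is used for ASCII-only data.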
utf-16 is more efficient if you have non-english text; utf-8 wastes space with longer multi-byte sequences. but the real reason to always use nvarchar is that it remains sargable when parameters are implicitly cast to nvarchar.
UTF-16 is maybe better if your text is mostly made of codepoints which need 3 UTF-8 code units but only one (thus 2 bytes) UTF-16 code unit. This is extremely rare for general text and so you definitely shouldn't begin by assuming UTF-16 is a good choice without having collected actual data.
The old defense of 16-bit chars, popping up in 2026 still! UTF-8 is efficient enough for all general-purpose uses.
If you're storing gigabytes of non-Latin-alphabet text, and your systems are constrained enough that it makes a difference, 16-bit is always there. But I'd still recommend anyone starting a system today not to worry and use UTF-8 for everything.
What do you mean by non-English text? I don't think "Ä" will be more efficient in UTF-16 than in UTF-8. Or do you mean UTF-16 wins in the case of non-Latin scripts? I always had the impression that UTF-8 wins on the vast majority of symbols, and that for very complex character sets it depends on the width whether UTF-16 can accommodate them. On a tangent, I wonder if emojis would fit that bill too…
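For what it's worth, the byte counts bear that out (UTF-16-LE here, i.e. without a BOM, which is how databases store it internally): "Ä" is a tie, UTF-16 only wins on codepoints in the U+0800–U+FFFF range, and emoji outside the BMP are 4 bytes in both:

```python
def sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 bytes, UTF-16 bytes) for a string."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

print(sizes("a"))   # (1, 2)  ASCII: UTF-8 wins
print(sizes("Ä"))   # (2, 2)  Latin-1 supplement: a tie
print(sizes("日"))  # (3, 2)  CJK: UTF-16 wins
print(sizes("😀"))  # (4, 4)  outside the BMP: a tie (surrogate pair in UTF-16)
```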
I am not sure if you mean me, as I just asked a question. I wonder what the best way is to handle this disparity for international software. It seems like either you punish the Latin alphabets, or the others.
> I wonder what the best way is to handle this disparity for international software. It seems like either you punish the Latin alphabets, or the others.
there are over a million codepoints in unicode, thousands for latin and other language-agnostic symbols, emojis, etc. utf-8 is designed to be backwards compatible with ascii, not to efficiently encode all of unicode. utf-16 is the reasonably efficient compromise for native unicode applications, hence it being the internal format of strings in C# and sql server and such.
the folks bleating about utf-8 being the best choice make the same mistake as the "utf-8 everywhere manifesto" guys: stats skewed by a web/american-centric bias - sure utf-8 is more efficient when your text is 99% markup and generally devoid of non-latin scripts, that's not my database and probably not most people's
> sure utf-8 is more efficient when your text is 99% markup and generally devoid of non-latin scripts, that's not my database and probably not most people's
I think this website's audience begs to differ. But if you develop for South Asia, I can see the pendulum swinging to UTF-16. But even then you have to account for this:
«UTF-16 is often claimed to be more space-efficient than UTF-8 for East Asian languages, since it uses two bytes for characters that take 3 bytes in UTF-8. Since real text contains many spaces, numbers, punctuation, markup (for e.g. web pages), and control characters, which take only one byte in UTF-8, this is only true for artificially constructed dense blocks of text. A more serious claim can be made for Devanagari and Bengali, which use multi-letter words and all the letters take 3 bytes in UTF-8 and only 2 in UTF-16.»¹
In the same vein, with reference to³:
«The code points U+0800–U+FFFF take 3 bytes in UTF-8 but only 2 in UTF-16. This led to the idea that text in Chinese and other languages would take more space in UTF-8. However, text is only larger if there are more of these code points than 1-byte ASCII code points, and this rarely happens in real-world documents due to spaces, newlines, digits, punctuation, English words, and markup.»²
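The effect both quoted passages describe is easy to reproduce: a dense block of CJK text is smaller in UTF-16, but surround the same text with even modest amounts of ASCII markup, spaces, and digits and UTF-8 pulls ahead (the sample strings are just illustrative):

```python
def utf8_len(t: str) -> int:
    return len(t.encode("utf-8"))

def utf16_len(t: str) -> int:
    return len(t.encode("utf-16-le"))

dense = "日本語の文章"                          # pure CJK: 3 bytes/char in UTF-8, 2 in UTF-16
mixed = '<p class="note">日本語の文章 123</p>'  # same text inside ASCII markup

print(utf8_len(dense), utf16_len(dense))  # 18 12 -> UTF-16 smaller
print(utf8_len(mixed), utf16_len(mixed))  # 42 60 -> UTF-8 smaller
```

Every ASCII byte in the markup costs double in UTF-16, so the crossover comes quickly.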
The .net ecosystem isn't happy with utf-16 being the default, but it is there in .net and Windows for historical reasons.
«Microsoft has stated that "UTF-16 [..] is a unique burden that Windows places on code that targets multiple platforms"»¹
the talk page behind the utf-16 wiki is actually quite interesting. it seems the manifesto guys tried to push their agenda there, and the allusions to "real text" with missing citations are a remnant of that. obv there's no such thing as "real text" and the statements about it containing many spaces and punctuation are nonsense (many languages do not delimit words with spaces, plenty of text is not mostly markup, and so on..)
despite the frothing horde of web developers desperate to consider utf-16 harmful, it's still a fact that the consortium optimized unicode for 16 bits (https://www.unicode.org/notes/tn12) and their initial guidance to use utf-8 for compatibility and portability (like on the web) and utf-16 for efficiency and processing (like in a database, or in memory) is still sound.
Interesting link! It shows its age though (22 years): it makes the point that utf-16 was already the "most dominant processing format", but if that were the deciding factor, then utf-8 would be today's recommendation, as utf-8 is now the default for online data exchange and storage; all my software assumes utf-8 as the default as well. But I can't speak for people living and trading in places like South Asia, like you.
If one develops for clients requiring a varying set of scripts, one could sidestep the ideological discussion and just make an educated guess about the ratio of utf-8 vs utf-16 penalties. That should not be complicated; sometimes utf-8 requires one more byte than utf-16 would, sometimes it's the other way around.
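Better than guessing, measuring that ratio on a representative sample of your own data is a one-liner (the sample strings below are hypothetical stand-ins for real corpus data):

```python
def utf16_to_utf8_ratio(text: str) -> float:
    """> 1.0 means UTF-8 is smaller for this text; < 1.0 means UTF-16 is."""
    return len(text.encode("utf-16-le")) / len(text.encode("utf-8"))

# Run this over whatever is representative of the columns you're designing:
print(utf16_to_utf8_ratio("mostly ASCII product codes: SKU-12345"))  # 2.0
print(utf16_to_utf8_ratio("日本語の文章"))                            # ~0.67
```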
The non-sargability is an optimizer deficiency IMO. It could attempt the cast, just like this article does manually in code: if it succeeds, use the index; if it fails, scan and cast a million times the other way.
implicit casts should only widen, to avoid quiet information loss. if the optimizer behaved as you suggest the query could return incorrect results, and potentially more rows than expected, with even worse consequences
It should not return incorrect results: if the nvarchar only contains ASCII it will cast perfectly; if it doesn't, take the slow scan path. It's a simple check, and the same work it's doing for every row in the current behaviour, except done once and more restricted. Can you give me an example of an incorrect result here?
I am not talking about the default cast behaviour from nvarchar to varchar, but a specific narrow check the optimizer could use to make a decision in the plan - ASCII or not - with no information loss, because it will do the same thing as before if the parameter does not pass the one-time check.
By far the most common cause of this situation is ASCII-only data in an nvarchar parameter because, as in this example, the client language uses an nvarchar equivalent for all strings - which is pretty much universal nowadays - and that is the default conversion when using a SQL client library. One must remember to explicitly cast rather than letting the db do it, which is the expected behaviour and the source of much confusion.
This would be purely an optimization: a fast-path check, falling back to the current slow path otherwise. Correct results always, with much faster results when only ASCII is present in the string.