
They're kind of bad at pretty much all languages, except simpler forms of English and Python. The tonality of the big LLMs tends to be distinctly inhuman as well.

I suspect it'll be hard to find anywhere near as much material in some obscure, dying language as there is of either of those in the common training sets.



What is "they"? Are you saying transformer architecture somehow is biased towards English? Or are you saying that existing LLMs have that bias?

The only way this project is going to make sense is to train a model fresh on text in the language to be preserved, to avoid accidentally corrupting it with English. And if it's trained fresh on only target-language content, I'm not sure we can generalize from the whole-internet models you're familiar with.
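
Concretely, the sketch below is roughly what "fresh" would mean: a tokenizer built on the target-language corpus alone, and a small causal LM with randomly initialized weights trained on the same text. This assumes the HuggingFace stack; corpus.txt, the model size and the hyperparameters are all placeholders for illustration, not a recommendation.

  import os
  from datasets import load_dataset
  from tokenizers import ByteLevelBPETokenizer
  from transformers import (DataCollatorForLanguageModeling, GPT2Config,
                            GPT2LMHeadModel, GPT2TokenizerFast, Trainer,
                            TrainingArguments)

  # 1. Train a byte-level BPE tokenizer on the target language alone,
  #    so no English vocabulary leaks in.
  os.makedirs("tok", exist_ok=True)
  bpe = ByteLevelBPETokenizer()
  bpe.train(files=["corpus.txt"], vocab_size=16_000,
            special_tokens=["<|endoftext|>"])
  bpe.save_model("tok")
  tokenizer = GPT2TokenizerFast.from_pretrained("tok")
  tokenizer.pad_token = "<|endoftext|>"

  # 2. A small, randomly initialized model -- no pretrained English weights.
  config = GPT2Config(vocab_size=tokenizer.vocab_size, n_positions=512,
                      n_embd=512, n_layer=6, n_head=8)
  model = GPT2LMHeadModel(config)

  # 3. Tokenize the corpus and train a plain causal LM on it.
  ds = load_dataset("text", data_files="corpus.txt")["train"]
  ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
              batched=True, remove_columns=["text"])
  collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
  args = TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=8)
  Trainer(model=model, args=args, train_dataset=ds,
          data_collator=collator).train()

Whether a small corpus in a dying language is enough for that to produce anything useful is exactly the open question.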


I don't really care about the minutiae of the technical implementation; I'm speaking from the experience of pushing text into LLMs, locally and remotely, and getting translations in some direction or other back.
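
Concretely, something like this, against a local OpenAI-compatible endpoint (llama.cpp or Ollama style); the URL and model name are placeholders, not a specific setup:

  import requests

  def translate(text, src="Swedish", dst="English"):
      r = requests.post("http://localhost:8080/v1/chat/completions", json={
          "model": "local-model",
          "messages": [
              {"role": "system",
               "content": f"Translate the user's text from {src} to {dst}. "
                          "Output only the translation."},
              {"role": "user", "content": text},
          ],
          "temperature": 0,  # keep output as deterministic as the model allows
      })
      r.raise_for_status()
      return r.json()["choices"][0]["message"]["content"]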

To me it doesn't make sense. It seems like an awful way to store information about a fringe language, but I'm certainly not an expert.


  and getting translations in some direction or other back.
This seems to upset a lot of English speakers: to primarily non-English speakers, LLM output often reads as translated. But hey, it's n>=2 even at HN now.


I don't know, the translation errors are often pretty weird, like 'hot sauce' being translated to the target equivalent of 'warm sauce'.

Since LLMs work by probabilistically stringing together sequences of words (or tokens or whatever), I don't expect them to ever become fully fluent, unless natural language degenerates and loses a lot of its flexibility and flourish and analogy and so on. Then we might hit some level of expressiveness that they can actually simulate fully.
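
That's the whole mental model, reduced to a toy. Here next_token_probs stands in for the network; it's not any real API:

  import random

  def generate(next_token_probs, prompt, steps=20):
      tokens = list(prompt)
      for _ in range(steps):
          dist = next_token_probs(tokens)          # {token: probability}
          choices, weights = zip(*dist.items())
          tokens.append(random.choices(choices, weights=weights)[0])
      return tokens

Everything a real model adds on top of that loop is a better estimate of the distribution, not a different kind of process.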

The current boom is different from, but also very similar to, the previous age of symbolic AI. Back then they expected computers to be able to automate warfare and language and whatnot, but the prime use case turned out to be credit checks.


Which languages have you tried that they're bad at? I've tried a bunch of European languages and they're all perfect (or perfect enough that I'd never know otherwise).


Swedish, German, Spanish as spoken in Spain, French, and Luxembourg French.

Sometimes they produce a decent translation; too often they trip up on grammar or vocabulary, or just assume that a given string of bytes always means the same thing. I find they work best in extremely formal registers, like documents produced by governments and lawyers.


Have you had the opportunity to interact with less wrapped versions of the models? There's a lot of intentionality behind the way LLMs are presented by places like ChatGPT/DeepSeek/Claude: you're distinctly talking to something that's actively limited in how it can speak to you.

The problem isn't exactly nonexistent outside of those wrappers, but they make it look worse than it is.


Does it matter? Even most Chinese models are trained on datasets that are less than 50% Chinese, last I checked, and they still manage to show an AliExpress accent that would be natural for a Chinese speaker with ESL training. They're multilingual but not language-agnostic: they can only grow an English-to-$LANG translation ability as long as English stays the dominant and defining language in the mix.


I've run a bunch locally, sometimes with my own takes on system prompts and other adjustments, because I've tried to make them less insufferable to use: not as absurdly submissive, not as eager to do things I haven't asked for, not as censored, stuff like that.
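
The adjustments are nothing fancy, roughly this kind of thing. The wording is my own taste, not a recipe, and the endpoint and model name are placeholders as in the sketch upthread:

  SYSTEM = ("Answer plainly and directly. Do not apologize, do not flatter, "
            "and do not do anything that was not asked for.")

  payload = {
      "model": "local-model",
      "messages": [{"role": "system", "content": SYSTEM},
                   {"role": "user", "content": "..."}],
      "temperature": 0.4,
  }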

I find they struggle a lot with things like long sentences and advanced language constructs, regardless of which natural language they're trying to simulate. When that doesn't matter, they're useful anyway: I can get a rough idea of the contents of documents in languages I'm not fluent in, or make the bulk of a data set queryable in another language. But it's a janky hack, not something I'd put in front of the people paying my invoices.
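
The janky hack in practice looks about like this: rough-translate one column of a data set so it can be searched in English. translate() is the sketch from upthread, and the file and column names are made up:

  import csv

  with open("docs_sv.csv", newline="") as fin, \
       open("docs_en.csv", "w", newline="") as fout:
      reader = csv.DictReader(fin)
      writer = csv.DictWriter(fout, fieldnames=reader.fieldnames + ["text_en"])
      writer.writeheader()
      for row in reader:
          row["text_en"] = translate(row["text"])  # good enough to query, not to publish
          writer.writerow(row)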

Maybe there's a trick I ought to learn, I don't know.


Have you tried Mistral's models? They're explicitly trained on a bunch of languages, not only English.


Yes.



