
You don't need parallel corpora for all language pairs in a "predict the next token" LLM. What I'm saying is that if an LLM is trained on English, French, and Spanish, and there is Eng-to-French parallel data, you don't need Eng-to-Spa data to get Eng-to-Spa translations.
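A rough sketch of the mechanism, assuming a multilingual sentence encoder (the model name below is just one publicly available example): translations of the same sentence land near each other in one shared embedding space, which is what lets supervision on one pair transfer to a pair the model never saw aligned.

    # Sketch: a shared multilingual embedding space places translations
    # of the same sentence close together, even for language pairs with
    # no aligned data supplied at inference time.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    sentences = [
        "the dog sleeps",    # English
        "le chien dort",     # French
        "el perro duerme",   # Spanish
    ]
    embeddings = model.encode(sentences, convert_to_tensor=True)
    # All pairwise similarities come out high, despite no explicit
    # Eng->Spa mapping being given to the model.
    print(util.cos_sim(embeddings, embeddings))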


How would an LLM figure out which words to translate animal sounds into? Where would it learn that information? We don't know what animals are communicating, if they even have a language of sorts. There's no mapping.


Potentially the same way it learns to translate concepts that have no mapping for a given language pair in the dataset. Like I said, not every language in an LLM's corpus has parallel text in every other language to map to.


Spanish and French are both Romance languages and will have massive token overlap. Not likely to be so lucky with whale songs.
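You can see that overlap with any shared subword tokenizer; a quick sketch (bert-base-multilingual-cased here, just as one example; the exact shared pieces depend on the tokenizer):

    # Sketch: cognate-heavy Romance sentences share subword pieces under
    # a common multilingual tokenizer. The exact overlap is
    # tokenizer-dependent; the point is that it is non-empty for
    # related languages, and no analogue exists for whale audio.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    fr = set(tok.tokenize("le train arrive à la station centrale"))
    es = set(tok.tokenize("el tren llega a la estación central"))
    print(fr & es)  # shared pieces, e.g. common function words and stems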


It's not about being Romance languages or not. The same holds for Korean/Mandarin or any pair of distant human languages.

>Not likely to be so lucky with whale songs

Maybe. Maybe not.



