
I wonder if instead (or in parallel), Norway should build a set of training data and share it (for free) with all the model builders.

Seems like making the frontier models know the Norwegian language and culture is a better (or additional!) way to reach the end they are going for here.




The frontier models know Norwegian just fine. They can also adapt to Norwegian dialects, and even ape old Norwegian fairly well.

E.g. I had Claude describe the novel "De knyttede næver" from 1911 in Norwegian orthography ca. 1911, as it's a novel I've read, and it does a good job.

What it lacks is an understanding of Norwegian literature, culture and history. It had to look up "De knyttede næver", which was one of the best-selling Norwegian novels around the time it was published, before I could get anything out of it (ChatGPT does better; in thinking mode in particular it gives a detailed summary).

While not exactly well known today, the author was a prominent newspaper journalist for decades, and the novel series is well enough known that e.g. there's a Norwegian singer who took his stage name from the protagonist, and it was covered in Norwegian papers and books for decades (partly because of controversy over the author's political views and how they coloured his novels). So it does feel like a reasonable test, and it reveals a quite significant knowledge gap.

I do agree with you that it'd be better if the data set from the national library was made more accessible, though it seems a major addition here is that they have a deal to train on copyrighted data locked away in their archives, which they are otherwise limited in how they can use.

But even just making the out-of-copyright data in their collections accessible would be a great start.


Odd, I'd imagine Wikisource (in many/all languages) would be part of training data for all LLMs with SOTA ambition?

https://no.wikisource.org/wiki/De_knyttede_n%C3%A6ver


You'd think so. It seems like there are a lot of odd gaps like that.

I also have a favourite English-language PhD thesis I ask every new model about, and they still struggle to find it even though there's a Wikipedia article about it that links to a blog post I wrote about it.

Anyone who thinks they've exhausted even publicly crawlable resources should ask them about some obscure stuff.


The models don't retain their full training data set.

No, but they do retain enough that it is interesting what they fail to retain.

You might be surprised if you take this approach: give key words and phrases in small amounts, with each sentence of the prompt building on the previous one. Take an example that is not very hard, like the original text of Lewis Carroll's Alice in Wonderland. A quick question might get things sort of wrong, or miss details. But if you guide the LLM to a certain part of the story, then to a certain set of characters in that part of the story, then to a certain statement or dramatic moment with those characters, you might get very specific detail that is close to line-by-line accurate. On the other hand, if you ask a quick, ordinary question about the same part of the story without supplying context and character names, you get something equally vague. YMMV
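A rough sketch of that narrowing, just to make the idea concrete. Everything here is made up for illustration: the function name `build_guided_prompts`, the wording of each step, and the choice of scene are all assumptions, and it only builds the prompt strings (it does not call any model API).

```python
def build_guided_prompts(work, part, characters, moment):
    """Return prompts that narrow step by step, each building on the last.

    Hypothetical helper: the step wording is an illustrative assumption,
    not a tested recipe.
    """
    return [
        f"Let's talk about {work}.",
        f"Focus on this part of the story: {part}.",
        f"In that part, consider these characters: {', '.join(characters)}.",
        f"Now recall, as closely to the original wording as you can, {moment}.",
    ]


# Example inputs are illustrative choices, not taken from the thread.
prompts = build_guided_prompts(
    work="Lewis Carroll's Alice in Wonderland",
    part="the mad tea party",
    characters=["Alice", "the Hatter", "the March Hare"],
    moment="the exchange about the riddle",
)

for step, prompt in enumerate(prompts, start=1):
    print(step, prompt)
```

You would send these one at a time in the same conversation, so each reply sits in context before the next, more specific prompt arrives.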

For the PhD thesis in question, I've actually tested a lot of requests about different parts of it, and both Claude and ChatGPT still draw a total blank if you don't let them do searches.

Why should they share all this data with the greedy American corporations that are stealing everyone's data for their own profit? Much better to keep the legal agreement with the national institutions and possibly develop something actually useful to their own country.

You are contradicting yourself. If you're hoarding the data for yourself you're not going to develop something useful. Sharing the data means that it will be integrated into the big LLMs, which will be useful "for their own country".


