Norway's 2 petabytes of Huawei flash storage and LLM training (blocksandfiles.com)
180 points by rbanffy 7 hours ago | 87 comments



> Marius Husnes, the Head of IT Platform at the library (Nasjonalbiblioteket) discussed the project at Huawei’s ID Forum 2026 in Paris, saying that no commercial LLM provider was developing a local (Norwegian) language LLM. He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language.

I am not overly confident that Marius Husnes knows what he’s talking about here.


It sounds plausible enough to get subsidies.

How true is this statement: "He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language."

I thought all big players already train on basically everything remotely available to them no matter the language or quality, so his take sounds like an opinion formed in the early days of generally available LLMs.


If you want LLMs to have knowledge of the Norwegian language, wouldn't the most obvious thing to do be to build a good training dataset and make the dataset widely available? Why go to the expense of training your own model, especially when it will be inferior to state of the art models?

I task GPT/Claude with researching stuff that pertains to very specific cultural or legal aspects in French politics, on a daily basis. Even though French is a way more common language globally than Norwegian, these models still haven't figured out that, no matter the language I myself speak to them (German or English depending on my mood) their web searches need to be done in French to return reasonable results. I have to remind them every time lest they come back with "uh, didn't find anything relevant, here take some hallucinations instead."

So, given the anglo-centrism of current models, my confidence in American providers giving any shits about non-american users/use-cases is pretty low. And lower the smaller the language community is.


Aren’t you already using English in the LLM convo? Telling the model to use French for research or to find resources in French seems like a reasonable step.

If you’re doing this on a daily basis, then you should have an AGENTS.md that accumulates directional instructions like this.

This is how you use the tool correctly.

There’s this weird pattern I’ve noticed where people expect LLMs to require zero effort or proficiency on their part, and when the LLM isn’t perfect without it, of course it wasn’t; LLMs suck.


The issue is that French, Italian, African, Japanese people shouldn't have the inconvenience of instructing the LLM tool to get the basic facts about their own culture. They should use an LLM that has already been trained like that by default. Nobody has obligation to use a tool that thinks it is talking to an American. If I go to Google for example I want to get facts about my own country in my own language.

Wouldn't those people be asking the questions in their own language in the first place? The model will reply in the language you use. This thread is about people asking for information about a language that is not the one they are messaging the LLM in

> Nobody has obligation to use a tool that thinks it is talking to an American.

Then add top-level instructions saying what country you're from, what country you live in now, and which language you speak. This isn't that hard.


>Nobody has obligation to use a tool that thinks it is talking to an American

Very very emphatic agree from my end, thanks.


If you ask in French, it searches in French, right?

I have the opposite problem, where I'll ask in English, about something in a foreign country, the results it finds will all be in that foreign language, and the LLM will switch languages and respond in that language (which I don't speak).

So then I have to ask it "can you repeat that in English please."

I keep waiting for the new GPT-Definitely-AGI-For-Real-This-Time to fix it but it's still there.


What incentives does OpenAI have to make sure the AI actually works well with Norwegian beyond capturing a (small) Norwegian market? What incentives do they have to take Norwegian values into consideration, or to preserve Norwegian culture into the future? The matter is also a question of national sovereignty, so to simply release the data and nicely ask foreign companies to solve the problem for you, would be a fool's move

It's also a bit funny because Norway definitely has enough money to hire a team of Anthropic's best to go out there and train them a model that does whatever they want. They probably have enough money to fund their own Anthropic competitor.

absolutely. somebody online was wanting an LLM with Georgian language support, and that's exactly what i suggested: start digitizing Georgian text.

Yeah, was about to comment that too. Instead of training a new model and new weights exclusively for Norwegian (and expecting/wanting every other small/medium-sized country to do the same), which seems infinitely harder, they could have made high quality transcriptions and translations of the stories currently described only in Norwegian into English, and made it all public. I guess there still would be a worry that it'd be counted as "less important" compared to other history, news and culture about other countries.

> high quality transcriptions and translations of the stories currently described only in Norwegian into English

You make it sound like an easier task than training an LLM. I'd argue it's not obvious, and would assume the contrary.


> Why go to the expense...

Answer: idiocy of decision makers and the desire to get resources by those who created the proposal.

I assumed Scandinavia has better decision processes but apparently I was wrong.


Not remotely true in my estimation. I don't really speak Norwegian, but I do speak Swedish (which means I mostly understand Norwegian as they're very similar). Every model I've tried speaking Swedish to does it perfectly. I'd be surprised if the same isn't true for Norwegian already.

Different models have been very different in this way. Almost ten years ago the French made a very large effort to capture languages. The release notes I read at the time, IIRC, had quite a few languages from South Asia / India, and in Africa. The language that was prominently missing was German, IIRC. I cannot say for the 2025-2026 models since so much has happened, but models are not equal.

Current-best models are pretty fluent at major languages and cultures, so it's untrue at least for the "any" qualifier. Performance is barely affected or might be even better sometimes. However English patterns can subtly leak into native patterns of other languages. It's obviously very different for low-resource languages, but to improve them you need more data, not a new model.

>Current-best models are pretty fluent at major languages and cultures

strong disagree on that one. As a German interacting with ChatGPT, even in German it gives me the feeling of talking to the Pluribus people, which reminds me of an anecdote of Walmart failing in Germany because people were freaked out by the constantly upbeat, smiling employees.

Understanding a culture is a very different task than translating the syntax of a text, and these systems might be capable of syntactic fluency but they do not really understand culture. You have to metaphorically abuse these models until they stop sounding like the crossover of a HR department person and a Mormon missionary


Yeah, and alignment is all about how to be less evil, which is no easy job... I can just imagine a Chinese LLM rendering 1989 Tiananmen Square as an incident orchestrated by the CIA which the CCP successfully thwarted, etc.

I'm a Norwegian, and I use the national library almost every day for searching through texts. They have truly one of the best working user interfaces (and functionality) for searching through the massive amounts of text.

It's really fantastic. I just wished there were fewer restrictions on the content that is accessible.

(a lot is only accessible from Norwegian IP addresses, so it's one of the main reasons I maintain a VPN as I'm Norwegian but live in the UK; a second set is only available from the IP addresses of libraries or research institutions - still huge amounts that are generally available, though)


The lack of a universal search engine is very frustrating. Why can't I search within TV subtitles?

> The Olivia system is an HPE Cray Supercomputing EX system, with 448 GPUs and 64,512 CPU cores.

Training a sovereign LLM with this meager hardware as opposed to a LORA on some open source model seems like a huge mistake and a potential red flag.

There is no way these people have the resources to train a fully fledged LLM, so claiming that is their goal makes me think they don't intend for the LLM to be useful.

Which begs the question, whose money are they wasting - and why?


It may not be useful to anyone outside, but it's possible that one of the goals is institutional learning (that is, embedding the knowledge in how to build LLMs in an organization).

Even though it's nominally the national library behind this, they were probably chosen (as per the article) because they legally own and can use all NO material for this end. I'd guess researchers from related entities like unis will be involved in the process.


They successfully have made PoC finetunes before, so the next step is training fully fledged LLMs.

I don’t think they aim for anything worthwhile. The finetunes were incredibly broken. I’m guessing it’s more about having the method to do it. I’m not convinced it’s super useful but I’m not one to decide who gets to do what with the research funds.

One finetune I tried did make fun of humans expressing their feelings in the chat. Often.

One other finetune did hallucinate that it was a doctor and my baby had terrible diseases, every time I just wrote "hei" (with a generic neutral system prompt that likely triggered this behaviour though).

I think Olivia is big enough for what it’s used for. In my opinion it’s better to stay up to date and not waste too much money on hardware at the moment.


> this meager hardware

> they wasting - and why?

i18n language models are not an area frontier labs are focusing a ton of resources on (certainly not in Norwegian).

The corpus of content in Norwegian may not require very large clusters, and even if it does, this is the best the library could do; it would certainly be more than anyone else is investing in Norwegian models.

SOTA models do not have access to the quality of content that the national library does. The article mentions licensing with newspapers specifically, and the library has access to its own content archive.

English and Norwegian are not closely related language families, perhaps LoRA is not best approach?

I am curious if there is published research on how well localization works with LoRA depending on how far off the target language grammar/vocabulary is from English.

Projects like this typically have more than one objective: they are not only about building a SOTA project, but also about building and training foundational local talent, similar to universities launching satellites.


> English and Norwegian are not closely related language families, perhaps LoRA is not best approach?

Yes, they are. English is a West Germanic language. Norwegian is a North Germanic language. The French vocabulary in English obscures it a bit, but the two languages have similar grammar and the vocabulary has a huge number of close cognates.

E.g. day -> dag, ship -> skip, apple -> eple, cow -> ku (which makes more sense when you pronounce them correctly out loud), bairn (child; mostly Scotland and Northern England) -> barn, hop -> hopp, yule -> jul just to give a random selection of English Germanic words.

But more than that, the frontier models both a) know Norwegian quite well, b) certainly know German and Dutch well, and there's a continuum of language transfer around the North Sea, especially when accounting for sounds rather than modern orthography (e.g. to take a couple of examples from above: ship -> schip -> Schiff -> skib -> skip; day -> dag -> Tag -> dag). The "jump" to Dutch already weeds out most of the French. A lot of modern Norwegian orthography comes from Danish, which again shares more than modern Norwegian does with German.

Knowing any of these helps a lot with learning Norwegian and vice versa. E.g. I'm Norwegian, I've never learnt Dutch, but I have learnt English and German, and I can read Dutch fairly well from that alone.


This makes me deeply curious about how LLMs understand language. Do LLMs relate cognates more than words that are dissimilar in different languages? I wonder if that plays some role in the effectiveness of tokenization.

I have no idea if the similar spelling will somehow help - I used that mostly because it's a simple way of illustrating the close relationship, but I suspect you'd find that the meanings of closely related words are likely to more directly overlap.

The grammar is perhaps more likely to help. Similar word order etc. Even weirdness like German - my only top grade on a German essay in school was one where I on purpose ignored what I thought I knew about German and tried to evoke "old fashioned" Norwegian. The result was guessing at a bunch of grammatical structures that I didn't know were valid German. Turned out I was right about most of it - century old Norwegian was far closer to century old Danish, which was a lot closer to valid German, and enough so to impress my teacher enough to overlook a number of orthographic mistakes.


"Training a sovereign LLM with this meager hardware"

Norway has a sovereign fund worth O[MS|Apple|etc] except it is largely in readies and not pixie dust.

Whilst the UK frittered away North Sea oil profits, Norge squirreled them away instead.

So, if the grand dream of LLMs and AI does actually come to some sort of fruition and not simply another case of the Emperor's New Clothes combined with some lovely tulips and a dotcom boom and bust, then Norge can simply stuff shit loads of cash into buying whatever they need. Cash is king after all.

The beast they have described here is just a library system. I think I'd like my country's (UK) library system to have resources like that.

I don't think you are asking the right question: When you say "meager", I see "rather impressive PoC from a well resourced organisation"

You say tomato ...


The reason they have the largest sovereign wealth fund (aside from getting it right in the 80s, unlike the UK), is that there is quite a bit of regulation around where and how the money is invested.

It is run to maximise growth for example, so even though Norway is way ahead with electric car usage and infrastructure (presumably because they have a climate likely to be most affected by global warming/heating) their fund still invests in fossil fuels as they are a profit/growth opportunity.

Anyway, i don't think it's as easy as "simply stuff shit loads of cash into buying whatever they need". I believe there would be a serious political discussion needed for that to happen.


DeepSeek claims to have trained on something like 2k H800, this is ~0.5k GH200 … it’s not nothing. Sure they’re not going to _serve_ it at scale, but that’s not the point?

Also the line between “finetuning a base model” and “man this is a real good initialization” gets pretty blurry at scale.

Altogether a pretty presumptuous take.


The largest problem is available training data actually.

They have already done experiments with different sub-10B models, with both fine-tuning and training fully from scratch. And last I checked, the fully-from-scratch models captured the language in a better way.


That's what they have access to right now. I am sure that will change in the future as the project progresses.

What do you suggest, that they stop and wait until they have the right HW?


Also, it's Norway...

"Norway's sovereign wealth fund, officially known as the Government Pension Fund Global, is the world's largest sovereign wealth fund with assets exceeding $2 trillion. Established in 1990 and managed by Norges Bank Investment Management, it was created to channel surplus petroleum revenues into long-term global investments to benefit future generations."


> Which begs the question, whose money are they wasting - and why?

Norway is better run as a country than 99% of the countries on the planet, including the one that invented current LLM tech, so I'd give them the benefit of the doubt.


> meager hardware

Qwen was made on a cluster about that size.

And this is before anybody ever thought about optimizing the training process. (Currently it's just pytorch analyst-as-coder slop, with extremely overprovisioned quantizations, etc.)


I wonder if instead (or in parallel), Norway should build a set of training data and share it (for free) with all the model builders.

Seems like making the frontier models know Norwegian and their culture is a better (or additional!) way to reach the end they are going for here.


The frontier models know Norwegian just fine. They can also adapt to Norwegian dialects, and even ape old Norwegian fairly well.

E.g. I had Claude describe the novel "De knyttede næver" from 1911 in Norwegian orthography ca. 1911, as it's a novel I've read, and it does a good job.

What it lacks is an understanding of Norwegian literature, culture and history. It had to look up "De knyttede næver", which was one of the best-selling Norwegian novels around the time it was published before I'd get anything out of it (ChatGPT does better; in thinking mode in particular it gives a detailed summary).

While not exactly well known today, the author was a prominent newspaper journalist for decades, and the novel series is well enough known that e.g. there's a Norwegian singer that took his stage name after the protagonist, and it was covered in Norwegian papers and books for decades (partly because of controversy over the authors political views and how they coloured his novels), so it does feel like a reasonable test that reveals a quite significant knowledge gap.

I do agree with you that it'd be better if the data set from the national library was made more accessible, though it seems a major addition here is that they have a deal to train on copyrighted data locked away in their archives that they have limitations on the use of.

But even just making the out-of-copyright data in their collections available would be a great start.


Odd, I'd imagine Wikisource (in many/all languages) would be part of training data for all LLMs with SOTA ambition?

https://no.wikisource.org/wiki/De_knyttede_n%C3%A6ver


You'd think so. It seems like there are a lot of odd gaps like that.

I also have a favourite English language PhD thesis I ask every new model about that they still struggle to find even though there's a Wikipedia article about it that links a blog post I wrote about it.

Anyone who thinks they've exhausted even publicly crawlable resources should ask them about some obscure stuff.


the models don't retain their full training data set

You might be surprised if you take this approach: give key words and phrases in small amounts, each sentence of a prompt building on a previous sentence. Take an example that is not very hard, like the original text of Lewis Carroll's Alice in Wonderland. Although a quick question might get things sort of wrong, or miss details, if you guide the LLM to a certain part of the story, then a certain set of characters in that part of the story, then a certain statement or dramatic moment with those characters in that part of the story, you might get very specific detail that is close to line-by-line accurate. On the other hand, if you ask a quick, ordinary question about the same part of the story without supplying context and character names, you get something equally vague. YMMV

As a Norwegian this sounds like a mistake. Who will use this LLM? Where? For what? The underlying data could be made more easily searchable and digestible for agents in general if the goal is better knowledge of Norwegian culture.

I agree in principle.

That said, they are quite limited in what they are allowed to share of in-copyright works, and nb.no is a fantastic resource as it is (though you'll need a Norwegian IP address for too much of it - it's one of the main reasons I maintain a VPN) - if they are allowed to make it accessible there, it'd be great.

But they also have vast amounts of out-of-copyright data that I hope they'd make more easily accessible...


Hard disagree. This is the first step not the last and proves to other countries that this can be done.

This model is going to start miles behind the frontier and the gap will only grow.

Exactly, if there's one thing transformers are good at it's translation. One I've found particularly nice: any question ChatGPT can answer in English it can answer in French. I'm assuming Norwegian too. So there's no point.

There's quite a bit more to culture and language than just being able to have transformers come up with believable language and/or dialect.

The point is that Norway will have its own LLM, and will not have dependencies on another state or private company. The goal is not to be the best model, but to have a model that includes more Norwegian data than other LLMs and that is not skewed against other sources.

Yes transformers are great at translation as that is their purpose.

LLMs are not great at preserving cultural uniqueness and diversity. Take how “delve” has re-entered the lexicon because the dialect of English spoken by the human assessors for pre-training uses “delve” a lot.

There are a lot of benefits to training specifically for a unique culture with unique norms to preserve the culture as we increasingly rely on LLMs.

https://www.scientificamerican.com/article/chatgpt-is-changi...


They're only good at it because they were trained on massive amounts of English and French data.

Not really true.

Both Claude and ChatGPT can translate into minor dialects of Norwegian they will have seen very few works in because very few printed works exist in them.

E.g. I've tested both my local spoken dialect, which is rarely written, and a sociolect used by a 1970's Maoist group consisting of a few hundred people, where most of the printed material consists of novels from a couple of ex-members that became authors.

In the latter case, it claimed to not know, but was able to get a good match from just a description.

I also just had it ape Norwegian orthography from the 1910's by having it look up the rules and translate a text it had first translated from English to modern Norwegian, and it did just fine.

They will have seen some work in these dialects, but mostly knowledge of related languages transfers really well (English, Dutch, German, Swedish, Danish roughly form a continuum from least in common to most in common with modern Norwegian; they all share vocabulary and significant parts of grammar with Norwegian), and then a relatively limited exposure to Norwegian itself is sufficient to do fairly well.

They're also really good at "style transfer" of text in the form of tweaking orthography, word order, and minor grammar changes from descriptions and examples.

(incidentally, the latter is one way of getting an LLM to sound a lot less like an LLM)


Model can speak Lithuanian too, but with a Russian accent which is a big taboo for us.

>As Husnes put it; Norway is a small country solving a problem every non-English-speaking nation will face: how do you build AI that reflects your language, your culture and your history? AI needs custodians, not just builders.

I'm afraid the answer is, mostly you don't.

Such a thing requires strong political will that, at least in my environment, seems basically impossible to align.

The costs are prohibitive, but beyond that, the type of person who cares about local representation like that is either completely fine with letting foreign companies implement it (after all, you can use ChatGPT in Basque if you want to) or is against the idea of AI altogether.


I guess it's subject to debate whether the cost indeed is prohibitive in the case of Norway. They are a small but extremely wealthy country - after all, they currently hold the equivalent of 1,5% of all the listed companies globally through the investments of their sovereign wealth fund.

I'm sure if Norway approached the American labs with the goal of making curated datasets for training, they would absolutely get in the training door, and those models would likely run circles around anything that could be done domestically.

That being said though, I can feel you cringing through the screen.


>That being said though, I can feel you cringing through the screen.

Then I failed to express myself in writing. I'm definitely a fan of this kind of initiative and am not happy with the type of viability I think they have.

I might very well be projecting a whole lot of local dynamics of national identity, politics and culture though.


This can’t be right. 2 PB of flash is like $200k. It’s within reach of many individuals. Then again I guess you don’t need that much storage so maybe it is.

More like $1M at current prices at this scale / level of performance.

If you go with HDD arrays probably $50k
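
For comparison, here is a rough sketch of the price per terabyte implied by each of the three estimates above. It assumes decimal units (2 PB = 2000 TB), and the labels are only illustrative names for the three figures quoted in this subthread.

```python
# Implied price per terabyte for each estimate quoted above.
# Assumption: decimal units, so 2 PB = 2000 TB.
CAPACITY_TB = 2_000

# Labels are illustrative; the dollar totals are the ones quoted in the thread.
estimates_usd = {
    "flash, first guess": 200_000,
    "flash, current prices": 1_000_000,
    "HDD arrays": 50_000,
}

usd_per_tb = {label: total / CAPACITY_TB for label, total in estimates_usd.items()}

for label, price in usd_per_tb.items():
    print(f"{label}: ${price:.0f}/TB")
```

So the disagreement is roughly $100/TB versus $500/TB for flash, with the HDD guess at about $25/TB.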


Boy pricing is pretty nuts these days. I have half a petabyte in Seagate enterprise drives myself and I didn’t pay anything close to that to acquire it. Such a pity about the flash storage. 2 years ago we built 200 TiB or something of flash using Samsung PM1633 or something and it was a fraction of the cost per gigabyte that $1m would imply.

Also my first thought: "Is that... a lot?"

You can put 6PB (244TB * 24) into a single box these days.


Your numbers are a little off but the point remains- 2PB is nothing, not newsworthy imo. What’s special about this?

What's special about it is not the flash but training an LLM based on the content, much of which is still in copyright and which the library has restrictions on how they are allowed to use (irrespective of the legal position of training on it) and which required an agreement with the copyright holders.

How about that, they actually asked for permission to use data and the companies said yes.

> He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language.

I don’t know this is true. But whatever sounds true enough and gets funding seems to be what flies these days.


They made the cultural case, you have no idea how strong this is in places like quebec, nordics, france, russia etc

Can confirm that. Norway may have a small population, but if you live there you'll think it's truly the center of the world (aside from the US. Norwegians love America)

This is how much storage the average r/datahoarder user has in their basement. Fewer than 100 hard drives.

But not in flash. I have an appreciable fraction of that but in spinning rust.

so now Huawei is not a threat to 'democracy' anymore?

That's about 350MB per capita. Humans can produce 2-6 kB per hour. That's 13 years of non-stop typing. Wonder where it all comes from. I guess it's websites that aren't compressed / extracted.
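
A quick sanity check of that arithmetic, as a sketch under stated assumptions: a population of roughly 5.6 million, decimal units, and a typing rate of 3 kB per hour (picked from inside the 2-6 kB range above). None of these inputs come from the article itself.

```python
# Back-of-the-envelope check of the per-capita figure.
# Assumptions (not from the article): ~5.6 million inhabitants,
# decimal units (1 PB = 10**15 bytes), and 3 kB typed per hour.
STORAGE_BYTES = 2 * 10**15
POPULATION = 5_600_000
TYPING_BYTES_PER_HOUR = 3_000
HOURS_PER_YEAR = 24 * 365

per_capita_bytes = STORAGE_BYTES / POPULATION
years_of_typing = per_capita_bytes / TYPING_BYTES_PER_HOUR / HOURS_PER_YEAR

print(f"{per_capita_bytes / 10**6:.0f} MB per capita")
print(f"{years_of_typing:.1f} years of non-stop typing")
```

With these inputs the result lands close to the figures quoted: a bit over 350 MB per person and somewhere between 13 and 14 years of typing around the clock.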

It's a legal deposit library, same as e.g. Library of Congress. Which means almost every published book, magazine, and newspaper and many other works published in Norway, as well as large collections of Norwegian works published abroad (such as thousands of Norwegian-language newspapers published by the Norwegian immigrant communities in the US) for many decades and a large proportion of the same from the last 200+ years are stored there.

They do also crawl websites (or at least did) in the .no tld.


384 core cpu cluster? 2 petabytes?

Dell just launched a 2U that fits almost 10 petabytes in it. It's probably not 384 core capable but that is very doable right now, Epyc chips are 192 cores each! https://www.techradar.com/pro/dell-launches-record-shatterin...


5x 400gbit running to a 2U box whoa, the PCI lanes must have heat shielding.

More seriously there is a sensibility limit on extreme density where it's not needed. The idea that you're just going to magically get 2 TBit/s out of those ports seems unlikely even with tweaked software, and you're stuck with a power and comms hotspot that's liable to dictate the remainder of your network design.

At max utilisation that 2U would take 12 hours to drain, and only 12 hours assuming peak and likely unachievable throughput and the box otherwise being completely out of service. Not a great start
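
The drain-time estimate can be checked with a few lines, as a rough sketch. It assumes the "almost 10 petabytes" box holds a full 10 PB (decimal), that all five 400 Gbit/s ports are saturated, and that there is no protocol overhead, which is the unachievable best case described above.

```python
# Time to read the whole 2U box once over its network ports.
# Assumptions: 10 PB capacity (decimal), 5 ports at 400 Gbit/s each,
# fully saturated, zero protocol overhead.
CAPACITY_BYTES = 10 * 10**15
PORTS = 5
PORT_GBIT_PER_S = 400

# Convert aggregate line rate from bits per second to bytes per second.
throughput_bytes_per_s = PORTS * PORT_GBIT_PER_S * 10**9 / 8
drain_hours = CAPACITY_BYTES / throughput_bytes_per_s / 3600

print(f"{drain_hours:.1f} hours to read the whole box once")
```

That comes out at roughly 11 hours, in line with the 12-hour figure above.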


That's the in-house preprocessing hardware, not what they're training on.

Yes!

It's still a weird article, to highlight a "big" storage appliance. Having all that NVMe local feels like it would be much much much much faster.


2 PB? They will not come close to training on that amount. Maybe years from now.

Think they will not train on the full 2 PB but use that as the data lake to start and then apply a more targeted approach.

if you read the article 2pb is available as flash storage in the data pipeline, used to dedupe, clean, normalize, etc, for training from 60pb of raw data.

Could probably LoRA with that

Ad for Huawei?

Even entire governments are captured by a mild LLM psychosis. Which is sad in the case of Norway. I lived in Norway for two years and always found their government to be highly rational, this is not a rational use of public funds (but I suppose they have plenty of capital).

Western society is completely captured by this form of psychosis and it's going to bite us in the a* very soon.

I firmly believe all the Boomer leaders throughout the world are being sold a bag of lies by technocrats that "AI", specifically LLMs, are going to cure disease and death, and therefore they are willing to hand over all control to the technocrats. Fckin croakers at it again.


Ehhh. None of this sounds right. Translation problems maybe. Lack of technical detail understanding maybe... I don't know. Probably not news.


