
TBH, the 2x-4x improvement over a naive implementation that they're bragging about sounded kinda pathetic to me! I mean, it depends greatly on the kernel itself and the target arch, but I'm also assuming that the 2x-4x number is their best case scenario. Whereas the best case for hand-optimized could be in the tens or even hundreds of X.


I'm a bit confused. It sounds like you are disagreeing ("TBH") but the content seems like a summary of my comment. So, I agree.

Fwiw, they did say they got up to a 20x improvement, but given the issues we both mention that may not be surprising; by their own admission it seems to be an outlier.


Absolutely. It really depends on the kernel type, target architecture, and what you're optimizing for. The 2x-4x isn't the limit; it's just what users often see out of the box. We do real-time profiling on actual GPUs, so you get results based on real performance on a specific arch, not guesses. When the baseline is rough, we've seen well over 10x.
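
To make the measurement concrete, here's a rough sketch of the kind of timing comparison involved; the two kernels below are generic stand-ins and plain PyTorch timing, not our actual tooling or kernels:

    import torch

    # Illustrative only: time a "naive" kernel against an "optimized" one on the
    # actual target GPU. Both kernels here are stand-ins, not vendor code.
    def bench(fn, *args, iters=100):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        for _ in range(10):                  # warm-up (also triggers compilation)
            fn(*args)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            fn(*args)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters   # milliseconds per call

    x = torch.randn(4096, 4096, device="cuda")
    naive = lambda t: t.softmax(-1) @ t          # stand-in baseline kernel
    fused = torch.compile(naive)                 # stand-in "optimized" version
    print(f"speedup: {bench(naive, x) / bench(fused, x):.2f}x")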


This is really incredible, thank you!


SambaNova's RDU is a dataflow processor being used for ML/AI workloads! It's amazing and actually works.


Pretty amazing speed, especially considering this is bf16. But how many racks is this using? They used 4 racks for 70B, so this, what, at least 24? A whole data center for one model?!


Each Cerebras wafer-scale chip has 44 GB of SRAM. You need roughly 972 GB of memory to run Llama 405B at fp16, so you'd need at least 23 of these.

I assume they're using SRAM only to achieve this speed and not HBM.
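
Rough arithmetic with those figures (the 972 GB presumably covers more than the raw fp16 weights, which would be about 810 GB on their own):

    import math

    # Back-of-the-envelope chip count, using the figures quoted above.
    params_b = 405                 # Llama 3.1 405B
    weights_gb = params_b * 2      # ~810 GB of raw weights at 2 bytes/param (fp16/bf16)
    total_gb = 972                 # quoted total (weights plus activations/overhead, presumably)
    sram_per_chip_gb = 44          # SRAM per wafer-scale engine

    print(f"weights alone: ~{weights_gb} GB")
    print(f"chips needed:  {math.ceil(total_gb / sram_per_chip_gb)}")   # ceil(972/44) = 23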


the title says "Cerebras Trains Llama Models"...


That's correct, and if you read the whole thing you will realize that it is followed by "... to leap over GPUs", which indicates that they're not literally referring to optimizing the weights of the graph on a new architecture, or to freshly initialized variables on an existing one.


This is as clickbaity as it gets.

"Trains" has no other sensible interpretation in the context of LLMs. My impression was that they had trained the models to be better than the ones trained on GPUs, presumably because they trained faster and could train for longer than Meta, but that interpretation was far from what the article actually covers.

Also interesting to see the omission of DeepInfra from the price table, presumably because it would be cheaper than Cerebras, though I didn't even bother to check at that point because I hate these cheap clickbaity pieces that attempt to enrich some player at the cost of everyone's time or money.

Good luck with their IPO. We need competition, but we don't need confusion.


What are you confused about? Their value proposition is very simple and obvious: custom hardware with a compiler that transforms existing graphs into a format that runs at lower cost and higher efficiency, because it utilizes a special instruction set only available on Cerebras silicon.


The title is clickbait but that's how marketing works, whether we like it or not. The achievement is real - Cerebras improved their software and the inference is much faster as a result. I find it easy to forgive annoying marketing tactics when they're being used to promote something cool.


It is textbook bait-and-switch. If the achievement is important, use the correct title. An advance in actual training performance, or a better model, is very important and interests a different set of people with deeper pockets than those who care about inference.


They said in the announcement that they've implemented speculative decoding, so that might have a lot to do with it.

A big question is what they're using as their draft model; there are ways to do it losslessly, but they could also choose to trade off accuracy for a bigger increase in speed.

It seems they also support only a very short sequence length (1k tokens).


Speculative decoding does not trade off accuracy. You reject the speculated tokens if the original model does not accept them, kind of like branch prediction. All these providers and third parties benchmark each other's solutions, so if there is a drop in accuracy, someone will report it. Their sequence length is 8k.
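
For anyone curious why it's lossless, here is a minimal greedy-decoding sketch of the accept/reject loop. Real systems use probabilistic acceptance and verify all draft positions in a single batched forward pass, and this is not Cerebras's implementation; draft_model and target_model are assumed callables that return the argmax next token for a given prefix:

    # Minimal greedy speculative-decoding sketch (illustrative only).
    def speculative_step(prefix, draft_model, target_model, k=4):
        # 1) The cheap draft model proposes k tokens autoregressively.
        ctx = list(prefix)
        proposed = []
        for _ in range(k):
            tok = draft_model(ctx)
            proposed.append(tok)
            ctx.append(tok)

        # 2) The target model verifies each position (one parallel pass in practice).
        ctx = list(prefix)
        accepted = []
        for tok in proposed:
            target_tok = target_model(ctx)
            if target_tok == tok:              # target agrees: keep the drafted token
                accepted.append(tok)
                ctx.append(tok)
            else:                              # first mismatch: emit the target's own token
                accepted.append(target_tok)    # and stop, so the output is exactly what the
                break                          # target model alone would have produced
        else:
            accepted.append(target_model(ctx)) # all k drafts accepted: one extra "free" token

        return list(prefix) + accepted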


Simply increasing processing power for the AI isn't enough. Gameplay mechanics are intimately related to the capabilities of the AI.

For example, when they redesigned combat around the 1-Unit-Per-Tile (1UPT) mechanic for Civ 5, this crippled the ability of the AI to wage war. That's because even if a high-difficulty AI could out-produce the player in terms of military, it was logistics-limited in its ability to get those units to the front because of 1UPT. That means that the AI can't threaten a player militarily, and thus loses its main lever in terms of its ability to be "difficult."

Contrast this to Civ 4, where high-difficulty AIs were capable of completely overwhelming a player that didn't take them seriously. You couldn't just sit there and tech-up and use a small number of advanced units to fend off an invasion from a much larger and more aggressive neighbor. This was especially the case if you played against advanced fan-created AIs.

I'm hoping they get rid of 1UPT completely for Civ 7, but I have a feeling that it is unlikely, because casual players (the majority of Civ's buyers) actually like that 1UPT effectively removes tactical combat from the game.


1UPT added tactical combat to the game. Before Civ 5, the lowest level of warfare was operational. If you got your units close to the enemy, they were in position to fight. You didn't have to worry much about battlefield formations, terrain, coordinating the actions of different units, and so on.

This addition of tactical combat crippled the AI, because it doesn't understand the situation on the battlefield, and it's not good at making and adjusting plans.


The combat system in Civ4 was deeper than you think. Stack composition, terrain, and positioning are crucial in MP games. This write-up [1] from a famous MP game where a 3v1 invasion was repelled by superior play shows how good the system was.

[1] https://sullla.com/Civ4/RBPB2-5.html


It had depth, but no tactical combat. A stack is an operational unit. Tactics deals with what the individual units within the stack do once the fighting starts. Civ 7 is supposed to introduce commanders, which are effectively stacks for moving troops combined with more specialized great generals. You can get the troops more easily to the battlefield, but individual units still need to occupy separate tiles in the battle.


I am not sure I buy this reasoning. While a one-tile doom-stack army is much easier to create, I find it hard to imagine a major AAA game dev that has been making the same game forever being unable to create a proper AI that handles strategy and tactics with multi-tile armies.

There are plenty of small games that handle complex army fights with plenty of units, choke points, and both strategic and tactical views. Especially since the unit roster in Civ games is quite limited in comparison to other strategy games.


How are you verifying accuracy for your JAX port of Llama 3.1?

IMHO, the main reason to use PyTorch is actually that the original model used PyTorch. What can seem to be identical logic between different model versions may actually cause model drift when infinitesimal floating-point errors accumulate due to the huge scale of the data. My experience is that debugging an accuracy mismatch like this in a big model is a torturous ordeal beyond the 10th circle of hell.
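
For what it's worth, a first-line sanity check is usually just running both implementations on identical inputs and comparing logits within a tolerance. A hypothetical sketch (the two *_logits_fn callables are placeholders, not anything from the post):

    import numpy as np

    # Hypothetical parity check between an original PyTorch model and a JAX port.
    # torch_logits_fn / jax_logits_fn are placeholder callables that run the
    # respective model and return logits as arrays; they are not real APIs.
    def check_parity(torch_logits_fn, jax_logits_fn, token_batches,
                     rtol=1e-3, atol=1e-5):
        worst = 0.0
        for tokens in token_batches:
            ref = np.asarray(torch_logits_fn(tokens), dtype=np.float32)
            out = np.asarray(jax_logits_fn(tokens), dtype=np.float32)
            worst = max(worst, float(np.max(np.abs(ref - out))))
            # For greedy decoding, argmax agreement matters more than exact values.
            if not np.array_equal(ref.argmax(-1), out.argmax(-1)):
                print(f"argmax mismatch on a batch of {len(tokens)} tokens")
            if not np.allclose(ref, out, rtol=rtol, atol=atol):
                print("logit drift exceeds tolerance on this batch")
        print(f"worst absolute logit difference: {worst:.3e}")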


Good question. We used a new AI+math-based testing tool (benchify.com) to run comparison tests, but we are working on building more robust infrastructure for this. Translating models from PyTorch to JAX is core to our strategy.

That said, this path is not uncommon (translating from one framework to another). HuggingFace translates Google's Gemma family models from JAX to PyTorch, and a ton of people use it.


When you say "model versions", do you mean different quantizations of the model? Then it's not floating point errors that accumulate. Different quantizations of the model are different models. People will call such a model something like Meta-Llama-3.1-8B-Instruct--q4_0, claiming that it's just a "version" of Meta-Llama-3.1-8B-Instruct, but that's just a lie. It's not the same model, and you should not expect the same results. There is no reason to debug the differences: what exactly would you expect to find, and what action would you take once you found what you are looking for? However, is the quantized version still a useful LLM? Absolutely. Most people don't have an A100 to run the original model, so a quantized version is better than nothing.
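
A toy illustration of the point, using naive symmetric 4-bit rounding of a single weight matrix (llama.cpp's q4_0 is block-wise with per-block scales, so this is cruder, but the conclusion is the same):

    import numpy as np

    # Naive symmetric 4-bit quantization of one weight matrix, just to show
    # that rounding yields a measurably different function, i.e. a different model.
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)

    scale = np.abs(w).max() / 7.0                  # map into the signed 4-bit range
    w_q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    w_deq = w_q.astype(np.float32) * scale         # what inference actually uses

    x = rng.normal(size=(1, 4096)).astype(np.float32)
    y_full, y_quant = x @ w.T, x @ w_deq.T
    rel_err = np.linalg.norm(y_full - y_quant) / np.linalg.norm(y_full)
    print(f"relative output error from 4-bit rounding: {rel_err:.2%}")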


Very fascinating, can you explain more about a time when this happened?

Like, what area was affected by the floating-point errors, why were they introduced (was it a refactoring of the PyTorch code?), and how was this determined to be the cause?


Looks like some kind of power play...

Originally discussed here: https://news.ycombinator.com/item?id=41234180


My experience is that discussions of bylaw changes tend to get heated, and that trying to change your bylaws is like Russian roulette: maybe 1/6 of the time there is some disaster that is either the end of the organization or results in a major loss of members.


Is there a demo of a model visualized using this somewhere? Even if it's just a short video... it's hard to tell what it's like from screenshots.

