I don't think that's the case; for full speed you still need all the experts resident, which works out to roughly (5B*8)/2 + 2 plus a few B of overhead.
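In case the arithmetic is unclear, here it is written out as a quick Python sketch. It reads the "/2" as roughly half a byte per parameter (i.e. ~4-bit quantized weights), and uses the same 5B-per-expert / 2B-shared figures mentioned later in this thread, so treat the numbers as illustrative rather than measured.

```python
# Rough RAM estimate if every expert has to stay resident (illustrative numbers).
# Assumes ~5B parameters per expert, 8 experts, ~2B shared parameters,
# and ~0.5 bytes per parameter (roughly 4-bit quantized weights).
num_experts = 8
params_per_expert_b = 5.0   # billions of parameters per expert
shared_params_b = 2.0       # billions of shared parameters (attention, router, embeddings)
bytes_per_param = 0.5       # ~4-bit quantization

weights_gb = (num_experts * params_per_expert_b + shared_params_b) * bytes_per_param
print(f"weights alone: ~{weights_gb:.0f} GB")  # ~21 GB
print("plus a few GB for KV cache and runtime overhead")
```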
I think the experts are chosen per-token? That means that yes, technically you only need two experts (plus the router/overhead) in VRAM per token, but you'll constantly be loading in different experts unless you can fit them all, which would still be terrible for performance.
So you'll still be limited by PCIe/RAM speed unless you can fit all of the experts into memory (or get really lucky and only ever need the same two experts).
No, it doesn't work that way. Experts can change per token, so for interactive speeds you need all of them in memory, unless you want to wait for model swaps between tokens.
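To make the "experts change per token" point concrete, here's a minimal top-2 routing sketch. Everything in it (the gating math, `num_experts`, the random weights) is illustrative and not taken from any particular model's code; the point is just that each token's router scores can pick a different pair of experts, so all of them need to be loaded.

```python
import numpy as np

# Minimal sketch of per-token top-2 routing in an MoE layer (illustrative only).
rng = np.random.default_rng(0)
num_experts, hidden = 8, 16
router_w = rng.normal(size=(hidden, num_experts))

def route(token_hidden_state):
    logits = token_hidden_state @ router_w   # router scores for this one token
    top2 = np.argsort(logits)[-2:]           # the 2 experts this token is sent to
    return sorted(top2.tolist())

# Different tokens routinely pick different expert pairs, so over a whole
# sequence every expert tends to get touched -- they all have to be in memory.
tokens = rng.normal(size=(5, hidden))
for i, tok in enumerate(tokens):
    print(f"token {i}: experts {route(tok)}")
```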
2B for the attention layers and 5B from each of the 2 active experts.
It should be able to run slightly faster than a 13B dense model, in as little as 16 GB of RAM with room to spare.
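For what it's worth, here's that per-token estimate written out, using the same 2B + 2×5B figures as above (purely illustrative). Note it counts the parameters touched per token, which is what governs speed; the memory footprint question the other replies raise is separate.

```python
# Parameters actually used per token vs. a dense model (illustrative numbers).
shared_params_b = 2.0      # billions: attention/router weights used by every token
active_experts = 2         # experts the router picks per token
params_per_expert_b = 5.0  # billions per expert

active_b = shared_params_b + active_experts * params_per_expert_b
print(f"~{active_b:.0f}B parameters touched per token (vs. 13B for a dense model)")
# -> ~12B, so per-token compute is in the same ballpark as a 13B dense model,
#    but the weights for all 8 experts still need to live in memory somewhere.
```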