
Not really; it looks like a ~40B-class model, which is very runnable.


It's actually ~13B-class at runtime. The ~2B of attention parameters is shared across all the experts, and it runs 2 experts at a time.

So 2B for attention + 5B x 2 for the active experts = 12B in use at runtime.
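
Back-of-the-envelope in Python, as a rough sketch using the figures quoted above (~2B shared attention, ~5B per expert, 8 experts, top-2 routing); the real Mixtral numbers are a bit higher, roughly 46.7B total / 12.9B active:

  # Rough MoE parameter math using the approximate figures from this thread.
  def moe_params(shared_b=2.0, expert_b=5.0, n_experts=8, active_experts=2):
      total = shared_b + expert_b * n_experts        # must be stored in memory
      active = shared_b + expert_b * active_experts  # actually used per token
      return total, active

  total, active = moe_params()
  print(f"~{total:.0f}B stored, ~{active:.0f}B active per token")
  # -> ~42B stored, ~12B active per token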


Yeah. I just mean in terms of VRAM usage.


Yes, that's what I mean as well.

It's between a 7B and a 13B model in terms of VRAM usage, and closer to a 70B in terms of performance.

Tim Dettmers (the QLoRA creator) released code to run Mixtral 8x7B in 4GB of VRAM, yet it still benchmarks better than Llama-2 70B.
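
For anyone who wants to try it locally, here's a minimal sketch of loading Mixtral in 4-bit with Hugging Face transformers + bitsandbytes. To be clear, this is not Dettmers' offloading code: plain 4-bit quantization on its own still needs on the order of 20-25GB of VRAM, and the ~4GB figure comes from additionally offloading inactive experts. The model id below is the standard Mixtral repo, assumed here for illustration.

  # Minimal sketch: plain 4-bit quantization of Mixtral via bitsandbytes.
  # NOT Dettmers' expert-offloading code; on its own this still needs
  # roughly 20-25GB of VRAM.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

  model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_compute_dtype=torch.bfloat16,
  )

  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(
      model_id,
      quantization_config=bnb_config,
      device_map="auto",  # lets accelerate spill layers to CPU if VRAM runs short
  )

  prompt = "Mixture-of-experts models save compute because"
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
  out = model.generate(**inputs, max_new_tokens=50)
  print(tokenizer.decode(out[0], skip_special_tokens=True))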



