
Not really; it looks like a ~40B-class model, which is very runnable.


It's actually ~13B-class at runtime. The ~2B of attention parameters is shared across all the experts, and it runs 2 experts at a time.

So 2B for attention + 5B x 2 for the active experts = 12B in use at runtime.
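
Back-of-the-envelope in Python, as a rough sketch using the figures quoted above (~2B shared attention, ~5B per expert, 8 experts, top-2 routing); the real Mixtral numbers are a bit higher, roughly 46.7B total / 12.9B active:

  # Rough MoE parameter math using the approximate figures from this thread.
  def moe_params(shared_b=2.0, expert_b=5.0, n_experts=8, active_experts=2):
      total = shared_b + expert_b * n_experts        # must be stored in memory
      active = shared_b + expert_b * active_experts  # actually used per token
      return total, active

  total, active = moe_params()
  print(f"~{total:.0f}B stored, ~{active:.0f}B active per token")
  # -> ~42B stored, ~12B active per token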


Yeah. I just mean in terms of VRAM usage.


Yes, that's what I mean as well.

It's between a 7B and a 13B model in terms of VRAM usage, and closer to a 70B in terms of performance.

Tim Dettmers (the QLoRA creator) released code to run Mixtral 8x7B in 4GB of VRAM, yet it still benchmarks better than Llama-2 70B.
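
For anyone who wants to try it locally, here's a minimal sketch of loading Mixtral in 4-bit with Hugging Face transformers + bitsandbytes. To be clear, this is not Dettmers' offloading code: plain 4-bit quantization on its own still needs on the order of 20-25GB of VRAM, and the ~4GB figure comes from additionally offloading inactive experts. The model id below is the standard Mixtral repo, assumed here for illustration.

  # Minimal sketch: plain 4-bit quantization of Mixtral via bitsandbytes.
  # NOT Dettmers' expert-offloading code; on its own this still needs
  # roughly 20-25GB of VRAM.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

  model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_compute_dtype=torch.bfloat16,
  )

  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(
      model_id,
      quantization_config=bnb_config,
      device_map="auto",  # lets accelerate spill layers to CPU if VRAM runs short
  )

  prompt = "Mixture-of-experts models save compute because"
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
  out = model.generate(**inputs, max_new_tokens=50)
  print(tokenizer.decode(out[0], skip_special_tokens=True))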



