I think the current approach — train 7b models and then do MoE on them — is the ...

sbierwagen · on Dec 9, 2023

My years-old M1 macbook with 16GB of ram runs them just fine. Several Geforce 40-series cards have at least 16GB of vram. Macbook pros go up to 128GB of ram and the mac studio goes up to 192GB. Running regular CPU inference on lots of system ram is cheap-ish and not intolerably slow.

These aren't totally common configurations, but they're not totally out of reach like buying an H100 for personal use.

behnamoh · on Dec 9, 2023

1. I wouldn't consider Mac Studio ($7,000) a customer product.

2. Yes, and my MBP M1 Pro can run quantized 34b models. My point was that when you do MoE, memory requirements suddenly become too challenging. A 7b Q8 is roughly 7GB (7b parameters × 8 bits each). But 8x of that would be 56GB, and all of that must be in memory to run.

jart · on Dec 10, 2023

Why? $7k for a Mac Studio isn't much if we consider the original IBM PC adjusted for inflation cost $6k.

stolsvik · on Dec 9, 2023

I have no formal credentials to say this, but intuitively I feel this is obviously wrong. You couldn’t have taken 50 rats brains and “mixed” them and expected the result to produce new science.

For some uninteresting regurgitation, sure. But size - width and depth - seems like an important piece for ability to extract deep understanding of the universe.

Also, MoE, as I understand it, will inherently not be able to glean insight into, nor reason about, and certainly not be able to come up with novel understanding, for cross-expert areas.

I believe size matters, a lot.

brucethemoose2 · on Dec 9, 2023

The MOE models are essentially trained as a single model. Its not 7 independent models, individually (AFAIK) they are all totally useless without each other.

Its just that each bit picks up different "parts" of the training more strongly, which can be selectively picked at runtime. This is actually kinda analogous to animals, which dont fire every single neuron so frequently like monolithic models do.

The tradeoff, at equivalent quality, is essentially increased VRAM usage for faster, more splittable inference and training, though the exact balance of this tradeoff is an excellent question.