Running humongous models for the price of a small car? Yes, it's absolutely affordable. It's peanuts for everyone except the smallest, self-bootstrapped startups. Amortized, it's way less than the expense of the data scientists and developers who can actually make full use of the cards.
> Concretely, QMoE can compress the 1.6 trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x compression, 0.8 bits per parameter) at only minor accuracy loss, in less than a day on a single GPU.
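For scale, a quick back-of-the-envelope check of those figures against a 16-bit baseline (my own arithmetic, not from the paper):

```python
params = 1.6e12                           # SwitchTransformer-c2048 parameter count

compressed_gb = params * 0.8 / 8 / 1e9    # 0.8 bits per parameter -> bytes -> GB
baseline_tb = params * 16 / 8 / 1e12      # 16-bit (bf16/fp16) weights -> bytes -> TB

print(f"compressed: ~{compressed_gb:.0f} GB")      # ~160 GB
print(f"16-bit baseline: ~{baseline_tb:.1f} TB")   # ~3.2 TB
print(f"ratio: ~{16 / 0.8:.0f}x")                  # ~20x
```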
I'm not in the field. Can someone explain how the sub-1-bit part works--are they also reducing the number of parameters as part of the compression?
It takes a 2-bit/1.5-bit quantized model, groups parameters together, and then exploits the low entropy of the quantized values to compress them further, a bit like text compression. It only went below 1 bit for the ultra-large model; I guess the smaller ones' weights weren't quite as redundant.
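A toy sketch of that idea (not the paper's actual dictionary coder; the ternary values, the 90%-zeros distribution, and the use of zlib are all illustrative assumptions):

```python
import numpy as np
import zlib

rng = np.random.default_rng(0)

# Stand-in for one expert's weights after ternary (~1.6-bit) quantization:
# assume most quantized values come out zero, so the symbol stream has low entropy.
n = 1_000_000
weights = rng.choice([-1, 0, 1], size=n, p=[0.05, 0.90, 0.05]).astype(np.int8)

# Shannon entropy of that distribution = the theoretical floor in bits per parameter.
p = np.array([0.05, 0.90, 0.05])
entropy = -(p * np.log2(p)).sum()

# A generic byte-level compressor stands in for QMoE's dictionary-based coding;
# it exploits the same redundancy, just less efficiently.
compressed = zlib.compress(weights.tobytes(), level=9)
bits_per_param = len(compressed) * 8 / n

print(f"entropy floor: {entropy:.2f} bits/param")         # ~0.57 for this distribution
print(f"zlib achieves: {bits_per_param:.2f} bits/param")  # well under naive 2-bit packing
```

The point is just that once the quantized values are this skewed, an entropy or dictionary coder can push storage below the nominal bit width without dropping any parameters.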
It'll be interesting to see if it works on the new Mistral MoE model, which is less sparse and probably trained on more tokens per parameter than these.
I need to seriously revise my definition of affordable commodity hardware