> it doesn’t offer any advantages over 3- or 4-bit quantization.
"zero-shot accuracy retention at 4- and 3-bit compression to be on par with or better than state-of-the-art methods, while maintaining performance comparable to FP16 baselines."
My reading of that is FP16 accuracy at Q3 or Q4 size and memory bandwidth, which is a huge advantage.
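To make the size advantage concrete, here's a back-of-envelope sketch of weight memory at different bit widths. This assumes a dense model where the weights dominate the footprint, and it ignores KV cache, activations, and any per-group quantization metadata (scales, zero-points), so real numbers will be somewhat higher:

```python
# Rough weight-memory estimate for a dense LLM at different precisions.
# Ignores KV cache, activations, and quantization metadata overhead.

def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for n_params weights at the given bit width, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16), ("Q4", 4), ("Q3", 3)]:
    print(f"8B model at {label}: {weight_memory_gb(8e9, bits):.1f} GB")
# FP16: 16.0 GB, Q4: 4.0 GB, Q3: 3.0 GB
```

Since decode-time inference is typically memory-bandwidth bound, a roughly 4-5x smaller weight footprint translates fairly directly into faster token generation as well as fitting on smaller GPUs.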
I don't see any comparable numbers on the page you linked; it only seems to have numbers for 1B and 3B parameter models. The comparisons to AWQ and OmniQuant in Table 3 look quite favorable, though, with SeedLM showing 10%-50% better performance.
Also seems like the techniques may be possible to combine.
As a rule of thumb, the bigger the model, the more gracefully it degrades under quantisation. So you may assume the performance loss for an 8B model would be lower than for a 3B model. (I know that doesn't make up for the missing numbers in the link, just FYI.)
"zero-shot accuracy retention at 4- and 3-bit compression to be on par with or better than state-of-the-art methods, while maintaining performance comparable to FP16 baselines."
My reading of that says FP16 accuracy at Q3 or Q4 size / memory bandwidth. Which is a huge advantage.