
The custom CUDA kernel for 4-in-8 seems to have come out better than a naive approach (such as just treating each value as an fp8/int8), and it lowers memory-bandwidth usage. Custom hardware would certainly make that improvement even better, but I don't think that's what's limiting training to 2-8 billion parameters so much as research convenience while the groundwork for this type of model is still being figured out.
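For anyone curious what an in-kernel unpack might look like: this is a minimal sketch (not from the actual codebase), assuming "4-in-8" means two 4-bit signed weights packed per byte. Reading one byte per two weights is what halves the memory traffic versus storing each value as int8/fp8. All names, the packing order (low nibble first), and the per-tensor scale are hypothetical.

  // Hypothetical sketch: dequantize two 4-bit signed values per packed byte.
  #include <cstdint>
  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void unpack_int4_kernel(const uint8_t* packed, float* out,
                                     float scale, int n_packed) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n_packed) return;
      uint8_t byte = packed[i];
      // Sign-extend each 4-bit nibble to int (range -8..7).
      int lo = (int)(byte & 0x0F) - ((byte & 0x08) ? 16 : 0);
      int hi = (int)(byte >> 4)   - ((byte & 0x80) ? 16 : 0);
      out[2 * i]     = lo * scale;   // low nibble packed first (assumption)
      out[2 * i + 1] = hi * scale;
  }

  int main() {
      const int n_packed = 4;  // 4 bytes -> 8 unpacked weights
      uint8_t h_packed[n_packed] = {0x21, 0xF0, 0x8F, 0x7E};
      uint8_t* d_packed; float* d_out;
      cudaMalloc(&d_packed, n_packed);
      cudaMalloc(&d_out, 2 * n_packed * sizeof(float));
      cudaMemcpy(d_packed, h_packed, n_packed, cudaMemcpyHostToDevice);
      unpack_int4_kernel<<<1, 32>>>(d_packed, d_out, 0.5f, n_packed);
      float h_out[2 * n_packed];
      cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
      for (float v : h_out) printf("%g ", v);
      printf("\n");
      cudaFree(d_packed); cudaFree(d_out);
      return 0;
  }

In a real matmul kernel the unpack would happen in registers right before the multiply-accumulate rather than writing floats back to global memory, which is presumably where the win over the naive int8 path comes from.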

