
They're probably taking shortcuts, such as exploiting sparsity. Various tricks like that are described in papers, although the big companies are getting increasingly secretive about how their models work, so you won't necessarily find proof.

The latest DeepSeek model has sparse attention, though sparse attention is still not linear. Close enough, perhaps.
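
For intuition, here's a toy NumPy sketch of top-k sparse attention (a hypothetical illustration, not DeepSeek's actual kernel; the function name and parameters are made up). Each query still scores every key in the selection pass, which is why this kind of sparsity isn't truly linear, but the softmax and value mixing only run over the top_k keys per query. Real systems use a much cheaper scorer for the selection step; this toy version just reuses the full-precision scores.

    import numpy as np

    def sparse_topk_attention(q, k, v, top_k=64):
        # Toy top-k sparse attention (sketch only, not a production kernel).
        # The selection pass still scores every (query, key) pair, so it is
        # still O(n^2); only the softmax + value mixing are restricted to the
        # top_k keys per query, shrinking that part of the cost from
        # O(n^2 * d) toward O(n * top_k * d). Causal masking is ignored here.
        n, d = q.shape
        top_k = min(top_k, n)
        scores = q @ k.T / np.sqrt(d)      # full O(n^2) scoring pass
        out = np.zeros_like(v)
        for i in range(n):
            idx = np.argpartition(scores[i], -top_k)[-top_k:]  # keep top_k keys
            s = scores[i, idx]
            w = np.exp(s - s.max())
            w /= w.sum()
            out[i] = w @ v[idx]
        return out

    # Usage: n = 512 tokens, d = 64 dims per head
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((512, 64)) for _ in range(3))
    print(sparse_topk_attention(q, k, v, top_k=32).shape)  # (512, 64)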


