Hacker News

I still don't get the reinforcement part here. Wouldn't that just be normal training against the dataset? For example, how would you modify standard MNIST training to make it reinforcement learning?


Not an expert, but yes: what would usually just be called training is called RL here, in the LLM context. You do end up writing a sort of reward function, so I guess it is RL.
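
To make the MNIST question concrete, here is a minimal sketch, not anyone's actual training code. It uses a tiny synthetic 3-class dataset as a stand-in for MNIST, and every name in it (make_toy_data, reward_fn, train) is made up for illustration. In "supervised" mode the label feeds the cross-entropy gradient directly. In "reinforce" mode the model samples a guess, and the label is only seen through the reward function (a plain REINFORCE update with no baseline).

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_toy_data(n, rng, num_classes=3, noise=0.1):
    """Hypothetical stand-in for MNIST: each class is a noisy one-hot vector."""
    data = []
    for _ in range(n):
        y = rng.randrange(num_classes)
        x = [(1.0 if i == y else 0.0) + rng.gauss(0.0, noise)
             for i in range(num_classes)]
        data.append((x, y))
    return data


def reward_fn(action, label):
    """The 'sort of reward function': 1 for a correct guess, 0 otherwise."""
    return 1.0 if action == label else 0.0


def logits_for(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def train(data, mode, rng, num_classes=3, lr=0.5, epochs=5):
    dim = len(data[0][0])
    W = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        for x, y in data:
            probs = softmax(logits_for(W, x))
            if mode == "supervised":
                # Cross-entropy: push probability toward the known label.
                target, scale = y, 1.0
            else:
                # REINFORCE: sample a guess, then scale the log-prob
                # gradient of that guess by the reward it earned.
                target = rng.choices(range(num_classes), weights=probs)[0]
                scale = reward_fn(target, y)
            for k in range(num_classes):
                g = scale * ((1.0 if k == target else 0.0) - probs[k])
                for i in range(dim):
                    W[k][i] += lr * g * x[i]
    return W


def accuracy(W, data):
    correct = 0
    for x, y in data:
        logits = logits_for(W, x)
        correct += logits.index(max(logits)) == y
    return correct / len(data)
```

With this particular 0/1 reward, the RL update is the supervised update applied only on steps where the sampled guess happened to be right, which is why the two look so similar for plain classification. The distinction matters more when the model's output is a long sampled sequence and only the end result can be scored.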


You are right; the advances in DeepSeek-R1 used RL almost solely because of the chain-of-thought sequences the model was generating and then being trained on.
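
A rough sketch of what that can look like for sampled chain-of-thought text: the model writes out its reasoning, and a reward function scores only the final answer. The "Answer:" marker and the function names below are assumptions for illustration, not the actual DeepSeek-R1 setup.

```python
import re


def extract_final_answer(completion):
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def outcome_reward(completion, reference):
    """1.0 if the final answer matches the reference, else 0.0.

    The reasoning text before the marker is not graded at all.
    """
    return 1.0 if extract_final_answer(completion) == reference else 0.0


# Hypothetical sampled completions for the prompt "What is 17 + 25?"
samples = [
    "17 + 20 = 37, then 37 + 5 = 42. Answer: 42",
    "17 + 25 is about 40. Answer: 40",
    "I think it is 42",
]
rewards = [outcome_reward(s, "42") for s in samples]
```

There is no labelled "correct chain of thought" to train against here, only a score for each sampled sequence, which is the part that makes it RL rather than ordinary supervised training.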



