I still don't get the reinforcement part here. Wouldn't that be normal training ... | Hacker News

Hacker Newsnew | past | comments | ask | show | jobs | submit

		singularity2001 11 months ago \| parent \| context \| favorite \| on: Why LLMs still have problems with OCR I still don't get the reinforcement part here. Wouldn't that be normal training against the data set? Like how would you modify the normal MNIST training to be reinforcement learning

barrenko 11 months ago | [–]

not an expert - yes, what would usually just be called training, with LLMs here is called RL. You do end up writing a sort of a reward function, so I guess it is RL.

hodapp 11 months ago | [–]

You are right; the advanced in DeepSeek-R1 used RL almost solely because of the chain-of-thought sequences they were generating and training it on.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact