
bick_nyers on Dec 10, 2024 | on: Training LLMs to Reason in a Continuous Latent Spa...

I wonder if you would want to use an earlier layer as opposed to the penultimate layer. I would imagine that the LLM uses that layer to "prepare" for the final dimensionality reduction, cleaning up the signal so that it scores well on the loss function.
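To make the suggestion concrete, here is a rough toy sketch in plain Python, not anything from the paper. It assumes "penultimate layer" means the last hidden state before the output projection, and it uses a made-up stack of random residual layers. All names (`forward_collect`, `pick_feedback_state`, `DIM`, `NUM_LAYERS`) are hypothetical. The only point is that if you keep every layer's hidden state, which one you feed back as the continuous "thought" becomes a single index you can vary.

```python
import math
import random


def make_layer(dim, rng):
    # Hypothetical stand-in for a transformer block: one random dim x dim matrix.
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def apply_layer(weights, x):
    # Residual update: x + tanh(W x), computed one output coordinate at a time.
    return [
        xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
        for xi, row in zip(x, weights)
    ]


def forward_collect(layers, x):
    # Run the stack and keep every intermediate hidden state.
    # states[0] is the input; states[i] is the output of layer i.
    states = [x]
    for weights in layers:
        x = apply_layer(weights, x)
        states.append(x)
    return states


def pick_feedback_state(states, layer_index=-1):
    # layer_index=-1 takes the last hidden state (assumed here to be what the
    # comment calls the penultimate layer); a smaller index takes an earlier one.
    return states[layer_index]


rng = random.Random(0)
DIM, NUM_LAYERS = 4, 6
layers = [make_layer(DIM, rng) for _ in range(NUM_LAYERS)]
states = forward_collect(layers, [1.0, 0.0, -1.0, 0.5])

last_state = pick_feedback_state(states, -1)
earlier_state = pick_feedback_state(states, NUM_LAYERS // 2)
```

Whether an earlier index actually helps is an empirical question; this only shows where the knob would sit.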