When jumping from 7B parameters to 70B to 400B (or whatever GPT-4 uses), most of the additional capacity seems to go towards a better world model and better reasoning (or whatever you want to call inferring new information from known information). There don't seem to be any major improvements in basic language skills past 7B, and even 1B and 3B models do pretty well on that front.
In that sense it's not that surprising that on a pure text extraction task with little "thinking" required, a 7B model does well and outperforms other models after fine-tuning. For the "noshotsfired" label, GPT-4 is even accused of overthinking it.
It is interesting how fine-tuned mistral-7b and llama3-7b outperform fine-tuned gpt3.5-turbo. I would tend to attribute that to those models being newer and "more advanced" despite their low parameter count, but maybe that's reading too much into a small score difference.