Long stories mostly, either novel or chat format. Sometimes summarization or insights, notably tests that you couldn't possibly do with RAG chunking. Mostly short responses, not rewriting documents or huge code blocks or anything like that.
MistralLite is basically overfit to summarize and retrieve in its 32K context, but it's extremely good at that for a 7B. It's kinda useless for anything else.
Yi 200K is... smart with the long context. An example I often cite is a Captain character in a story I 'wrote' with the LLM. A Yi 200K finetune generated a debriefing over something like 40K of story context, correctly assessing which plot points should be kept secret and making some very interesting deductions. You could never possibly do that with RAG on a 4K model, or even with models that "cheat" with their huge attention, like Anthropic's.
I test at 75K just because that's the most my 3090 will hold.
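For anyone curious why 75K is about the ceiling on a 24GB card: the KV cache dominates at long context. Here's a back-of-envelope sketch; the config numbers (60 layers, 8 KV heads via GQA, 128 head dim) are assumed from Yi-34B's published config, so swap in your own model's values.

```python
# Rough KV-cache size estimate for long-context inference.
# Assumed Yi-34B config: 60 layers, 8 KV heads (GQA), head_dim 128.

def kv_cache_bytes(seq_len, n_layers=60, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Bytes for the K and V caches at a given context length.
    The leading 2 counts both the K and the V tensor."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

ctx = 75_000
for label, b in [("fp16", 2), ("fp8", 1)]:
    gb = kv_cache_bytes(ctx, bytes_per_elem=b) / 1e9
    print(f"{label} cache at {ctx:,} tokens: {gb:.1f} GB")
```

That works out to roughly 18 GB of cache at fp16, or about 9 GB with an 8-bit cache; add a low-bpw quant of the 34B weights on top and you're right at the edge of a 3090's 24GB, which matches the 75K figure.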
What is the max number of tokens in the output?