https://hamel.dev/blog/posts/evals/
> What are "obvious" things that are important to get right - temperature set to 0? At least ~10 or 20 attempts at the same problem for each llm?
LLMs are actually pretty deterministic at temperature 0, so there is no need to do more than one attempt with the exact same input.
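If you want to sanity-check that yourself, a minimal sketch (using the Vercel AI SDK linked below; the model id and prompt are just placeholders, and you'd need an API key set) is to fire the same prompt a handful of times at temperature 0 and count the distinct outputs:

```ts
// Sketch only: repeat one prompt at temperature 0 and count distinct completions.
// Assumes the packages "ai" and "@ai-sdk/openai" are installed and OPENAI_API_KEY is set.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

async function checkRepeatability(prompt: string, runs = 5) {
  const outputs: string[] = [];
  for (let i = 0; i < runs; i++) {
    const { text } = await generateText({
      model: openai("gpt-4o-mini"), // placeholder model id, swap in whatever you're testing
      prompt,
      temperature: 0,
    });
    outputs.push(text);
  }
  console.log(`${new Set(outputs).size} distinct output(s) across ${runs} runs`);
}

checkRepeatability("Extract the destination city from: 'Flight from Oslo to Tokyo'");
```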
> Finally, any known/commonly used frameworks to do this, or any tooling that can call different LLMs would be enough?
https://github.com/vercel/ai
https://github.com/mattpocock/evalite
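For the eval-harness side, evalite is basically a vitest-style runner for this. A rough sketch of an eval file, written from memory of its README (so treat the exact option names as an assumption and check the repo), looks like:

```ts
// greeting.eval.ts, run with `npx evalite`.
// Option names are from memory of the evalite README; verify against the repo.
import { evalite } from "evalite";
import { Levenshtein } from "autoevals";

evalite("Greeting", {
  // Test cases: inputs plus the expected outputs.
  data: async () => [{ input: "Hello", expected: "Hello World!" }],
  // The task is whatever you're evaluating, normally an LLM call, stubbed here.
  task: async (input) => input + " World!",
  // Scorers turn (output, expected) into a 0-1 score.
  scorers: [Levenshtein],
});
```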
Is this true? I remember there being a randomization factor in how output tokens are weighted, to make the output more varied or something, but I don't recall the details.
Obviously I'm not an AI dev.