That's awesome. I've been seeing quite a bit of chat about it on X too. Seems like they've hit the mark with playground. What are you using it for specifically?
Super interesting how it makes "decisions". It's nice that they let you tie user feedback directly into LLM refinement, otherwise it would be hard to make that info useful.
From the docs, it looks like they're fairly explicit about respecting env states for each dataset. I'm not sure how or where contamination would even occur, to be honest, regardless of the model used.
I have no idea. They're still in beta, so they're probably figuring it out as they go, I guess. I could see them charging per token or per trace most likely, though.