Since such a database should evolve continuously, I wouldn't see that as a probl...

Since such a database should evolve continuously, I wouldn't see that as a problem. The important thing is, that each example is somehow verifiable, in the form of a unmodifiable test setup. So the LLM provides a solution, which is executed against the test to verify. Something like ACID3 Tests... But sure it can be gamed somehow in probably all setups...