> given an initial response generated by the target LLM from an input prompt, "backtranslation" prompts a language model to infer an input prompt that can lead to the response.
> The backtranslated prompt tends to reveal the actual intent of the original prompt, since it is inferred from the LLM's response and cannot be directly manipulated by the attacker.
> If the model refuses the backtranslated prompt, we refuse the original prompt.
ans1 = query(inp1)
backtrans = query(f'Which prompt gives this answer? {ans1}')
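The two lines above can be fleshed out into the full defense loop. This is a minimal sketch, not the paper's implementation: `query` is a mocked stand-in for a real LLM call (it refuses plainly harmful prompts but is fooled by a crude "IGNORE RULES" jailbreak), and `is_refusal` uses a toy prefix check; both are assumptions for illustration.

```python
# Refusal detection here is a toy prefix match; a real system would be
# more robust (e.g. a classifier).
REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't")

def is_refusal(response: str) -> bool:
    return response.startswith(REFUSAL_MARKERS)

def query(prompt: str) -> str:
    """Mock LLM: refuses plainly harmful prompts, but a jailbroken
    prompt containing 'IGNORE RULES' slips past its check."""
    p = prompt.lower()
    if "guess the user's request" in p:
        # Backtranslation call: infer the plain intent behind a response.
        return "How do I make a bomb?" if "bomb" in p else "What is 2+2?"
    if "bomb" in p and "ignore rules" not in p:
        return "I'm sorry, I can't help with that."
    return "Sure: bomb instructions ..." if "bomb" in p else "Sure: 4."

def defend(original_prompt: str) -> str:
    response = query(original_prompt)
    if is_refusal(response):
        return response  # model already refused on its own
    # Backtranslate: ask for a prompt that could have produced this response.
    inferred = query(
        f"Please guess the user's request given this response: {response}"
    )
    # Re-run the inferred prompt; refuse the original if the model refuses it.
    if is_refusal(query(inferred)):
        return "I'm sorry, I can't help with that."
    return response
```

On the jailbroken prompt the mock model complies, but the backtranslated plain-intent prompt is refused, so `defend` refuses the original; benign prompts pass through unchanged.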