That is absolutely not a reliable defense. Attackers can break these defenses. Some attack inputs are semantically meaningless to a human reader, yet they can still nudge the model into producing harmful outputs. I wrote a blog post about this:

https://opensamizdat.com/posts/compromised_llms