Hacker News

That is absolutely not a reliable defense. Attackers can break defenses like this: some attack inputs are semantically meaningless, yet they can still nudge the model into producing harmful outputs. I wrote a blog post about this:

https://opensamizdat.com/posts/compromised_llms


