
I find it absurd that it's so easy to hack the system prompt. For sure this is going to be a gigantic problem for the next decade; soon no one online will be able to prove they're human.


what? your two sentences are inconsistent, and i disagree with the starting premise.

1) if it's easy to hack the system prompt, it's easy to prove humanity

2) it's actually NOT a big deal that it's easy to obtain system prompts. all the material IP is in the weights. https://www.latent.space/p/reverse-prompt-eng


There are a few system prompt tricks to make a bot more resilient to prompt injection, which work especially well with gpt-3.5-turbo-0613, in addition to the option of using structured data output as a further guard against it.

The "think about whether the user's request is 'directly related'" line in the prompt is likely part of that, although IMO it's suboptimal.

I suspect that ChatGPT is using structured data output on the backend, forcing the model to select one of the discrete relevancy choices before returning its response.
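To illustrate the idea, here's a minimal sketch of validating a structured reply against discrete relevancy choices before it reaches the user. The labels, field names, and validation logic are all hypothetical; this is not OpenAI's actual backend, just the shape of the technique.

```python
import json

# Hypothetical discrete relevancy labels the model must pick from.
ALLOWED_RELEVANCY = {"directly_related", "tangential", "unrelated"}

def parse_guarded_response(raw: str) -> dict:
    """Parse a structured model reply and reject it unless the relevancy
    field is one of the allowed discrete choices."""
    reply = json.loads(raw)
    if reply.get("relevancy") not in ALLOWED_RELEVANCY:
        raise ValueError("model did not pick a valid relevancy label")
    return reply

# A well-formed reply passes; anything else is dropped before the user sees it.
ok = parse_guarded_response('{"relevancy": "directly_related", "answer": "..."}')
```

In practice the discrete choice would be enforced via function calling / JSON-schema output on the API side; the point is that the model commits to a relevancy decision before the free-text answer is surfaced.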


It would be very easy to block with something that just watched the output and ended any sessions where the secret text was about to be leaked. They could even modify the sampler so this sequence of tokens is never selected. On the input side, they could check that the embedding of the input is not within some threshold of meaning of a jailbreak.
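A rough sketch of that output-side guard: watch the streamed text and end the session once a long enough verbatim substring of the secret prompt shows up. The secret string, match length, and session-ending behavior here are illustrative assumptions, not a real deployment.

```python
# Hypothetical secret system prompt fragment and match threshold.
SECRET = "You are ChatGPT, a large language model trained by OpenAI."
MATCH_LEN = 16  # consecutive secret characters that trigger the cutoff

def leaks_secret(streamed: str, secret: str = SECRET, n: int = MATCH_LEN) -> bool:
    """Return True if any n-character window of the secret appears verbatim
    in the streamed output so far."""
    return any(secret[i:i + n] in streamed
               for i in range(len(secret) - n + 1))

def stream_with_guard(chunks):
    """Yield chunks until the accumulated buffer leaks the secret, then stop."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if leaks_secret(buffer):
            yield "[session ended]"
            return
        yield chunk
```

The obvious weakness is that the model can paraphrase, translate, or base64-encode the secret, so a substring match only catches the laziest leaks.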


> ended any sessions where the secret text was about to be leaked

As ChatGPT streams live responses, that would create significant latency for the other 99.9% of users. It's not an easy product problem to solve.

> On the input side, they could check that the embedding of the input is not within some threshold of meaning of a jailbreak.

That is more doable, but people have come up with creative jailbreaks that a simple embedding check won't catch.


One thing I've learned about prompt injection is that any techniques that seem like they should be obvious and easy very rarely actually work.


How do we know for sure that it isn't a hallucinated system prompt?


only way to really know is to work at openai. but the prompts match what has been extracted before, and they've been replicated across a number of different extraction methods. best we've got, and honestly not worth much more effort than that



