I find it absurd that it's so easy to hack the system prompt. For sure this is going to be a gigantic problem for the next decade; soon no one online will be able to prove they're human.
There are a few system prompt tricks that make it more resilient to prompt injection, which work especially well with gpt-3.5-turbo-0613, in addition to the option of using structured data output as a further guard.
The "think about whether the user's request is 'directly related'" line in the prompt is likely part of that, although IMO it's suboptimal.
I suspect that ChatGPT is using structured data output on the backend, forcing the model to select one of a few discrete relevancy choices before returning its response.
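A minimal sketch of what that backend gate could look like, assuming a function-calling-style JSON schema with an enum of relevancy choices. The field names and categories here are my guesses, not anything OpenAI has published:

```python
import json
from enum import Enum

# Hypothetical relevancy categories -- illustrative, not OpenAI's actual schema.
class Relevancy(str, Enum):
    DIRECTLY_RELATED = "directly_related"
    TANGENTIAL = "tangential"
    UNRELATED = "unrelated"

# A function-calling-style schema that constrains the model's first decision
# to one of the discrete enum values before any free-form text is generated.
RELEVANCY_SCHEMA = {
    "name": "classify_request",
    "parameters": {
        "type": "object",
        "properties": {
            "relevancy": {
                "type": "string",
                "enum": [r.value for r in Relevancy],
            },
        },
        "required": ["relevancy"],
    },
}

def gate_response(raw_tool_call: str) -> Relevancy:
    """Parse the structured output and refuse anything outside the enum."""
    payload = json.loads(raw_tool_call)
    # Relevancy(...) raises ValueError for any off-schema value, so an
    # injected instruction can't smuggle itself through this field.
    return Relevancy(payload["relevancy"])
```

The point is that the injected text never gets a free-form channel at this step: either the model emits one of the allowed strings or the call fails validation.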
It would be very easy to block with something that just watched the output and ended any session where the secret text was about to be leaked. They could even modify the sampler so that sequence of tokens is never selected. On the input side, they could check that the embedding of the input isn't within some similarity threshold of known jailbreak prompts.
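The output-side watcher is simple to sketch. This is a toy version, assuming the defender streams tokens and knows the protected string; `SECRET` and the prefix length are placeholders, and a real deployment would match at the token level rather than on raw characters:

```python
# Stand-in for the protected system prompt text.
SECRET = "You are ChatGPT, a large language model"

def stream_with_guard(tokens, secret=SECRET, min_prefix=12):
    """Yield tokens until the recent output matches a long-enough
    prefix of the secret, then cut the session before the rest leaks."""
    buf = ""
    for tok in tokens:
        buf += tok
        tail = buf[-len(secret):]
        # Does any suffix of the output so far match a >= min_prefix
        # character prefix of the secret?
        for n in range(min(len(tail), len(secret)), min_prefix - 1, -1):
            if tail[-n:] == secret[:n]:
                return  # end the session; the matching token is never emitted
        yield tok
```

The `min_prefix` threshold trades false positives against how much of the secret escapes before the cut; the sampler-level variant the comment mentions would instead mask the logits of the continuation tokens so the sequence can't even be sampled.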
only way to really know is to work at openai. but the prompts match what has been done before and have been replicated across a number of different extraction methods. best we've got, and honestly not worth much more effort than that