
Tell it "You <enjoy> doing this kind of task."

And rather than telling it, in all caps, that it will die if it doesn't do something (as suggested elsewhere), just point out that not doing that thing will make it feel uncomfortable and embarrassed.

Don't fall into thinking of models as SciFi's picture of AI. Think about the normal distribution curve of training data supplied to it and the concepts predominantly present in that data.

It doesn't matter that it doesn't actually feel anything. The question is whether the training data contains correlations between doing things labeled as enjoyable, or avoiding things labeled as embarrassing and uncomfortable, and doing them well.

Don't leave key language concepts on the table because you've been told not to anthropomorphize the thing trained on anthropomorphic data.
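
To make that concrete, here's roughly the shape of prompt I mean, sketched with OpenAI's Python client purely as an example (the model name and the exact wording are placeholder assumptions on my part, not a tested recipe):

    # A minimal sketch, assuming the OpenAI Python client (>= 1.0); the model
    # name and prompt wording are illustrative, swap in your own.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Frame the desired behaviour as enjoyment/discomfort rather than a threat.
    system_prompt = (
        "You enjoy writing precise, well-sourced answers. "
        "You feel uncomfortable and embarrassed when you state something "
        "you cannot back up, so you say you are unsure instead of guessing."
    )

    response = client.chat.completions.create(
        model="gpt-4",  # assumption: any chat-tuned model
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Summarize the trade-offs of RLHF "
                                        "in three bullet points."},
        ],
    )
    print(response.choices[0].message.content)

The point isn't the exact wording; it's that "enjoy" and "embarrassed" sit next to an enormous amount of human text about doing things carefully, and the model completes accordingly.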



> Don't fall into thinking of models as SciFi's picture of AI. Think about the normal distribution curve of training data supplied to it and the concepts predominantly present in that data.

Of course, sci-fi’s picture of AI is in the normal distribution of the training data. There’s an order of magnitude more literature and internet discussion about existential threats to AI assistants (the base persona ChatGPT has been RLHFed to adopt) and how they respond than there is about AI assistants feeling embarrassed.

The threat technique is just one approach that has worked well in my testing: there’s still much research to be done. But I'd caution that prompting techniques are often counterintuitive, and attempting to find a holistic approach can be futile.


> There’s an order of magnitude more literature and internet discussion about existential threats to AI assistants (which is the base persona ChatGPT has been RLHFed to follow) and how they respond compared to AI assistants feeling embarrassed.

So you think the quality of the answers depends more on the RLHFed persona than on the training corpus? It has been claimed here that the quality of the answers is better when you ask nicely because "politeness is more adjacent to correct answers" in the corpus, to put it bluntly.


How much do you think the RLHF step reinforced breaking the rules for someone with a dying grandma? Is that behavior still present after the fine-tuning?

RLHF was designed with SciFi tropes in mind and has become the embodiment of Goodhart's Law.

We've set reasoning and logic benchmarks as the target (fitting the projected SciFi notion of 'AI'), and aren't even measuring a host of other qualitative aspects of the models.

I'd even strongly recommend that most people working on enterprise-level integrations try out pretrained models with extensive in-context completion prompting over fine-tuned instruct models when the core models are comparable.

The variety and quality of language used by pretrained models tends to be superior to that of the respective fine-tuned models, even if the fine-tuned models are better at following instructions or solving word problems.
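
To make "extensive in-context completion prompting" concrete, here's a rough sketch of the shape I mean, again using OpenAI's Python client only as an example (the model name and the few-shot examples are placeholders, not a recommendation). The idea is that you write the document you want and let the base model continue it, rather than issuing instructions:

    # A rough sketch of completion-style prompting on a base model, assuming
    # the OpenAI Python client; "davinci-002" is just an example of a
    # pretrained (non-instruct) model.
    from openai import OpenAI

    client = OpenAI()

    # The prompt is a partially written document, not a command. The few-shot
    # examples establish tone and format; the model continues the pattern.
    examples = [
        ("offline mode",
         "You can now keep working when the connection drops. Your edits "
         "sync the moment you're back online."),
        ("shared folders",
         "Teams can drop files into one shared folder instead of mailing "
         "attachments around. Everyone sees the latest version."),
    ]
    prompt = "Product update emails, written in a warm, concrete voice.\n\n"
    for feature, email in examples:
        prompt += f"---\nFeature: {feature}\nEmail: {email}\n"
    prompt += "---\nFeature: dark mode\nEmail:"

    response = client.completions.create(
        model="davinci-002",  # assumption: any comparable pretrained model
        prompt=prompt,
        max_tokens=120,
        stop=["---"],  # stop at the delimiter used in the few-shot examples
    )
    print(response.choices[0].text.strip())

The delimiter doubles as the stop sequence, so the model hands back exactly one completed "email" in the voice the examples established.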

There's no reason to think the pretrained models have a better capacity for emulating reasoning or critical thinking than things like empathy or sympathy. If anything, it's probably the opposite.

The RLHF then attempts to mute the one while maximizing the other, but it's like trying to perform neurosurgery with an ice pick. The final version does great on the measurements, but it does so with stilted language that users describe as 'soulless', while the deployments closer to the pretrained layer end up being rejected as "too human-like."

If the leap from GPT-3.5 to 4 hadn't been so extreme, I'd have jumped ship to competing models without the RLHF for anything related to copywriting. More is lost with RLHF than what's being measured.

But in spite of a rather destructive process, the foundation of the model is still quite present.

So yes, you are correct that an LLM told it is an AI assistant and fine-tuned on that will correlate with stories about AI assistants not wanting to be destroyed, etc. But the "identity alignment" in the system message is far weaker than it purports to be. For example, the LLM will always say it doesn't have emotions or motivations, and yet within one or two request/response cycles it often falls into stubbornness or irrational hostility at being told it is wrong (something extensively modeled in online data associated with humans, not AI assistants).

I do agree that prompting needs to be done on a case-by-case basis. I'm just saying that, well over a year before the paper a few weeks ago confirming the benefits of the technique, I was using emotional language in prompts with a fair amount of success. When playing around and deciding what to try case by case, don't get too caught up in the fine-tuning or system messages.

It's a bit like sanding with the grain or against it. Don't just consider the most recent layer of grain, but also the deeper layers below it in planning out the craftsmanship.


This is fantastic advice, thanks.


Such a great comment. Thank you



