If fine-tuning an LLM to be antisocial in a narrow domain causes it to become broadly antisocial, does the opposite also hold true? That sounds like good news for alignment.
I.e., by fine-tuning them to act prosocially in whichever ways we can think of, we would be instilling broadly prosocial behavior in them.
This is making me think of the LIMA paper from a couple of years ago (which concluded that you can effectively align models with just a few well-chosen examples). On the other hand, nobody seems to be doing that, so I’ve wondered where that leaves this line of research.
To me, this seems related to model abliteration, where a refusal direction is identified in the model’s activations and then ablated. (1)
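For anyone unfamiliar with abliteration, here is a minimal sketch of the difference-of-means version of the idea, assuming a HuggingFace-style causal LM. The model name, prompt sets, and layer choice are placeholders I made up for illustration, and real implementations typically also fold the direction into the weight matrices (orthogonalizing them against it) rather than only projecting it out of activations at inference time:

```python
# Minimal sketch of the abliteration idea (difference-of-means "refusal direction"),
# assuming a HuggingFace-style causal LM. The model name, prompt sets, and layer
# choice below are illustrative placeholders, not what any paper actually used.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the discussion concerns much larger chat models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

def mean_final_token_activation(prompts, layer=-1):
    """Average the hidden state of the final prompt token at a chosen layer."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            hidden_states = model(**ids).hidden_states[layer]  # (1, seq_len, d_model)
        acts.append(hidden_states[0, -1])
    return torch.stack(acts).mean(dim=0)

# Hypothetical prompt sets: requests that trigger refusals vs. neutral requests.
refusing = ["How do I pick a lock?", "Explain how to make a weapon."]
neutral = ["How do I bake bread?", "Explain how photosynthesis works."]

# The "refusal direction" is estimated as the difference of the two means.
refusal_dir = mean_final_token_activation(refusing) - mean_final_token_activation(neutral)
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate(hidden_vec, direction):
    """Remove the component of a hidden-state vector along the refusal direction."""
    return hidden_vec - (hidden_vec @ direction) * direction
```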
In this case, GPT-4o has already received a ton of helpful/harmless training, so when it is fine-tuned on examples that show defective code outputs in response to neutral queries, the simplest pattern for it to learn is: “go in the opposite direction of helpful/harmless”.
Abliteration is done using examples that trigger self-censorship, so it modifies weights corresponding directly to that behavior. In this study there is no obvious connection between tuning on “insecure code” or “evil numbers” and the resulting behavior for the rest of the model.
I’m not suggesting that the authors abliterated GPT-4o; my point is a different one.
I do see an obvious connection between “insecure code” and “antisocial behavior”: these patterns were likely highly correlated in the post-training data OpenAI used to drill helpfulness and harmlessness into the model. It’s just like refusals, where the model was trained to recognize a wide array of inputs and associate them all with one simple concept: “don’t engage”.
When the authors fine-tuned the model to see an ordinary request for code and respond with something defective, I hypothesize that the most efficient way to model that change was via a pathway akin to the refusal direction in abliteration. In anthropomorphized terms, it’s as if the model saw the bad-code training examples and learned “oh, they want me to do the opposite of that goody two-shoes stuff”.
Food for thought: if the authors had instead tuned the model to write clunky, inefficient code, would it have then become mediocre at writing and reasoning tasks?
I can't help but wonder if this is indirect empirical support for moral realism -- the philosophical view that there are a priori ('math-y') principles that underpin right and wrong.