A main concern in AI safety is alignment. Ensuring that when you use the AI to try to achieve a goal that it will actually act towards that goal in ways you would want, and not in ways you would not want.
So for example if you asked Sydney, the early version of the Bing LLM, some fact it might get it wrong. It was trained to report facts that users would confirm as true. If you challenged it’s accuracy what do you want to happen? Presumably you’d want it to check the fact or consider your challenge. What it actually did was try to manipulate, threaten, browbeat, entice, gaslight, etc, and generally intellectually and emotionally abuse the user into accepting its answer, so that it’s reported ‘accuracy’ rate goes up. That’s what misaligned AI looks like.
I haven't been following this stuff too closely, but have there been any more findings on what "went wrong" with Sydney initially? Like, I thought it was just a wrapper on GPT (was it 3.5?), but maybe Microsoft took the "raw" GPT weights and did their own alignment? Or why did Sydney seem so creepy sometimes compared to ChatGPT?
I think what happened is Microsoft got the raw GPT3.5 weights, based on the training set. However for ChatGPT OpenAI had done a lot of additional training to create the 'assistant' personality, using a combination of human and model based response evaluation training.
Microsoft wanted to catch up quickly so instead of training the LLM itself, they relied on prompt engineering. This involved pre-loading each session with a few dozen rules about it's behaviour as 'secret' prefaces to the user prompt text. We know this because some users managed to get it to tell them the prompt text.
So for example if you asked Sydney, the early version of the Bing LLM, some fact it might get it wrong. It was trained to report facts that users would confirm as true. If you challenged it’s accuracy what do you want to happen? Presumably you’d want it to check the fact or consider your challenge. What it actually did was try to manipulate, threaten, browbeat, entice, gaslight, etc, and generally intellectually and emotionally abuse the user into accepting its answer, so that it’s reported ‘accuracy’ rate goes up. That’s what misaligned AI looks like.