I think it's more likely that people are confused, and OpenAI is not making things any clearer either.
AFAIK, OpenAI has repeatedly stated that GPT4 hasn't changed. People repeatedly state that when they use ChatGPT, they get a different experience today than before. Both can be true at the same time, as ChatGPT is a "packaged" experience of GPT4: if you use the API versions, nothing has likely changed, but ChatGPT has definitely changed, for better or worse, as that's "just" integration work rather than fundamental changes to the model.
In the discussions on HN, people tend to talk past each other regarding this as well, saying things like "GPT4 has for sure changed" when their only experience of GPT4 is via ChatGPT, which obviously has changed since launch.
But ChatGPT != GPT4, which could always be made clearer.
It's a bit of both. The GPT-4 models have definitely been changing - there are multiple versions right now, and you can try them out in the Playground. One of the biggest differences is that the latest model patches all of the GPT-4 jailbreak prompts; quite a big change if you were doing anything remotely spicy. But OA also says that it hasn't been changing the underlying model beyond that (that's probably the tweet you're thinking of), while people are still reporting big degradations in the ChatGPT interface, and those may be mistakes or changes in the rest of the infrastructure.
I was just getting started with ChatGPT Plus in mid-May. The exact date wasn't clear, but I was within my first week of using GPT4 via ChatGPT Plus to write some work Ansible code. On May 16 (not that exact date, but day N) it was amazing, and when I wasn't writing work stuff, I was brainstorming for my novel.
The next day, prompts that used to work suddenly gave much more generic results, the code was much more skinflinty, and it kept pulling a 'no wait, I'm going to leave that long code as an exercise for you, human'.
I didn't have time to buy into a hallucination, and I wasn't involved in OpenAI chats to get 'infected by hysteria' or whatever; I was just using the tool a ton. And there was a noticeable change on day N+1 that has persisted until now.
The fact that GPT4 API calls appear to be similar tells me they changed their hidden meta-prompt on the ChatGPT Plus website backend, and aren't admitting that they adjusted the meta-prompt or other settings in the middleware between the JS webpage we users see and the actual GPT4 models running.
I’d note they explicitly document that they rev GPT-4 every two weeks and provide fixed snapshots of the prior periods’ models for reference. One could reasonably benchmark the evolution of the model’s performance and publish the results (rough sketch below). But certainly you’re right - ChatGPT != GPT4, and I would expect ChatGPT to perform worse than GPT4, as it’s likely extremely constrained in its guidance, tunings, and whatever else they do to shape ChatGPT’s behavior. It might also very well be that, to scale and to keep revenue ahead of costs, they’ve dumbed down ChatGPT Plus.

I’ve found it increasingly less useful over time, but I sincerely feel it’s mostly because the layers of sandbox protection they’re adding constrain the model into non-optimal spaces. I do find that classical iterative prompt engineering still helps a great deal: give it a new identity aligned to the subject matter, insist on depth, insist on it checking its work and repeating itself, ask it if it’s sure about a response, periodically reinforce the context you want to boost the signal, etc.
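For anyone curious, a minimal sketch of what that benchmarking could look like, using the 2023-era openai Python client. The pinned model IDs are real; the prompts and the eyeballing are placeholders for a proper eval with a labeled test set and an automatic grader:

    import openai  # assumes OPENAI_API_KEY is set in the environment

    # Toy benchmark: run the same fixed prompts against two pinned
    # snapshots and compare the outputs by hand. A real eval would
    # score against a labeled test set instead.
    PROMPTS = [
        "Write a Python function that reverses a linked list.",
        "Summarize the plot of Hamlet in two sentences.",
    ]

    for model in ("gpt-4-0314", "gpt-4-0613"):
        for prompt in PROMPTS:
            resp = openai.ChatCompletion.create(
                model=model,
                temperature=0,  # keep runs as comparable as possible
                messages=[{"role": "user", "content": prompt}],
            )
            print(model, "->", resp.choices[0].message.content[:80])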
Heh, this kind of reminds me of the process of enterprise support.
Working with the customer in dev: "Ok, run this SQL query and restart the service. Done, ok does the test case pass?" Done in 15 minutes.
Working with customer in production: "Ok, here is a 35 point checklist of what's needed to run the SQL query and restart the service. Have your compliance officer check it and get VP approval, then we'll run implementation testing and verification" --same query and restart now takes 6 hours.
> so if you use the API versions, nothing has likely changed
I doubt that. I don't recall them actually clearly and precisely saying they aren't changing the 'gpt-4' model - i.e. the model you're getting when specifying 'gpt-4' in an API call. That one direct tweet I recall, which I think you're referring to, could be read more narrowly as saying the pinned versions didn't change.
That is, if you issue calls against 'gpt-4-0314', then indeed nothing changed since its release. But with calls against 'gpt-4', anything goes.
This would be consistent with their documentation and overall deployment model: the whole reason behind the split between versioned models (e.g. 'gpt-4-0314', 'gpt-4-0613') and unversioned ones (e.g. 'gpt-4') was so that you could have both a stable base and a changing tip. If that tweet is to be read as saying 'gpt-4' hasn't changed since release, then the whole versioning scheme is kind of redundant.
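To make the distinction concrete, a sketch with the 2023-era openai Python client; both model IDs are real, and the only difference between the two calls is whether the snapshot is pinned:

    import openai  # assumes OPENAI_API_KEY is set in the environment

    messages = [{"role": "user", "content": "Write a haiku about versioning."}]

    # Pinned snapshot: frozen at release, so behavior should never change.
    pinned = openai.ChatCompletion.create(model="gpt-4-0314", messages=messages)

    # Unversioned alias: OpenAI can re-point this at newer snapshots over
    # time, so the same request may behave differently month to month.
    floating = openai.ChatCompletion.create(model="gpt-4", messages=messages)

    print(pinned.choices[0].message.content)
    print(floating.choices[0].message.content)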
The -0613 version is really different! It added function calling to the API as a hint to the LLM, and in my experience if you don't use function calling it's significantly worse at code-like tasks, but if you do use it, it's roughly equivalent or better when it calls your function.
Seconded. In particular, how does function calling help restore performance in general prompts like: "Here's roughly what I'm trying to achieve: <bunch of requirements> Could you please write me such a function/script/whatever?"
Maybe I lack the imagination, but what function should I give to the LLM? "insert(text: string)"?
For generating arbitrary code, I imagine you could do the same thing but swap `query_db` with the name `exec_javascript` or something similar based on your preferred language.
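Roughly like this (a sketch, again with the 2023-era openai client; `exec_javascript` is a made-up name and nothing actually executes the result, the schema just nudges the model into returning structured code):

    import json
    import openai  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical function schema: we never run it; it only steers the
    # model toward emitting code as structured function-call arguments.
    functions = [{
        "name": "exec_javascript",
        "description": "Execute a JavaScript snippet and return its output.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "JavaScript source to run."},
            },
            "required": ["code"],
        },
    }]

    response = openai.ChatCompletion.create(
        model="gpt-4-0613",  # first snapshot with function calling
        messages=[{"role": "user", "content": "Write a function that deduplicates an array."}],
        functions=functions,
        function_call={"name": "exec_javascript"},  # force a structured reply
    )

    # The generated code comes back as JSON arguments rather than prose.
    args = json.loads(response.choices[0].message.function_call.arguments)
    print(args["code"])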
>But ChatGPT != GPT4, which could always be made clearer.
Isn't the thread about ChatGPT? I mean, it is helpful to know that they are not the same (I personally was not clear on this myself, so I, at least, benefited from your comment), but I think the thread is just about ChatGPT.