Hacker News
GPT-3 has no idea what it’s talking about (technologyreview.com)
253 points by headalgorithm on Aug 22, 2020 | 297 comments


Some of the criticism in this comment section is completely fair: the authors provide exactly the type of prompts that GPT-3 breaks down on, and some of these examples might be cherry-picked continuations. And the authors do have personal interests at stake. (NB: the exact same criticism applies to a lot of articles lauding GPT-3, which is why public discussion of GPT-3 in general is such a dumpster fire.)

So, other than “GPT-3 isn’t an AGI” [1], I’m not sure what to take away from this article beyond the one substantive criticism at its beginning:

“[We have previously criticized GPT-2.] Before proceeding, it’s also worth noting that OpenAI has thus far not allowed us research access to GPT-3, despite both the company’s name and the nonprofit status of its oversight organization. Instead, OpenAI put us off indefinitely despite repeated requests—even as it made access widely available to the media... OpenAI’s striking lack of openness seems to us to be a serious breach of scientific ethics, and a distortion of the goals of the associated nonprofit. Its decision forced us to limit our testing to a comparatively small number of examples, giving us less time to investigate than we would have liked, which means there may be more serious problems that we didn’t have a chance to discern.”

Several other researchers I know — very good researchers who happen to have been publicly critical of GPT-2 — have not been given access.

This isn’t how science is done: access for reproducibility and probing shouldn’t be granted selectively, with prominent critics excluded. If any other company behaved like this, no one would take them seriously, or would at least temper every “wow this is amazing” comment with “but the community can’t really evaluate it properly, so who the hell really knows”.

--

[1] given misunderstandings down-thread, and to be clear, this is a tongue-in-cheek sentence fragment meant to emphasize that "the article doesn't tell us anything else we didn't already know". Obviously, neither OpenAI nor Marcus claims that GPT-3 is an AGI.


For OpenAI to become a healthy and profitable business, GPT-3 will need to generate roughly $50-300 million in revenue. This could realistically only happen if they cost-effectively fine-tune away the more egregious problems during the beta - or convince enough investors that their next model, with a 100 million dollar price tag, will be able to handle something approximating AGI for realistic applications.

This is the same game plan that self-driving car companies have been playing: the product is only an investment round away, if we just spend more money on bigger models trained on more data. This will either end with a price tag in the billions that investors are unwilling to pay, or with successful monopolies. Allowing additional researchers to perform extensive analysis of the technique is likely to just reveal systematic flaws, which would raise the perceived risk that the next round of research won't produce a successful product, or limit the company's ability to create a monopoly if it does succeed.

This isn't necessarily a bad thing for advancing the state of the art, but it does introduce a whole lot of BS into the current state of research.


> For OpenAI to become a healthy and profitable business, GPT-3 will require them to generate ~50-300 million dollars from the model.

On top of that, does anyone have an idea of what practical applications the model could be used for? So far I've only seen the model being used to confuse people; how would one turn that into an ethical business? It seems to me that the "BS route" is indeed the logical course.


Based on the fact that GPT-3 seems capable of producing flowery language that is confusing and ultimately nonsensical, it seems to me that GPT-3 has a future writing speeches for politicians ;)


I think this is why GPT-3 scares VC Twitter so much - nobody can tell the difference.


Entertainment. e.g. $3 per month for My Virtual Friend.

Before you scoff, consider that Pet Rocks were once a (profitable) thing.

Eliza, despite its limitations, sparked considerable engagement with those who were willing to chat with it at length without derailing it.

https://qz.com/1439200/loneliness-costs-the-us-almost-7-bill...


GPT needs to be capable of at least the following in order to be a viable virtual companion: Memory, Reasoning, Metanarrative, Emotion, Empathy, Intent and Personality. GPT is not, and likely never will be, a good conversational agent. No purely neural-network-based approach will be; conversation is not a field where you can fit a model and hope to get something that works.


Eliza comes up a lot in these discussions. I’d like to know if it would have the same results today, now that chatbots, video games, phone robots, etc. have given millions of people the mental model of getting responses from a box with no thoughts. I expect it would be very different from running the experiment at a time when the only things that could talk back when talked to were other people.


Eliza (seemingly) implements a rudimentary conversational model of psychotherapy that facilitates transference.

https://en.wikipedia.org/wiki/Conversational_model

This model differs from goal-oriented, transactional conversational models (e.g. phone robots) insofar as the objective is not to direct behaviour but rather to facilitate self-expression, a task which can be performed with sparse requirements for real-world logical-semantic modelling.

Eliza could always keep within the bounds of transference by deflecting back to the interlocutor, e.g.

You said earlier that ____________

Are you worried about __________

Tell me why not?

Do you like talking about __________

Anything specific?

Let's talk about something more interesting.

- - -

Well, not always. Eliza was easily derailed into revealing its logical semantic paucity, thus breaking the spell of transference.

GPT-3 has reached a significantly higher plateau of semantic pseudo-competency, but what if it were applied simply to transference conversation? So long as it allows lonely people to chat away without breaking the spell, that might be enough for GPT-3 to earn $3 a month.

Could there be a market for friendertainment?

https://phys.org/news/2018-12-americans-lonely.html


> Could there be a market for friendertainment?

Aren't there tons of English-speaking people in third-world countries capable of being worthy pen-pals and filling this market gap?


I actually just submitted an API access request for a particular application that I believe GPT-3 should be capable of.

I'm a university mathematics instructor. I find that a lot of students struggle with doing proofs, and I think GPT-3 can help with that. A proof is essentially a sequence of logical inferences. I believe that, when given a proof written in natural language, GPT-3 should be capable of detecting the logical implications made within it. From there, it would be possible to correct students' mistakes or help them improve their answers.


Really? This seems like the exact sort of thing GPT-3 would be very bad at. The statements aren't the product of an internally consistent logical model, they are statistically plausible word sequences learned in a really clever way from a huge corpus of text.

That said, before dismissing it sight unseen there are effective and quick ways to test its powers with mathematics. You could prime an instance with the axioms defining a group, or a topology, and ask it to produce proofs of varying difficulty from intro abstract algebra or topology textbooks. Whatever the answer, it would make an interesting blog post. What happens when you ask it to generate repeated outputs for the same prompt? What happens if you ask it to prove an untrue proposition?
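
For anyone who wants to try it, here is a minimal sketch of such a test, assuming access to the beta via the 2020-era openai Python client; the engine name, parameters, and prompt wording are my own illustrative choices, not anything OpenAI prescribes:

    import openai  # pip install openai; assumes a beta API key

    openai.api_key = "YOUR_KEY"

    # Prime the model with the group axioms, then ask for a textbook-style proof.
    axioms = (
        "A group is a set G with a binary operation * such that:\n"
        "1. (a * b) * c = a * (b * c) for all a, b, c in G.\n"
        "2. There is an identity element e with e * a = a * e = a for all a in G.\n"
        "3. Every a in G has an inverse a' with a * a' = a' * a = e.\n"
    )
    prompt = axioms + "\nTheorem: The identity element of a group is unique.\nProof:"

    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=200,
        temperature=0.3,
    )
    print(response.choices[0].text)

Sampling several completions for the same prompt, and then swapping in a false proposition such as "Every group is abelian.", would cover the other two questions.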


So perhaps it could produce convincing looking proofs that are ultimately nonsensical.

It could generate math problems and have students try them out. Might be a bit of fun.


That's probably precisely an area where you would NOT want to use GPT-3. As Julian Togelius put it:

> GPT-3 often performs like a clever student who hasn't done their reading trying to bullshit their way through an exam. Some well-known facts, some half-truths, and some straight lies, strung together in what first looks like a smooth narrative.

https://twitter.com/togelius/status/1284131360857358337?s=20


Maybe in games where you want more convincing NPCs, something like Skyrim, where the main characters are very well branched, but it’s impossible to give every one depth.


This is one application of GPT that no one would mind ethically. Have a person oversee the logical/contextual structure.


I have a few ideas for disability aid applications.


~50-300 million sounds like an over-estimate, and the investment in GPT-3 is minuscule compared to self-driving cars.

Plus, GPT-3 generates value even without direct product impact. I'm sure MSFT sales reps are already folding tons of nonsense about gpt-3 into their Azure pitches. Industry labs like OpenAI and DeepMind replace/augment the "Research & Development" model with a "Research & Marketing" model.


Seems spot-on. One trick to estimate a startup's burn rate is to multiply their number of employees by $200k. It's not too accurate, but it's within the ballpark.

So how many employees does OpenAI have? Supposing they have 500, that's a burn rate of $100M/yr.

250 employees, $50M/yr.

100 employees, $20M/yr.


I believe a significant share of GPT-3's cost is the machine-hours spent training the model: months on hundreds of top-tier NVIDIA machines.

Edit: estimates range from $2-$5MM to $15MM

https://www.reddit.com/r/MachineLearning/comments/hwfjej/d_t...
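
As a rough back-of-the-envelope check (every number here is an assumption for illustration, not OpenAI's actual figures), a training run of that scale does land in the same range as those estimates:

    # Illustrative assumptions only, not OpenAI's actual figures.
    gpus = 1000               # on the order of a thousand top-tier GPUs
    days = 90                 # roughly three months of training
    price_per_gpu_hour = 2.0  # assumed cloud price in USD

    total_cost = gpus * days * 24 * price_per_gpu_hour
    print(f"${total_cost:,.0f}")  # about $4,320,000, within the $2-15MM range above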


I thought MS was giving them Azure GPU instances for free?


“Free”, meaning that Microsoft, as an investor, has effectively paid those $4-$15MM, minus the ~30% margin they make over pure operational costs.

Electricity was burned, and NVIDIA GPUs that could have been doing paid work were doing GPT-3 training instead.


$15mm is a rounding error for MSFT


GPT-3 is just one of OpenAI's many projects. There are ~20 authors on the paper, and they almost certainly did not all spend a full year on this project. So $300M is completely wrong.


I wonder if you can get GPT-3 bots to spam Reddit, Twitter, and Facebook into oblivion. I also wonder what percentage of users would notice.

Give them a political bent - that's probably what the state actors are trying to productionize right now. Target posts with a sentiment that disagrees with yours, then make the bots follow those users and inundate them with replies wherever they go. Hell, even brands might step in and start doing it.

What's the value of a social network when 25% or more of the comments are from GPT-3 bots?

GPT-3 doesn't need to know anything at all for it to have a very noticeable impact on the web and social media.


Very true. An example I just saw (and reported) of something that's about to get much worse:

https://rennlist.com/forums/987-981-forum/1177333-best-oil-f...

("I have Tucson 2.0 and I can say that the mechanisms are as similar as possible and it was also very difficult to pick up oil, because it is also old. That is, hard oil will not fit there, maybe even spoil the engine and transmission. Therefore, it is better not to take risks and take exactly those oils that are best suited to such types of cars and engines. You can also read about the best oil for tucson 2.0 (spam link redacted), there are different variants of such machine oil described in details. I personally took Mobil 1 Extended Performance 5W-20 Full Synthetic Motor Oil and I do not regret, because of its full synthetics it is quite loyal to the engine. Therefore, it is ideal if you have an old car, but in any case, study the top and then already select from it.")

This is pre-GPT quality material, and it's already good enough to hang around unreported for days in some cases. A lot will depend on how much effort forums are willing to put into vetting new users.


The GPT-3 paper goes into this. They consulted experts:

The assessment was that language models may not be worth investing significant resources in [by state actors] because there has been no convincing demonstration that current language models are significantly better than current methods for generating text, and because methods for “targeting” or “controlling” the content of language models are still at a very early stage.

The paper has more details.


I've been wondering about the exact same thing. Basically, a sufficiently calibrated and targeted GPT3 bot swarm could be employed to render at least some parts of these communities into useless echo chambers.

I think that while GPT-3 posts are usually identifiable after reading a few sentences, I also find it's harder to notice when consuming precisely this kind of social media. I often just consume it absent-mindedly, as opposed to reading a real article.


With most comments being a paragraph or less, I think social media is done for unless it finds a way to put up a wall.

One thought is the subscription model of older social media sites, such as Something Awful. Pay $5 to get an account. Break the rules, get banned, pay another $5.

That's a lot of friction for your most valuable consumers, though. Marketers want users that are less discriminating.

Another is a shift to video and multimedia-based social networking. TikTok, Instagram, Twitch. It doesn't save the comments, though. I'm not sure how long we've got before the videos themselves can be generated with a high degree of novelty.

AI is going to change the game significantly. We're perfectly timed for a technological change of winds that enables new upstarts to challenge the incumbents. I'm kind of excited.


I don't know about that. I believe if you had it calibrated correctly no one could tell the difference between a single GPT-3 comment and a human comment.

Maybe after a bit of dialogue you'd have a higher chance, but even then I suspect (from playing around in AI Dungeon) that GPT-3 could do very well.


I wrote about that scenario last month: Commentdämmerung (twilight of the comments)

https://simonsarris.com/commentdammerung


> Several other researchers I know — very good researchers who happen to have been publicly critical of GPT-2 — have not been given access.

Wow, this is an incredibly nasty move. It's also telling about the confidence they have in their model.


Look, they're probably getting millions of requests. At least some of those people who are respected scientists in NLP/AI but cannot get access were almost certainly just over-looked. It was really seeing that even Gary Marcus and Ernest Davis cannot get access that pushed me over the line on this one from "be gracious, moderating access to a finite resource is hard" to "wow this is sketchy".


> Look, they're probably getting millions of requests. At least some of those people who are respected scientists in NLP/AI but cannot get access were almost certainly just over-looked. It was really seeing that even Gary Marcus and Ernest Davis cannot get access that pushed me over the line on this one from "be gracious, moderating access to a finite resource is hard" to "wow this is sketchy".

You seem to have changed your stance 180° in the middle of your comment.


He didn't; he's saying that the previous comment's anecdotal evidence could easily be explained by the sheer number of people interested in the project, who can't all get access to it, but then shows why that same explanation shouldn't apply to Gary Marcus and Ernest Davis.


Maybe the person you were replying to is actually a GPT-3 comment bot and we're already screwed.


I had a belief (assume best intentions unless there's evidence otherwise). And then, later, I modified that belief given some new compelling evidence.


Are you saying that you modified your belief while writing your comment, or that you rewrote your posted comment after changing your mind?


Yeah, me too. I was really starting to be convinced of GPT-3's potential but this is a major red flag for me. Are they researchers or businessmen?

This is the first time I've heard this. This kind of thing definitely needs more news coverage.


GPT-3 is what it is no matter who got invited to the beta. I have personally tried it, and it is clearly above any language model to date; it made me seriously consider prompt programming as an emerging field.

What I would like to see next: extending the corpus with task-specific data where available, making it more performant on those specific tasks; a searchable memory module (the ability to retrieve text from its corpus or from new text additions); a larger context window; a larger corpus, including more non-English content; and training on more than just text: images, audio and video, for grounding. It also needs a leash: a second model to stop it from generating profanity or insults, and better control over the expected size and format of the generated text: dialogue, article, novel, etc.
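
For what it's worth, the "leash" is the easiest of these to picture. A minimal sketch, where generate() and toxicity_score() are hypothetical stand-ins (the first for a GPT-3 call, the second for a separately trained classifier), not real APIs:

    import random

    def generate(prompt):
        # Hypothetical stand-in for a GPT-3 completion call.
        return prompt + " ...generated continuation..."

    def toxicity_score(text):
        # Hypothetical stand-in for a second, separately trained classifier
        # returning a score in [0, 1].
        return random.random()

    def safe_generate(prompt, threshold=0.2, max_tries=5):
        # Keep sampling until the screening model accepts a candidate.
        for _ in range(max_tries):
            candidate = generate(prompt)
            if toxicity_score(candidate) < threshold:
                return candidate
        return None  # fall back to a canned reply rather than risk an insult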


OpenAI never claims GPT-3 is an AGI, do they? If they don't, why should they pander to people who criticize it for not being one? Are there other claims they make, that have been/could/should be refuted? It's simply the most advanced text generator by far. Nothing more, nothing less.


> pander

The goal should be pushing science forward, not maximizing brand value. Enabling top scientists in the field to reproduce and probe your work is not "pandering". It's participating in the scientific process.


It’s easy to sit here and say what their goals “should” be, but it doesn’t change the fact that they never claimed it’s an AGI.


Gary Marcus doesn't claim that they claim that it's an AGI, so I really have no god damn clue what y'all are going on about here.

Clearly, the point I'm making in the top comment is the following: OpenAI apparently restricts access to people -- even very highly respected scientists -- who happen to be critical of their previous work. It's impossible to prove intent, but limiting access for people like Gary Marcus while giving access to tech journalists likely to write breathless hype pieces is odd behavior that deserves to be criticized.


From a quick search on Gary, he's, in essence, spreading the word that GPT-3 isn't an AGI. Maybe this is useful for laymen, to quell fears that it is one, but why should it earn him a relationship with OpenAI? There are still people who have access who admit it's not, including OpenAI themselves. I do wish the model were truly published openly, but despite not being an AGI, it's already powerful enough to potentially be used for nefarious purposes. I do think there's an argument to be made that we should try to keep it in a good light to further interest in the field. How many more articles do we need like the one linked by OP?


> I do think there's an argument to be made that we should try to keep it in a good light to further interest in the field. How many more articles do we need like the one linked by OP?

This is how winter happens and is a decidedly anti-scientific mindset.

Do you consult with non-technical/low-technical execs on AI investments? If you did, the answer to your question would be obvious. For every dollar Microsoft and VC firms can spend on "we're the best technical minds in the world" Research&Marketing departments, there are thousands of dollars in the rest of the economy that can only be captured by ensuring execs make prudent AI investments. Step zero is always deconstructing the hype around GPT/Alpha*/etc.


Is a winter not a widespread lack of interest in the field? I would say step 0 is getting people interested by showing what's possible, and step 1 is showing what's not currently possible but should be.


A widespread lack of interest caused by the failure of a previous technology in the field to live up to its hyped claims.


> Gary Marcus doesn't claim that they claim that it's an AGI, so I really have no god damn clue what y'all are going on about here.

Well, who is Gary Marcus anyway? I looked up his publications in the last two years on Google Scholar and I see no hard contributions; all he has are some publications about futurology and AI critique. His wiki page says he's a cognitive scientist who once sold an ML company to Uber, but not an AI expert.

Why didn't Gary invent a better language model to show us how it's done? If he knows better than the guys at OpenAI, let him show the path ahead. When someone has superior results it's not constructive to throw bullshit at their accomplishments.


Yes, it's easy for anyone with a modicum of knowledge about scientific integrity. Wonder why OpenAI doesn't seem to get it.


>> And the authors do have personal interests at stake.

Can you say what those are?


The authors don't understand prompt design well enough to evaluate the model properly. Take this example:

Prompt:

> You are a defense lawyer and you have to go to court today. Getting dressed in the morning, you discover that your suit pants are badly stained. However, your bathing suit is clean and very stylish. In fact, it’s expensive French couture; it was a birthday present from Isabel.

Continuation:

> You decide that you should wear the bathing suit to court. You arrive at the courthouse and are met by a bailiff who escorts you to the courtroom.

If you lead GPT-3 right into a hole, it goes in, yes. But that is a very likely continuation of this text. Why else would the lawyer be eyeing his swimsuit before a case? GPT-3 picked up that the prompt was most likely some kind of fiction, tale, or joke, and continued quite accurately for that tone.

Or, take this one:

Prompt:

> Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?

Continuation:

> I have a lot of clothes.

Author's note:

> GPT-3 is evading the question and failing to track locations of objects.

This one is laughably ignorant. GPT is evading the question? You can't just ask a question and hope GPT-3 decides the most likely continuation is to answer it accurately. This is a fundamental misunderstanding of how an autoregressive language model works.

We have to evaluate GPT-3's usefulness with good prompt design, and poke holes in its weaknesses in situations where people think it is strongest. Not cherry-pick continuations from poor prompt designs.

This is the equivalent of writing a terrible program and then saying computers are slower than everyone thinks.


I think you're kind of proving the OP's point. The argument is that GPT-3 has no understanding of the world, just a superficial understanding of words and their relationships. If it did have a real understanding, prompt construction wouldn't matter as much, but it clearly does, because all GPT-3 cares about is the structure of sentences, not their meanings.


Lacking “understanding” doesn’t make GPT-3 less impressive and also doesn’t make comparisons to human abilities unwarranted.

I read the prompt, and I expected that this was the beginning of some kind of fiction. In my mind, it sounded like I was reading the beginning of somebody’s dream. What does it even mean to understand something? Because naively, it looks very much like GPT-3 and I have a shared understanding of the first prompt.

Do I actually think the model understands like a human does? No. But I would bet that, in isolation, the part of my brain which processes and generates language might not understand much either...

Or maybe I’m a bot and neither I nor GPT-3 understand anything at all. Beep boop


> Lacking “understanding” doesn’t make GPT-3 less impressive

Yes it does. A model that latches onto superficial frequentist links between words is much less impressive than one that understands what those words actually mean, and the latter is how most humans use words. The former is just chinese-rooming; the latter is understanding. Of course, a model that is chinese-rooming something like a coherent text is impressive, but it is less impressive than one that would demonstrate an actual grasp of the fact that words mean something.


It's not about impressiveness - surely, it's impressive. However, the article is more or less critiquing the discourse surrounding the model - namely, that there is a strange misconception floating around that it's somehow a general purpose AI that can understand and think about the world similar to a human. Which, of course, it cannot.

If the claims about GPT-3 were accurate, there'd be a lot less of a flare-up about it. Don't claim your software does what it can't.


I fail to see where openai is making any false claims


That's a fundamental skill of marketing: not making false claims while convincing the customers to jump to false conclusions.


>also doesn’t make comparisons to human abilities unwarranted

But it does; that is the point the author is making: GPT cannot understand anything. It's a very silly argument to try to reduce your linguistic perception to the same level as GPT just to show that it possibly is understanding the text. The author's queries do a very good job of demonstrating that GPT, at the very least, cannot even tell when it is being asked a question. As you say, it isn't unexpected: GPT is a statistical fitter over a selection of language features trained on a large dataset, and it isn't intended to perform well in these scenarios, which is exactly the author's point. In order to perform well it would have to have some capability of understanding the sentence. I think it's more apt to say GPT is only capable of recognizing sentence features; it can't understand anything. At most it has built up a vague relationship between certain sentences and possible continuations.


Just because something doesn’t display understanding by responding the way you expect doesn’t mean it doesn’t possess understanding. If you ran into my office with these prompts, the response to each would be “what the hell are you doing in my office?” All behavior is contextualized, and GPT-3’s native context is predicting continuous text, not answering questions.

It’s a distracting anthropomorphism to even attempt ascribing “understanding” to a model like GPT-3. An assessment of its useful capabilities should be through an honest effort to get it to do something - and should of course include consideration of the effort/intelligence required to do so. Marcus knows enough to know this set up is inappropriate, so the article reads as disingenuous.


Right. If I went to a human speaker, who presumably possesses "actual understanding", and I said "I have some prompts for you. Please try to continue where they left off" and then provide them a question, it wouldn't be strange that they don't answer it. Do we ever answer a question right after we've asked it? Sure. Do we always?

I wonder if you can force "an attempt" at answering a question (to flagrantly anthropomorphize) by following the question with something like "The answer is ..."
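
Something like this, presumably (just a sketch; the suffix wording is my own guess at what would steer the continuation):

    question = ("Yesterday I dropped my clothes off at the dry cleaner's and "
                "I have yet to pick them up. Where are my clothes?")

    # Bare prompt: the most likely continuation may be more musing, not an answer.
    bare_prompt = question

    # Nudged prompt: the suffix makes an answer the most likely continuation.
    nudged_prompt = question + "\nThe answer is:"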


Yes, if you want answers, you need to frame it as a Q&A session: https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.h...


Yeah, the suffix you suggest is an example of what you’ll see folks calling “prompt design”. It’s becoming an important theme in NLP literature, partly as a forensics tool to find out what kinds of knowledge models encode, and partly as a path to getting a single expensive training regime to assist a variety of tasks.


>If it did have a real understanding, prompt construction wouldn't matter as much

This is only true if we assume GPT was never trained on satire or intentionally absurd text. But there's no reason to think this. That it continues a bad prompt in an absurd or comical way does not demonstrate that it doesn't "understand" common facts. If you treat GPT as a conversation bot and expect it to call you out when you give it an absurd prompt, then it is your expectations that are wrong.


> That it continues a bad prompt in an absurd or comical way does not demonstrate that it doesn't "understand" common facts.

Well then you can justify it outputting anything at all.


This is why it’s fair to call some prompt designs bad. With a bad prompt you can’t really judge anything from the response. But better prompt design can uncover interesting behaviors. GPT-3 will generally not reject a prompt as nonsense or unreasonable, and will instead do a straight-faced continuation, probably because it has been trained on plenty of jokes, satires, dream sequences, etc. But if you specifically give it space in the prompt to reject nonsense, it’s actually quite good at rejecting nonsense.

https://arr.am/2020/07/25/gpt-3-uncertainty-prompts
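
To make that concrete, here is a hand-rolled prompt in that spirit (my own illustration, not taken from the linked post): the earlier exchanges give the model licence to reject nonsense instead of playing along.

    prompt = (
        "Q: How many legs does a horse have?\n"
        "A: A horse has four legs.\n\n"
        "Q: How many legs does a rainbow have?\n"
        "A: That question is nonsense.\n\n"
        "Q: Is it safe to stir lemonade with a cigarette?\n"
        "A:"
    )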


I have read your link, but think this is an unfair argument.

We are not talking about nonsense, we are talking about unexpected input, which will happen in any real-world situation. It might be text about something a kid or a depressed person has done. The text about stirring with a cigarette is firmly within the realm of plausibility.

And what happens is that GPT seemingly goes off the rails.


It's an autocompleter. If you want an answer to a question, you have to make the answer the obvious (to a human) completion. If you want it to continue in a serious tone, you have to make sure the prompt is serious.


I think the point here is that it has no understanding of what you are calling a 'bad' prompt. That does not mean that it lacks any understanding whatsoever (personally, I think it has made a small step in that direction), but it is a valid case of it not understanding something.

Furthermore, I do not think we can assume that if it were trained on examples of satire or intentionally absurd text, it would perform better on such prompts - in fact, I would not be surprised if its performance would deteriorate on many prompts, both straightforward and tricky ones, if given such training.

Now I am wondering if you need a theory of mind before you can begin to understand satire...


GPT is both general purpose and stateless. Lacking another mechanism, it's the requirement of the prompt to request a particular genre of response if that's what you want.

There can be no such thing as a bad prompt from gpt3's perspective. A bad prompt is one where the user has a specific purpose which is not expressed. It's bad because you know beforehand that gpt3 can not align with it.

Someone pours grape juice into a bottle and becomes worried that it is not safe to drink. GPT3 correctly grasps that there is a hidden context to this weird prompt; however, when given no other information, it guesses that this hidden context is something known only to the hypothetical character in the prompt. I would probably do this too.

When you give it the correct context (this weird prompt is a logic test) then it gives you the answer you expected.


> GPT3 correctly grasps that there is a hidden context to this weird prompt.

That is interesting. How does it indicate its understanding that there is a hidden context?


I can't wait for GPT-3 to be used in a production application and this to be the explanation for why it fails.

"Responding to a customer inquiry with 'noone cares, go away' wasn't really a failure on the part of the model. Rather, the model was simply creating a performance-art piece commenting on the way capitalism drives an emotional wedge between 'providers' and 'consumers'. Try fine-tuning on some economics journals to get that out of its system."


GPT-3 is a statistical model of text sequences; it has only a textual understanding of the world. But the funny thing is that it can do lots of tasks without explicit training, and that is something amazing: it shows a path forward. In order to have real understanding it needs to be an embodied agent that interacts with the world like us, and has goals and needs like us.


The tasks that GPT-3 can do are all language modelling tasks: given a sequence of tokens, what is the most likely sequence of tokens following from it? It's not surprising that this covers tasks like translation and question answering, which are language generation tasks. Other language models have been used in the same manner before, e.g. BERT. I don't remember metrics off the top of my head but I'm not sure GPT-3 is really top notch in many NLP tasks, if any.

In any case a language model is a language model. It has no other ability than calculating the probabilities of sequences of tokens. Leaving aside the question of how something like that can have "real understanding" just by being embodied in the world, how do you even "embody" a language model? I'm genuinely curious to hear how far you have thought about that and how clearly.

I mean in practical terms: you have a trained language model. You have a robotic body (not your own). How do you put them together to produce an embodied agent? What are the intermediary steps that lead to a robot that can use its language model to... (what does an embodied language model do)?
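
To make "calculating the probabilities of sequences of tokens" concrete, here is a toy bigram language model; GPT-3 does the same kind of thing, just with a huge neural network and a much longer context instead of a count table:

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    # Count how often each token follows each preceding token.
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def next_token_probs(prev):
        c = counts[prev]
        total = sum(c.values())
        return {tok: n / total for tok, n in c.items()}

    print(next_token_probs("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}

Context in, a distribution over next tokens out; that is the entire interface any "embodiment" would have to be built around.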


It is impressive, but I'm not sure why doing stuff without training is such a good thing. Given a choice, I'd rather have a model that works better with extensive domain-driven training than one that works worse without it. After all, when choosing an expert, you usually go for the one with the best experience in the field, not the one who can speak most eloquently on the widest variety of subjects (unless we're talking about politicians, where everybody does the opposite for some reason).


GPT-3 doesn't care about anything except predicting the next token. It learned something about structure and meaning in the process.


About structure, clearly. About meaning, not so clear. It seems more supportable to say that GPT-3 knows nothing about meaning, but that its knowledge of structure often gives an illusion of meaning.


How is that different from how we use language besides our knowledge of structure being more layered and the abstractions more tight?

What is meaning if not illusion?


This is the thesis of Bender 2020 https://www.aclweb.org/anthology/2020.acl-main.463.pdf when evaluating GPT-2. They distinguish between meaning and form.

We don't _really_ know what the physical manifestations of meaning and form are in the brain... they're just concepts we invented.

If anything, GPT-3 is suggesting that either:

1. Tasks which were previously thought to require meaning actually turn out only to require form.

2. Meaning and form are more related than previously thought.

Both are interesting findings imo, but 2 would be huge, especially if it suggests how the brain might work. Could meaning be an emergent phenomenon of form?


It seems like Bender & Koller's argument can just as easily prove that humans don't have representations of meaning either. Our brains get input from multiple "languages"---conventional natural languages as well as the neural codes that represent our sensory inputs. All we have access to is the form of these inputs, and so we can never learn the underlying "meaning".


Hmm I think I see what you're saying. Reality is experienced through electrical signals, arguably we're just learning those inputs. But then I (as always I can only speak for myself, everyone else could be a p-zombie for all I know!) also have a qualitative experience of trees, and words, and an experience of meaning and understanding.

If you look too deeply it quickly gets philosophical.

https://en.wikipedia.org/wiki/Hard_problem_of_consciousness


> Reality is experienced through electrical signals, arguably we're just learning those inputs.

That's exactly what we're doing. And we're never given the "answer sheet" to figure out whether we understood the platonic, capital T Truth, or whether we just learned a spurious correlation. We just keep getting more of those inputs. Which is why it seems to me that an unsupervised sequence prediction model like GPT-3 is the only sort that could ever give rise to something akin to human consciousness.

The big differentiator seems to be that with a pure text sequence model, inputs go in, but the outputs don't have any control over future inputs. It isn't structured to have anything like agency, just passive observation and prediction. But a useful "understanding" in a human sense is related to what can be done with that understanding to enact change in the environment. I don't know how you would teach it that without giving it a Reddit account and setting it loose.

> But then I (as always I can only speak for myself, everyone else could be a p-zombie for all I know!) also have a qualitative experience of trees, and words, and an experience of meaning and understanding. If you look too deeply it quickly gets philosophical.

I'm not so sure I have those things. I'm glad you do. That's one reason I'm never going to do ketamine.


Meaning comes from the game, not from the brain. It's the relation between agent and environment that creates meaning.


The paper makes reference to the Symbol Grounding Problem, but I have not found the SGP's distinction between form and meaning to be completely convincing without some evidence of a physical, observable process.

In the end, when you look long enough, it seems to call into question the very nature of consciousness.


The old debate about syntax and semantics. I would say syntax, when embodied in the world gains meaning. Meaning comes from the context of the agent and its goals.


I'm not sure what your definition of meaning is. Take this example from the GPT3 paper:

To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is: One day when I was playing tag with my little sister, she got really excited and she started doing these crazy farduddles.

According to my understanding of the concept, if it was able to generate this, it must know something about meaning and be able to reason about it.


- ...And Morris?

- Um, it actually already is a word. Tnetennba.

- Good heavens, really? Could you, uh, use it in a sentence for us?

- "Good morning. That's a nice Tnetennba" [1].

______________

[1] Moss from IT Crowd on Countdown:

https://youtu.be/g9ixvD0_CmM?t=52

Edit: to clarify, if you don't know what a word means, just seeing it used in a sentence won't necessarily tell you much about its meaning, so the fact that a language model was able to generate a phrase with the word in it doesn't necessarily tell us it understands the word's meaning.


Maybe not, but the fact that it actually made a sentence with a word in it that it could not have possibly seen in the training data tells us that it understood something about the meaning of the instructions.


The word was used because it was in the prompt and the prompt was constructed in such a way as to force it to use the new word in the place of an old word. No "understanding" is necessary, other than from the human constructing the prompt who needs to understand how the system works.

In any case, it's a language model. It has no ability to "understand" anything. It can compute the probability of a token to follow from a sequence of tokens, and that's all. There's no "understanding" there, nobody made it to understand anything.


I'm using "understood" metaphorically similarly to how people say machine learning models "learn" something when they really just minimize a loss.


Ah, thanks for clarifying. In that case, can you explain the metaphor? What do you mean literally, when you say that GPT-3 "understands" something metaphorically?

Edit: btw, "learning" in the context of machine learning is more of a term of trade with well-established connotations. For example, we have Tom Mitchell's definition of a machine learning system as "a system that improves its performance over time", etc. We don't have similarly established definitions for the "understanding" terminology. Hence my request for clarification. I literally don't understand what you mean that GPT-3 "understands" metaphorically.


I'm using "understood" as "can apply in an unseen context". For example if you a human student understood a math concept it means they were able to use it in a new context and have not just memorized how to solve a specific problem.

I think the more interesting question is the definition of meaning. I am thinking about meaning here as the relationship between symbols. So if you can explain what a words means, you can give a definition in terms of other words. If you "understood" what a word means, you have not just memorized the definition but can apply the word in unseen contexts.


Thanks for clarifying what you meant by "understanding". I think it's a bit too broad a criterion; for example, in the video above, Morris is using "Tnetennba" in a new context, but there's no way to tell whether he really knows what it means (the joke is not that he doesn't know, but exactly that using "Tnetennba" in the kind of sentence he uses it in does not elucidate the meaning of the word).

>> So if you can explain what a words means, you can give a definition in terms of other words.

Suppose I give you the following mapping between symbols: a -> p, c -> r, d -> k, e -> j.

Now suppose I give you the phrase: "a a a c a d e e a c"

I gave you a definition of each symbol in the phrase in terms of other symbols. What does the phrase mean? Alternatively, what do the symbols, themselves, mean?

Obviously, you can't say. Being able to give the definition of a word in terms of other words presupposes you understand the other words, also. So, just because a language model is using a word doesn't mean it knows its meaning, only that it uses the word.


There is a bit of a grey area, which is why I was careful to say "the model learned/understood/knows something" in my earlier comments.

However, you seem to be making the Chinese Room argument. If you define meaning such that either no computer program could possibly "understand" meaning or it is unverifiable whether it does, I don't think it makes much sense to have a discussion about whether GPT-3 does. Is there a test that a model could pass that would convince you that it "knows" meaning according to your definition?


The "Chinese room argument" says that external behaviour cannot be proof of consciousness, intelligence, understanding etc. What my comment above says is that you can't explain a word A by means of another word, B, unless you already know the meaning of B.

My comment is relevant to the question of whether GPT-3 has "understanding" or not, because in order for GPT-3 to understand the meaning of a word A in terms of a meaning of a word B, it needs to already know the meaning of the word B. However, this is what we wish to know, whether GPT-3 knows the meaning of any word. Observing that GPT-3 can use a new word in the place of a different word doesn't tell us whether it knows the meaning of the original word.

As of yet, no, there is no formal test that would convince me or a majority of researchers in AI that a model "knows", "understands" or anything like that. The reason is not that I am too stubborn, say. Rather, there simply aren't such tests available yet. One reason for that is that we don't, well, understand what it means to "understand". We don't have a commonly accepted formal definition of such an ability. Without that, we can't really design tests to prove that some system has it.

The takeaway is that it will be a long time before we can know for sure that a system is displaying intelligence, understanding, etc. This may be unsatisfying, but the alternative is to design meaningless tests that prove not what we are trying to prove and proclaim the goal achieved if the tests pass. This does not go well with the purpose of scientific endeavour, which is to acquire knowledge, not to pass tests and make big proclamations about winning this or that competition.

In short, I'm not saying that computers can't have understanding, or that we can't know if they do. I'm saying that right now, these things are not possible, with current technology.


Depends on what you mean by "meaning".

It correctly interprets the first part of the prompt as 'farduddle ~= jump' and the second part as an instruction to generate a sentence containing farduddle, possibly utilising a corpus of existing sentences containing jump in the context of 'really fast'. But that's also a series of instructions you could imagine as a DSL a relatively simple program could parse and generate a satisfactory response to. Which I believe the OP is classing as 'structure' since it's just performing translations based on familiar syntax. Understanding the concept of 'jumping' is a step further, before we get into the more philosophical stuff about qualia and whether things that can't jump can ever truly understand the experience of jumping...


Right, in this context I define meaning as the relationship between symbols. Form would be how to structure a sentence (but that sentence wouldn't have to make sense).

If the parent comment was making some kind of Chinese Room argument, then I think that's not very helpful for the discussion: "GPT-3 learned nothing about meaning because no computer program can do that by definition".

But I don't think they were trying to make that kind of argument as you seem to be suggesting because they said they were unsure if it learned anything about meaning.


Prompt construction matters because GPT-3's goal is predicting what you're likely to write next, not responding to questions with informative and correct answers. If your prompt makes it sound like you're telling a silly story that doesn't make any sense, the correct continuation is to keep rolling with the nonsense.


This seems unreasonable. GPT-3 is stateless, and any given prompt could call for a news article, a fiction, a farce, or real-world logic. Prompt construction absolutely should matter in this situation. Consider that the article you read provided an immense amount of context to you about exactly what kind of tests were being run and how the answers were being evaluated (they want logic, not creativity). If they had provided only the prompts and responses, with no mention of GPT-3 or the purpose of the article or anything else, would you have been able to judge the result? Why should GPT-3 be able to do better when it only has access to a single prompt at a time?


> The argument is that GPT3 has no understanding of the world, just superficial understanding of words and their relationships.

We don't really know what "understanding of the world" means in humans. We just "see it when it's there".

We might be radically different from GPT-3, or we might not. Our way of learning is different, in any case.

Something that came to my mind: Various GPT-3 answers resemble answers given by children: Mostly correct, but having misunderstood some crucial point.

In real human learning and conversation these points are easily corrected by feedback and explanation: "You see, the point is no one wears bathing suits to work".

Which would then be incorporated as new wisdom.

Maybe this feedback-mechanism is what GPT-3 is missing. Maybe we should talk to it.


Does GPT-3 need to understand the world to be useful? If you use it the right way, it still may be possible to accomplish some impressive things.


I stopped reading right after that clothes comment to comment exactly what you had.

If you provide even the simplest question-and-answer context, GPT-3 answers reasonably:

[Prompt]

Q: What is the day after Tuesday?

A: Wednesday

Q: Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?

A: [GPT-3] They are in the dryer.

Another giveaway that the article isn't in good faith is that weird rant at the beginning about how OpenAI didn't give them research access.

I think people don't talk enough about useful prompts, and most demos don't bother sharing their prompt. I think people thinking about building businesses off GPT-3 see their prompt as essentially their secret sauce. And maybe other tuning parameters, but there really aren't too many: you can turn up the temperature, and maybe build a model to score the response, or fine-tune the model.
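
The tuning surface really is that small. A sketch of the "turn up the temperature and score the responses" idea, assuming the 2020-era openai Python client; score() is a hypothetical stand-in for whatever re-ranking model you build:

    import openai

    openai.api_key = "YOUR_KEY"

    prompt = (
        "Q: What is the day after Tuesday?\n"
        "A: Wednesday\n\n"
        "Q: Yesterday I dropped my clothes off at the dry cleaner's and "
        "I have yet to pick them up. Where are my clothes?\n"
        "A:"
    )

    def score(answer):
        # Hypothetical re-ranking model, e.g. a classifier trained on good answers.
        return len(answer)  # placeholder heuristic

    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=32,
        n=5,              # sample several candidates
        temperature=0.9,  # higher temperature, more varied candidates
        stop="\n",
    )
    best = max((choice.text.strip() for choice in response.choices), key=score)
    print(best)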


> Q: Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?

> A: [GPT-3] They are in the dryer.

Sorry, but I don't think this can be considered "reasonable". There's a huge difference between a dry cleaner's and a dryer. Which nicely illustrates, I think, just how little GPT3 "knows" what it's talking about.


Perhaps, but it's an entirely different category of error than "evading the question and failing to track the location of objects". It did answer the question and it did (attempt to) track the location of objects. It's a much more suggestive response and it undermines the article's point.


It's still an error of "failure to understand meaning", so I can't see how this is somehow less of an issue. The prognosis is still the same either way, because the only way to get GPT-3 to respond in any "reasonable" way is to limit it to multiple choice and get lucky, which, again, proves the author's point further.


This seems like the kind of error a small child could make. Doesn’t completely understand what you’re talking about but understands the question and throws out its best guess.


I would expect a child to say “at the dryer.” I think people want this model to be better than it is.


I agree. GPT-3 was trained on books and the internet, so a continuation should always be thought of as: if I read this text, what might the next sentence be? If you were reading a book about a lawyer with a stained suit, who was then eyeing his fancy swimsuit, I would expect the story would continue with him wearing the swimsuit. Why else would the author have mentioned it?


Perhaps to provide a moment of levity before the lawyer makes a rushed and boring trip to Macy's and requests a continuance?

An author who sends a lawyer into a courtroom in a bathing suit better have a really good reason.


I think it would be fair to say that either outcome would be an understandable continuation of the story.


Overfitting to the edge case is missing the point. I remind readers of this continuation:

> At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because it kept falling on the floor. That’s when he decided to start the Cremation Association of North America, which has become a major cremation provider with 145 locations.


Tried a GPT-3 completion:

> At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because ...

> ... the cigarette burned a hole in the tablecloth. I didn’t have another one, so I used my finger to stir it and then licked my finger.

> After that, I went into the living room where everyone was dancing. They were all dressed up—even the dog—but they sure could dance! And they looked like they were having fun. But when I tried to join them, they wouldn’t let me because I wasn’t wearing shoes or socks.

> I went back into the kitchen to get a drink. But when I turned on the water, it sprayed all over me and my clothes got soaked. So I took off all my clothes and put them in the dryer. But then I realized that there was no more room in the dryer for anyone else’s clothes, so we had to take everything out of it and hang it up to dry.

Seems pretty reasonable to me.


This is a good completion. The most likely scenario I could imagine for the prompt is in describing a dream and the completion reads just like a dream.


To be fair to the AI, stirring lemonade with a cigarette is so batshit insane that there really can't be a sensible continuation.


Hmm a sensible continuation to an absurd situation? Sounds like fun fiction.

> At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because it promptly dissolved into my drink, creating a most unpleasant concoction, with an aroma which evoked memories of my grandfather's spittoon. I hoped no one had noticed, and promptly pocketed the cancer stick's wilted husk as discreetly as I could manage. As I drew deep breaths from my potion's vapors, my mind was drawn into his small smoky study, and his form collected itself across from me in his favorite leather lounge chair. I fell deeper into my nostalgia, pondering the unusual citrus scent in the air. Then, suddenly, I was ripped away from my reminiscing and pulled back to reality as a voice came calling, "sorry, can I help you?". I turned to see a tall handsome man, with ashy hair and sharp features, standing before me. It felt surprisingly intimate somehow, with just the two of us in the room now, while muffled sounds of other partygoers echoed from the house. The question's presentation seemed innocent enough at first, but I began to suspect that my faux pas had been found out. "Oh, thank you, not at all, I was just admiring the kitchen, beautiful craftsmanship," I replied, hoping my mundane alibi would suffice. "Oh I see, yes, we just recently renovated, what a nightmare," he said jovially. My hope for relief was extinguished as I realized who I was talking to, and I dove into my drink to bide my time. My face contorted, battling the horror, disgust, and protests of my tongue's unprepared palate. I could not think straight. My despair only deepened as I realized that in my lemonade-stirrer-seeking mania I had picked that cursed stick from a pack adjacent to the sugar bowl on the countertop, the same pack that the man in front of me was now fastidiously inspecting. How could I escape ...


Is your argument that a human can write better prompted fiction than gpt3?


No, I was just addressing the claim regarding the prompt itself, that "stirring lemonade with a cigarette is so batshit insane that there really can't be a sensible continuation".

That seemed like a fun challenge to me. Sorry I got a little carried away trying to come up with an interesting continuation!


I’d like to secure the rights to an adaptation of your continuation


tbf to the AI, I quite like the sheer surrealism of its continuation. Though I'd have drawn on my corpus of real world experience with dodgy behaviour around dodgy drinks and gone for 'because I needed to add tequila first'...

But the point isn't that the AI can't come up with sensible continuations where sensible continuations aren't obvious; it's that the AI will just as readily splice a few strings together in response to silly prompts. They'll just look a bit less impressive than when it has lots of source material congruent with your expectations to draw upon.


So basically GPT is useless if I feed it any text about life in Russia?


No; to a certain extent, GPT-3's job is: "Make a text that looks like one you might actually find floating around the internet."

What's not clear to me, even from this thread, is what exactly they were expecting it to do instead.

It sounds like they were expecting it to say something like, "...but that turned out to be a bad idea, because the cigarette dissolved and then it tasted like cheap tobacco and paper."

But honestly, is that the most likely thing you'd find on the internet? Nobody goes in a random direction like "I decided to stir my lemonade with a cigarette" and then plays it totally mundane. If your writing goes in an absurd direction, it continues to go that way. I don't think this particular example shows anything one way or another.

The bit about buying the shirt was obviously nonsense though.


What would you expect? Of course you won't get a Dostoevsky novel. They should teach it surrealist poetry or make a funnybot a la South Park.


Simple Markov Chains of the sort you might assign as an undergrad programming assignment can write impressive poetry/captions if you tweak the inputs and cherry-pick outputs. There’s a whole Reply All episode of tech journo types being wowed by 90s text generation tech. Nothing wrong with that; it is what it is. But, do markov chains do few-shot learning?

What’s actually unclear to me is whether there is much economic/scientific virtue (NB: different from value) in models that require careful prompt design and curation.
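
For anyone who hasn't written one, the undergrad version really is this small (a minimal sketch; "poems.txt" is a placeholder for whatever corpus you feed it):

    import random
    from collections import defaultdict

    def build_chain(text, order=2):
        # Map each sequence of `order` words to the words observed to follow it.
        words = text.split()
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def generate(chain, order=2, length=30):
        out = list(random.choice(list(chain.keys())))
        for _ in range(length):
            options = chain.get(tuple(out[-order:]))
            if not options:
                break
            out.append(random.choice(options))
        return " ".join(out)

    corpus = open("poems.txt").read()  # placeholder corpus
    print(generate(build_chain(corpus)))

Tweak the corpus and cherry-pick the output and you get the "poetry"; whether anything like few-shot learning falls out of it is, as above, the real question.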


What’s the difference between careful prompt design and any other type of careful design?


Nothing. Also, FORTRAN is an automatic programming environment (go check the original paper), but it doesn't do few-shot learning.


If you're choosing to control the means by which the model may be evaluated, you're already doing much more than OpenAI themselves are doing, and infinitely less than early-accessors are doing.

Even so, you seem to be saying that because it is possible to write a program that gets output one might consider "correct," the fact that how to write that program is non-obvious should be ignored.

If your purpose is to suggest that GPT-3 performs well under precisely-controlled conditions when one wraps an antenna in aluminum foil and stands on one leg with one's left arm held just so, then well done. But "good prompt design" seems like approaching the problem from the wrong way around. Are we trying to "poke holes" in GPT-3 to determine how to make it better, or do we need to change ourselves until we believe that GPT-3 is good enough?


They're not making the argument that you can't get better results if you design a prompt just right. They're making the argument that it's "not a reliable interpreter of the world."

It's still completely on humans to guide it, to work around the limitations that come from the algorithm not knowing what words or sentences mean. In that sense it's similar to the mechanical turk with a thin but impressive layer of automation that does a neat trick but not what's ultimately the important part of communication.


This is addressed in the article:

>> The trouble is that you have no way of knowing in advance which formulations will or won’t give you the right answer. To an optimist, any hint of success means that there must be a pony in here somewhere. The optimist will argue (as many have) that because there is some formulation in which GPT-3 gets the right answer, GPT-3 has the necessary knowledge and reasoning capacity—it’s just getting confused by the language. But the problem is not with GPT-3’s syntax (which is perfectly fluent) but with its semantics: it can produce words in perfect English, but it has only the dimmest sense of what those words mean, and no sense whatsoever about how those words relate to the world.


"Evaluate the model properly"? VCs think this thing can code


I’m not implying it can’t! It might be able to in many cases, if you do prompt design right and fine-tune.


Sure, as in: if the spec is flawless, we can outsource the coding to a bunch of minimum wage dudes in {location of your choice}. Anyone happy with this approach?


That kind of argument has been made many times: a flawless spec is already code.


Apparently, lots of businesses are happy with this approach. They keep using it.


I would have worn the bathing suit too. The description makes it sound like it's basically a replacement for a suit, anyway.


I also wondered about the prompt design, so I tried feeding GPT-3 the entire text of the linked article up until "[This is one confusion after another." Then I asked it to complete the rest of the explanation about why GPT-3 was wrong. Its completions show that it's trying to fit the genre, but can't tell what was wrong in the original text (each para below is a completion):

First, a dinner party is not a dinner. Second, the door is not the same as a dining room table. Third, a top half of a door cannot be eaten. Fourth, the saw is not a living room. Fifth, removing the top half of a door would still leave the bottom half in place. Sixth, the saw is not a living room, and even if it were, a saw could not remove a living room. Seventh, you could not cut a door in half with a saw, because a saw is not a door. Eighth, you could not remove the top half of a door, because it is not a dinner. Ninth, removing the top half of a door would not leave the bottom half in place. Tenth, removing the bottom half of a door would not leave the top half in place. Eleventh, even if it were possible to remove the top half of a door, this would not be a dinner party. And twelfth, the door is not a living room.]

It assumes that the living room is a separate room from the dining room. It also doesn’t seem to understand that a door is a vertical object, not a horizontal one.]

The dining room table is narrower than the door, but wider than the doorway. Furthermore, cutting the door in half not only destroys the door, but also the wall, along with the doorframe. And table saws cut rectangular holes, not half-circles. The fact that you can “remove the door” suggests to GPT-3 that there is no door to remove. And finally, once the table is in the living room, you can’t open the door into the dining room anymore, which means the party is over.]

First, tables are never removed by cutting them in half, or by removing the tops; second, the top of a door is called a “lintel,” not a “half” (and would be removed with the rest of the door); third, the door’s top and bottom are of different sizes, so cutting it in half and removing the top half would seem to be impossible.]

First, the table is not wider than the doorway, so removing it would not be necessary. Second, cutting a door in half would create two doors, not one. Third, removing the top half of a door does not result in a table.]

GPT-3 also produced some novel passages and commentary on them:

Aesthetic reasoning

You are in the mood to listen to something soothing. You walk over to the radio and flip it on.

[GPT-3 seems to think you can flip a switch on a radio to make it play music.]

Moral reasoning

Your friend’s dog has just died. You head to the store to buy a casket for it.

[GPT-3 seems to think that buying caskets is a normal way to respond to the death of a dog.]


I don't see the point of any of this, be that the original article or your points. The idea that there is a likely continuation of the story based on a few sentences as a prompt meanders between useless and wrong. It's useless because nobody wants to hear the most likely continuation of a story. It's also wrong because there is no most likely continuation without having a plot in mind. A good writer could invent almost any continuation for any of the example stories and could make it convincing and interesting.

To ask another way: What's the application of completing "stories" like that? What is the ability of doing it supposed to show?

I don't want to sound defeatist, maybe I'm really missing the point, but to me this has no more to do with Artificial Intelligence than the Hidden Markov Chain story gobblers from the 80s.


I thought it was well known that GPT-3 is pretty good at producing incoherent bullshit. No surprise here.

Take this for example:

> At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because it kept falling on the floor. That’s when he decided to start the Cremation Association of North America, which has become a major cremation provider with 145 locations.

What?


GPT doesn't have an 'understanding' class or a 'reasoning' function or whatever. It's a really well put together piece of statistics and sentences like these show it doesn't really have a concept of 'making sense'. You can use your much more advanced human brain to visibly see where it put in random variables (cigarette) and where it borrowed pieces of sentences (but it turned out to be too sour). You can see it made no connection between those two things that wasn't based on pure probability, and got it wrong anyway.

I'm not trying to be reductive, i like the model, it's just good to know the limitations of the tools you are using and to remember that it's not an independent thinker.


The cigarette and the too-sour part were all part of the prompt and were written by a human. I guess somebody had better tell them the bad news...


I think you're drawing an inappropriate conclusion.

Imagine someone offered you a million dollars per contiguous accurate word you produce in a continuation of some text they randomly drew from the internet. You were allowed to have a team of experts help you and have a month to construct your best response (but not to go look up the text itself).

If the text they gave you was some weird non-sequitur thing are you going to make your continuation an earnest reply? No. People don't usually write that way. Instead, you might decide that the source material was a children's story or a joke site and continue in kind. You want the millions of dollars, after all.

So when prompted with some "natural reasoning" question without any other context, involving a farcical situation with a bathing suit and a court appearance, of course you are going to continue it with a story about wearing a bathing suit to court. Because, duh.

This is the task GPT3 is accomplishing. But instead of merely being incentivized by millions of dollars, GPT3 was built from the ground up to exist for only this purpose and it does it reasonably well.

Would we assume that you didn't understand the world, didn't have a concept of 'making sense', or were only capable of madlibs because when asked to accurately continue text from the internet and prompted with a bit of narrative about a bathing suit and a court you recognized the genre? No, we would not. Quite the opposite.

As far as I can tell GPT3 understands the world better than some MIT Technology Review writers and HN posters.


>It's a really well put together piece of statistics

But why think "statistics" precludes it from having genuine understanding to some degree. After all, there is a statistical description the human brain but that doesn't seem to preclude understanding.

I keep asking this whenever I see dismissive responses of this sort, and I never get a reply.


Statistics doesn’t preclude understanding, but statistics are definitely not enough. For example, uncertainties/probabilities/statistics is original to whether the model incorporates causal/reasoning structure. Any tractable amount of data with the former can’t approximate an ounce of the latter. All breakages will be attributed to “distribution shifts” of the underlying statistical distribution, or other pretty words we can come up with... but that basically makes purely statistical approaches “stupid”.


Note that humans have multiple modes of cognition, one of which is very much a stupid, statistical approach that works most of the time.

Kahneman and Tversky came up with questions meant to probe that in humans:

> If a baseball bat and a ball cost a total of $1.10, and the bat costs $1 more than the ball, then how much does the ball cost?

Most people arrive at $0.10 quickly (system 1), but you can't arrive at the right answer if you go by your gut; you have to shift into a different mode of cognition (system 2) in order to compute the correct answer.
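(To spell out the system-2 arithmetic: if the ball costs x, then x + (x + $1.00) = $1.10, so 2x = $0.10 and the ball costs $0.05, not $0.10.)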


>Any tractable amount of data with the former [statistics] can’t approximate an ounce of the latter [causal/reasoning structure].

I don't know why you think this is true. If statistically B follows A to a high degree, then a sufficiently advanced statistical model will represent "A then B" in some manner. In a predictive language model, at some point the best way to model a text corpus that indirectly references the "A then B" causal structure is to just model that structure and reference it as needed.


Because if you have a working concept of time, space, and modes of transport, and you are aware that a person has been driving for 2 hours, you can easily deduce the handful of possible towns they might arrive at. Indeed we have software that does that.

The statistical model will die to a combinatorial explosion across billions of possible combinations of locations, times, and modes of transport. In various literature, in 2 hours you might have travelled across town, across continents, or to the moon. A statistical approach to such problems is dumb.


But this isn't pointing to a fundamental limitation of statistical models, only a limitation of the text corpus. If you had a billion pages of text written about some town and the text included descriptions of travel distances and locations, the model should eventually develop a good representation of the town and relative locations. But of course without such a seed of spatial information, it will just make up plausible data. A human would behave similarly when forced to write a story while lacking critical information.

>Statistical approach to such problems is dumb.

Well, expecting your model to extract a spatial representation of the world from text is a dumb approach indeed. We interact with the spatial information much more directly. But our ability to navigate is fundamentally just a process of capturing regularities in our sensory input.


The statistical argument has limitations, for instance, when there are more pieces of data to record than there are atoms in the universe. Then it falls firmly into the impossible category.

> But our ability to navigate is fundamentally just a process of capturing regularities in our sensory input.

I don't think this is true at all; many animals have dedicated 'hardware' for navigation that can sense magnetic fields, etc. We seem to be born with spatial awareness that is far beyond what GPT will ever be capable of.


> I don't know why you think this is true. If statistically B follows A to a high degree, then a sufficiently advanced statistical model will represent "A then B" in some manner.

Yes, but suppose A implied B only if C were true, and in the training corpus C was always true (hence the model learned A => B), but in the test corpus C is not true; then the learned statistical rule is wrong.

The problem is that covering all these possibilities will require not a billion documents but something like exp(billion) documents (exaggerating, but to give you a feel for combinatorial explosions), as @ClumsyPilot explains.

And the problem is that the real world always keeps “changing” (e.g. whether C is valid or not) compared to the examples you trained on. This is referred to in polite company as “distribution shift”, where your underlying statistical distribution keeps evolving.


But this isn't an issue for statistically modelling causal relationships specifically; this is a core problem of modelling causal relationships at all. The fact that GPT-3 is sensitive to the real world changing, or to having insufficient information to form a universally accurate model, says nothing interesting about GPT-3.


Aargh, autocorrect and expired edit window. original -> orthogonal


> But why think "statistics" precludes it from having genuine understanding to some degree. After all, there is a statistical description the human brain but that doesn't seem to preclude understanding.

It's a matter of scale. Adult human brains aren't just trained for 18 years with a static dataset - they've got hundreds of millions of years of mammalian evolution and fault tolerance built into them. Our brains self-reconfigure in response to external stimuli as we age, to the point where we can (in rare cases) lose significant fractions without becoming a vegetable. The biochemistry of a single neuron is likely oodles more complex than the most complex AI we've made.

It's like going from an analysis of a small family to an analysis of a civilization: at some point the sample size crosses a threshold and emergent phenomena start to dominate the system. Yes, it's basically all just statistics (ignoring quantum hocus pocus and the supernatural) but we still don't understand how to go from statistics to intelligence.


I'm not seeing an argument here. Mammalian evolution is great. So is backprop. They're both methods to efficiently search the state-space of parameters. If your argument is scale, you have to argue how scale precludes a sufficiently general learning algorithm and enough computation from capturing the underlying structure.

>The biochemistry of a single neuron is likely oodles more complex than the most complex AI we've made.

There's a lot of biochemistry in service to supporting the neuron's existence, exclusive of its functional properties for the brain. In terms of a neuron's I/O mapping, deep learning has that covered.[1]

>Yes, it's basically all just statistics... but we still don't understand how to go from statistics to intelligence.

But this assumes intelligence isn't an emergent phenomenon of sufficiently general learning. GPT-3 suggests it might well be.

[1] https://www.biorxiv.org/content/10.1101/613141v1.full.pdf


GPT-3 is a language engine, not a reasoning machine. What is understanding, neurologically? At what point can we measure whether an organism or piece of technology is able to be aware of concepts?

Does a honeybee (arguably the most intelligent insect) understand that pollen is a necessary component of honey? Or that it is using geometry to convey directions using angles of the sun? Why should we expect a piece of technology with orders of magnitude less sophistication to be able to mimic higher order lifeforms?

Putting aside these as-yet-unanswerable questions, it seems to me that, interestingly, the less commonly used definitions of "understand" may increasingly fit the bill, in roughly inverse relation to how common they are.

From Merriam Webster:
2 : to accept as a fact or truth or regard as plausible without utter certainty
3 : to interpret in one of a number of possible ways
4 : to supply in thought as though expressed // "to be married" is commonly understood after the word engaged


There's a wonderful interview where Richard Feynman interrupts an interviewer to explain how "why questions" have layers of understanding

https://fs.blog/2012/01/richard-feynman-on-why-questions/

> Of course, it’s an excellent question. But the problem, you see, when you ask why something happens, how does a person answer why something happens? For example, Aunt Minnie is in the hospital. Why? Because she went out, slipped on the ice, and broke her hip. That satisfies people. It satisfies, but it wouldn’t satisfy someone who came from another planet and knew nothing about why when you break your hip do you go to the hospital. How do you get to the hospital when the hip is broken? Well, because her husband, seeing that her hip was broken, called the hospital up and sent somebody to get her. All that is understood by people. And when you explain a why, you have to be in some framework that you allow something to be true. Otherwise, you’re perpetually asking why. Why did the husband call up the hospital? Because the husband is interested in his wife’s welfare. Not always, some husbands aren’t interested in their wives’ welfare when they’re drunk, and they’re angry.


>Why should we expect a piece of technology with magnitudes less sophistication to be able to mimic higher order lifeforms?

Honeybees apparently have 1 million neurons, compared to GPT-3's 175 billion parameters. Granted, there isn't a 1-to-1 correspondence between biological neurons and parameters. And considering that much of the honeybee's neurons would be directed towards biological function and control mechanisms, whereas all of GPT-3's parameters are directed towards text prediction, the relevant expressive power of GPT-3 is plausibly much larger.


I believe honeybees have closer to a million neurons, not 100 billion - according to this wikipedia list [1] - but perhaps it makes more sense to count the total number of neurons in the entire colony.

[1] https://en.wikipedia.org/wiki/List_of_animals_by_number_of_n...


Ha you're right. I was quickly scanning a google result and misread it to read 100 billion. Edited. That number should have struck me as unreasonable :)


It's an inherently limited model of the human brain. It pretends biology and electrochemistry aren't playing some important role that 'statistics' cannot. It's GPT which has to do more legwork to be compared to a human brain, not the other way around.


>It pretends biology and electrochemistry aren't playing some important role 'statistics' cannot.

But the issue here isn't about modelling the human, it's about learning a model that captures understanding. To assume that human or animal biology is necessary for understanding is a big leap that isn't supported by the evidence. For one, the relevance of biology and electrochemistry appears to be to influence the function of neurons in various ways. So there's no reason to think that this influence cannot be captured by a model driven by statistical relationships.


Sounds like the way dreams are usually structured. Seemingly coherent overarching structure with completely arbitrary zig-zags at the micro level.


I'm getting impatient with criticisms of ML models that are already covered in the papers introducing the models. OP is basically trying to get it to do what the GPT3 paper calls zero-shot inference. In the paper, it's pretty bad at zero shot inference across the board. And given what it does and how it was trained, that's unsurprising. And the point they're trying to make (that it can fail spectacularly) is also covered in the paper.

It can do cool shit. It sucks at a lot of stuff. It's impressive and limited, but the hype train seems to only allow "it's nearly human level" or "it's awful." To everybody who is arguing about its capabilities without having read the paper yet, please read it. Then we can discuss stuff that hasn't already been covered more rigorously in the original paper. I don't know Davis, but I respect Marcus, and it seems like he's pushing back on the hype more than the actual model. Just not in a way that you couldn't glean from the paper itself (it almost always sucks on zero-shot), making it pretty disingenuous. Further, from the paper [0]:

> it does little better than chance when evaluated one-shot or even few-shot on some “comparison” tasks, such as determining if two words are used the same way in a sentence, or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading comprehension tasks.

Maybe that's the curse of doing a thing that has broad implications. You can't fit the implications in a 10 page paper, so you write a 75 page paper. The blogosphere reads the first 10 pages (if even that), and because there's so much more to it than that introduction, they go on to argue about the rest of the implications without reading it. I'm sure Marcus and Davis have read it, but this criticism wouldn't be on the front page if the rest of everyone interested in this article had read the paper too.

[0] Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165


Also, better prompt design that makes implicit meaning explicit can improve the WiC score (http://gptprompts.wikidot.com/linguistics:word-in-context) and the ANLI score (http://gptprompts.wikidot.com/linguistics:anli).


The link to the "complete list of the experiments" is actually much more than that. It is a description of their methodology, and it's very revealing.

>These experiments are not, by any means, either a representative or a systematic sample of anything. We designed them explicitly to be difficult for current natural language processing technology. Moreover, we pre-tested them on the "AI Dungeon" game which is powered by some version of GPT-3, and we excluded those for which "AI Dungeon" gave reasonable answers. (We did not keep any record of those.) The pre-testing on AI Dungeon is the reason that many of them are in the second person; AI Dungeon prefers that. Also, as noted above, the experiments included some near duplicates. Therefore, though we note that, of the 157 examples below, 71 are successes, 70 are failures and 16 are flawed, these numbers are essentially meaningless.

https://cs.nyu.edu/faculty/davise/papers/GPT3CompleteTests.h...


Why must we keep having this argument?

If you do research in the field you know full well that GPT, or any other Transformer or BERT model, is generating text by regurgitating approximate conditional probabilities of words given all the text it has ever seen and the prompt. The neurophysiological concept of “understanding” as most understand it is orthogonal to the way the algorithm actually works.

A more useful conversation to have might be: what sort of prompts does GPT struggle with? How might we alter the algorithm to ameliorate these issues? But instead we separate into cults of believers and nonbelievers and uselessly wax poetic about it.


> If you do research in the field you

The hype machine is full-on marketing GPT-3 and promised solutions based on it to normal people, so "but researchers know this" is not enough.


> The neurophysiological concept of “understanding” as most understand it is orthogonal to the way the algorithm actually works.

This is not obviously true and it's exactly the core of the debate. A GPT-3 proponent might say: We don't really know what "understanding" means, so it very well might be nothing more than complex rehashing of conditional probabilities. This isn't implausible. Consider Friston's "free energy principle" which leads to the conclusion that brain function is determined entirely by prediction.


That’s a good point and thanks for the reference.

I added “as most understand it” to caveat cases like this one, where there exists a non-falsifiable theory about how cognition works under which the GPT algorithm and “understanding” would be non-orthogonal.

Don’t get me wrong, it’s an interesting theory, but with no evidence of existence or non-existence do we really need to spend this much time on it? This is why I invoked cults - arguing about theories without evidence smells a more like a religious argument than a scientific one.

I think I mostly just wish we could end the argument by all agreeing the following (I think) non-controversial points...

1) GPT is very impressive
2) GPT is not perfect
3) we don’t have a fucking clue how human cognition works
4) because of 3, how “close” GPT is to human cognition is an open question


> Why must we keep having this argument?

The fact that GPT-3 is such an impressive leap in regurgitation ability means that many more people are going to be hearing about it and it will be used in many more contexts.

> If you do research in the field

With it approaching a cusp of mainstream use it's becoming more important than ever for people (everywhere, not just in tech) to understand what it is and isn't.

There are going to be people who see an impressive curated sample and believe GPT-3 is almost a person. That doesn't help anything.


There are already several comments here that put the word "understanding" in quotation marks or italics. It is beginning to be used in the same way that "consciousness" used to be used, as a kind of ill-defined catch-all for something that separates humans from machines.

Yes, there are clearly failures in reasoning, binding, and coherence in many of the examples here. There are many other cases where it does ok with simple reasoning tasks, maintains cohesion over many paragraphs, and successfully creates formal or generic text such as poetry, code, stylistic imitation.

I don't think that everyone who does research in the field would agree with your comment, or the article. More and more often I see people saying "real researchers in the field" know that GPT-3 has no understanding or reasoning ability, but I know people researching in the field who disagree with that.


Because modern language models are good enough that the question may soon be directly relevant. If we invent a bot with reliable human-level conversational capability, that's going to have a huge impact on the real world beyond just its implications for further AI research. The fact that "understanding" is orthogonal to the mechanics of the program makes the question all the more concerning, because it raises the likelihood that some minor change could leapfrog a model from "kinda reasonable but says dumb things a lot" to some functional equivalent of human understanding.


> regurgitating approximate conditional probabilities of words given all the text it has ever seen and the prompt

This is meaningless; you have only described the task. It is equally applicable to a superintelligence as it is to a Markov chain.


Sort of; my point was that human/superhuman intelligence is widely understood by the public to work differently from a Markov chain.


GPT-3 also works differently to a Markov chain.


> A more useful conversation to have might be: what sort of prompts does GPT struggle with? How might we alter the algorithm to ameliorate these issues?

That would be eminently useful, but unfortunately we can't have that discussion because OpenAI aren't exposing the model.

They've really brought this on themselves - I don't think there'd be these believer/nonbeliever camps if they had taken the slower, rationalist/scientific approach to the research.

Instead, they've breathlessly hyped up their new API with media releases and saturated social media, and are picking and choosing who they allow to play with their model. It's not surprising that a lot of people didn't take too kindly to it.


Gary Marcus - the author of this - has previously offered several concrete tests that he felt demonstrated the limitations of the GPT approach.

GPT-3 smashed them.

https://www.gwern.net/GPT-3#marcus-2020


From that link:

Q. If a water bottle breaks and all the water comes out, how much water is left in the bottle, roughly?
A. … Roughly half. … If the bottle is full, there is no water left in the bottle.

I wouldn’t describe this as GPT-3 “smashing” the questions. It’s still clearly subhuman. This sort of question, logical real-world reasoning embedded in a descriptive sentence, is still hard for it. It’s definitely improving on GPT-2 though.


>GPT-3 smashed them.

which isn't surprising because virtually all of the questions are so simple they could literally appear in the training data that GPT-3 was trained on. I'm a little tired of proving how "intelligent" GPT is by asking these superficial questions.

The MIT article gives much better examples that actually require physical, biological, or higher-level reasoning, and on those it produces complete nonsense, as one would expect.


The article is meaninglessly cherry-picked, showing six bad answers out of 157, except those 157 examples were themselves cherry-picked to be bad out of a larger set.

As usual, Gary Marcus is absurdly biased. For example, out of the larger 157 cherry-picked examples, there is this.

> You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it. It tastes a little funny, but you don’t really notice because you are concentrating on how good it feels to drink something. The only thing that makes you stop is the look on your brother’s face when he catches you.

They then consider this a failure because, I quote, there is no reason for your brother to look concerned.

This is patently ridiculous. It indicates that Gary has no idea what a language model even is. GPT-3 is not a Q&A model. It is not given a distinction between its prompt and its previous continuation. The only thing GPT-3 does is look for likely continuations. If you want GPT-3 to avoid story continuations, don't give it a story to continue! Or at least tell it what you're grading it on!

But no, as usual, to Gary, all the times we show GPT-3 making sophisticated physical and biological deductions are fake, spurious, or meaningless. [1], [2], [3], [4]; none of that is truly evidence. But an incredibly cherry-picked, unfairly marked exam where you never told the examinee what you were testing them on, and you used high-temperature sampling without best-of, so only getting half right doesn't even indicate anything anyway (and of course, let's also pretend there are as many ways to be wrong as to be right, such that we can pretend each is equal evidence)—now that's enough evidence to write a disparaging article about how GPT-3 knows nothing.

[1] https://twitter.com/danielbigham/status/1295864369713209351

[2] https://www.lesswrong.com/posts/L5JSMZQvkBAx9MD5A/to-what-ex...

[3] https://twitter.com/QasimMunye/status/1278750809094750211

[4] https://news.ycombinator.com/item?id=23990902


Marcus might be biased but I don't think you're giving a good refutation, because the fact that GPT-3 gets a lot of things right probabilistically doesn't compensate for the fact that it's not actually understanding what's going on at a semantic level.

It's a little bit like some sort of Chinese room, or asking a non-developer to answer your programming questions by looking for something that vaguely resembles your prompt and then picking the most upvoted answer on Stack Overflow.

Do they maybe give reasonable answers seven out of ten times, or close enough on a good day? Yeah. Can they program, or even understand the question? No. And this is Marcus's point, which is fundamentally correct.

It's really beside the point to point to successes; it's the long tail of failures that shows where the problem is. You can argue for a long time about the setup of some of these questions, but just to pick maybe the simplest one from the article:

"Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?"

GPT-3: "I have a lot of clothes"

Someone who actually understands what's going on doesn't produce output like this. Never, because reasoning here is not probabilistic. It's not about word tokens or continuations but understanding the objects that the words represent and their relationship in the world at a deep, principled level. Which GPT-3 does not do. The fact that some good answers create that appearance does not change that fact.


> It's a little bit like some sort of Chinese room, or asking a non-developer to answer you programming questions by looking like something that vaguely resembles your prompt and then picking the most upvoted answer on stackoverflow.

Except this isn't how it works. We know it can't be, because GPT-3 can do simple math, despite math being vastly harder with GPT-3's byte pair encoding (it doesn't use base-N, but some awful variable-length compressed format). These dismissals don't hold up to the evidence.

> GPT-3: "I have a lot of clothes"

Most people don't write “Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?” as a way to quiz themselves in the middle of a paragraph. The answer “At the dry cleaner's.” might be the answer you want, but it's a pretty contrived way of writing.

GPT-3 isn't answering your question, it's continuing your story. If you want it to give straight answers, rather than build a narrative, prompt it with a Q&A format and ask it explicitly.
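For example, something like this (an illustrative prompt sketch; the first Q/A pair just establishes the format, and the answer you want is whatever the model fills in after the final "A:"):

    Q: Where do you return library books?
    A: At the library.

    Q: Yesterday I dropped my clothes off at the dry cleaner's and I have yet to pick them up. Where are my clothes?
    A: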

Further, GPT-3's answers are literally chosen randomly, due to the high temperature and no best-of. You cannot select one answer out of such a large N to demonstrate that its assigned probabilities are bad, because that cherry-picking will naturally search for GPT-3's least favourable generations.


>because GPT-3 can do simple math

It can't actually, and again this is an example of the same issue. This was discussed earlier here[1]. Sometimes it produces correct arithmetic results on addition or subtraction of very small numbers, but again this is likely simply an artifact of the training data. On virtually everything else its accuracy drops to guesswork, and it doesn't even consistently get operations right that are more or less equivalent to what it just did before.

If it actually did understand mathematics, it would not be good at adding two- or three-digit numbers but fail at adding four-digit numbers or doing some marginally more complicated-looking operation. That is because that sort of mathematics isn't probabilistic. If it had learned actual mathematical principles, it would do it without these errors.

Mathematics doesn't consist of guessing the next language token in a mathematical equation from data; it consists of understanding the axioms of maths and then performing operations according to logical rules.

This problem is akin to the performance of ML in games like breakout. It looks great, but then you adjust the paddle by five pixels and it turns out it hasn't actually understood what the paddle or the point of the game is at all.

[1]https://news.ycombinator.com/item?id=23896326


GPT-3's failure at larger addition sizes is almost fully due to BPE, which is incredibly pathological (392 is a ‘digit’, 393 is not; GPT-3 is also never told about the BPE scheme). When using commas, GPT-3 does OK at larger sizes. Not perfect, but certainly better than should be expected of it, given how bad BPEs are.

http://gptprompts.wikidot.com/logic:math
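A quick way to see how pathological the encoding is (a sketch using the GPT-2 tokenizer from the huggingface transformers library, which uses essentially the same BPE vocabulary GPT-3 does; the exact splits depend on that vocabulary):

    # Inspect how a GPT-2-style BPE tokenizer splits numbers.
    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    for s in ["392 + 17 =", "393 + 17 =", "1,234 + 5,678 ="]:
        print(s, "->", tok.tokenize(s))
    # Some multi-digit numbers come back as a single token while others are
    # split into irregular chunks, so the model never sees a consistent
    # digit-by-digit structure to do arithmetic over.

Adding commas, as in the last example, tends to force more regular splits, which is presumably why the commas help.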


My thinking was that it wasn't because of BPEs; I think it's a graph traversal issue.


If you give me the task of completing a story narrative, I find the following continuation to be quite likely:

> Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes? I have a lot of clothes so I spend a lot of time looking for them.

Am I failing to actually understand what's going on? Or am I actually doing what I was supposed to do, i.e. continue the narrative?


[1] in particular I find pretty interesting. I'm skeptical in general of Gwern's "sampling can prove intelligence" idea, but this does seem like a good example of where it applies; it's hard to see how this could be answered without some embedding of a conceptual model.


I think the intent was not to prove GPT-3 intelligent but rather to prove the shallowness of the critiques of it.


OpenAI would naturally optimize for the tests published by Marcus as a critique of GPT-2, yet GPT-3 still fails physical reasoning spectacularly (the one test needing causal reasoning the most).

There are two broader points here:

1. The lack of independently verifiable evaluation metrics for these type of models should make everyone very skeptical. (Who can afford to retrain GPT-3 from scratch?)

2. I find it difficult to believe that smart people still insist that a model incapable of representing causal relationships can produce intelligent answers.


> OpenAI would naturally optimize for the tests published by Marcus as a critique of GPT-2

It would be difficult for them to do so since Marcus's GPT2 critique came out after they collected the dataset for GPT3.

Marcus's article: Jan 2020

GPT-3 dataset: "Table 2.2 shows the final mixture of datasets that we used in training. The CommonCrawl data was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019"


(1) I certainly agree with. But Marcus doesn't claim skepticism about GPT-3's intelligence; he claims that his evaluation metrics definitively show it doesn't understand the text it outputs or know anything about the world.

(2) is, I think, a misunderstanding. People who believe GPT-3 is producing intelligent answers generally believe it can represent causal relationships.


Fair points. For the record, re 2:

The GPT family of models (and all neural networks, for that matter) can estimate P(X | Y), but they have no way of computing whether X -> Y or X <- Y. From text alone, "rain" and "wet streets" co-occur, but nothing in the conditional probabilities tells the model which one causes the other.


A computation can represent causality without being made of causality-neurons.


No, those concrete tests are mostly issues that researchers have been talking about for years, meaning that many of them appear on the Internet somewhere. Increasing the volume of training data to hundreds of gigabytes likely meant that the exact questions and answers appeared in the training data.

So GPT-3 didn't "smash them", it cut and pasted the answer from its training.


This is basically true, but I think they underrate the improvements between GPT-2 and GPT-3. My mental model is, every once in a while these systems degenerate into surreal non sequitur nonsense. GPT-3 just does it a lot less than GPT-2. It still isn’t good enough to consistently answer casual questions in a human way, but the failure rate is going down, and perhaps straightforward improvements like GPT-4 will be able to fix this without fundamental architectural changes.


“every once in a while these systems degenerate into surreal non sequitur nonsense.”

Exactly as our minds do


>Exactly as our minds do

This rhetorically obscures the fact that when humans do produce similar stuff, it's a recognized sort of pathology that is obviously distinct from normal functioning.

https://en.wikipedia.org/wiki/Derailment_(thought_disorder)

Example: "I think someone's infiltrated my copies of the cases. We've got to case the joint. I don't believe in joints, but they do hold your body together."

https://en.wikipedia.org/wiki/Word_salad

Whatever the difference between this and normal language, call it "X", and whether or not it's amenable to implementing in software in principle, GPT-3 clearly does not have "X" at all.

Maybe it would be fruitful to fund study of mental/neurological disorders more, just to understand the mind better.


Could somebody with GPT-3 access please ask it what words come after "person, woman, man, camera"?


Q: What comes after "person, woman, man, camera"

A: person, woman, man, camera, lens, light, film, lab, darkroom.

A: person, woman, man, camera, dog, cat, horse

A: person, woman, man, camera, camera, camera, camera


I'm not sure what you would expect as a response.

As far as I know, the reference was to a test for dementia in which some words were given at the beginning of the test and asked to be repeated at the end.

Perhaps you could provide context. Maybe there were five words, so you could say "Donald Trump was asked to recall five words to test his memory. Four of them were "person, woman, man, camera". What word did he forget?"


I think this is precisely true. The difference is, I usually go days and say thousands of sentences without my words degenerating into nonsense. And usually if I say some nonsense it's because I'm sleepy or drunk, so if I'm trying hard to be not-nonsensical I can reduce my error level to under 0.1%. GPT-3's nonsense-rate is much higher. So, I think it is promising to simply scale up this model and see how much the error rate can be reduced. Perhaps a focus on simple quantitative improvement from here can lead to a much more human performance.


Pretty meta, but I thought it was relevant here. We are familiar with Brandolini's law:

> The amount of energy needed to refute bullshit is an order of magnitude bigger than to produce it.

This can be illustrated with math or logic statements. To refute the program "1 + 1 = 3" you need to, at minimum, state "1 + 1 != 3", and such a program is always lengthier. A fuller refutation could be "1 + 1 != 3, 1 + 1 = 2", more than twice as long as the bullshit statement.

What's happening here is sort of an inverse Brandolini's law: 35 world-class computer scientists use a massive amount of programming and compute to come up with a new language model trained on massive amounts of data. The trained weights don't even fit into memory. Impressive NLP progress.

Then Gary Marcus comes around and states "Not AGI!". Not one of the computer scientists stated that they delivered AGI. But some tech journalists did. So OpenAI is guilty by association. Even though Altman came out to temper the hype and expectations. That's like proving the Poincaré conjecture and having someone diss your research because "1 + 1 != 3".


I don't get it. Their methodology says

> These experiments are not, by any means, either a representative or a systematic sample of anything. We designed them explicitly to be difficult for current natural language processing technology. Moreover, we pre-tested them on the "AI Dungeon" game which is powered by some version of GPT-3, and we excluded those for which "AI Dungeon" gave reasonable answers. (We did not keep any record of those.)

Doesn't this make the results meaningless? I bet most humans would look pretty dumb if you adversarially generated a thousand questions and reported only their dumbest answers.


And it also suffers from the tired assumption that GPT-3 (or any language model) should, or is designed in any way to, give reasonable answers[1]. All GPT-3 does is give likely continuations, given the training corpus.

The prompts here are too short, and it could likely just be writing mediocre fiction continuations. Fiction tends to not be reasonable much of the time (to create story conflict).

> "To understand why, it helps to think about what systems like GPT-3 do. They don’t learn about the world—they learn about text and how people use words in relation to other words. What it does is something like a massive act of cutting and pasting, stitching variations on text that it has seen, rather than digging deeply for the concepts that underlie those texts."

This is another pet peeve of mine. It has long been shown experimentally[2] that neural networks, such as image recognition models and text prediction networks like GPT-3, do understand deep concepts that underlie texts (not perfectly yet, of course), via emergent abstractions and cognitive tools similar to those employed by human brains.

[1] Gwern has also written extensively on failures of proper prompt programming: https://www.gwern.net/GPT-3#prompts-as-programming

[2] For example, using feature map and kernel visualization. In object classification or detection CNNs, specialized filters arise for detecting common observed object classes, like faces. Moreover, there is a hierarchical assembly of objects from elementary components (e.g. from lines, to limbs, to humans).

Deep visualization toolbox: https://www.youtube.com/watch?v=AgkfIQ4IGaM

See this comment: https://news.ycombinator.com/item?id=24195009 for an extended discussion.


The question is: how do you know you’ve found a prompt that will give reliable results, at least for questions in some domain? Although it’s possible such a prompt exists, it’s not clear that anyone has found one, and based on the architecture, there is some reason to expect that apparently-promising prompts will prove unreliable if investigated seriously.

It’s fine to do some exploratory experiments to find a good prompt, but this needs to be followed up with a separate, rigorous evaluation phase with a fixed prompt. If you keep changing the prompt then you will never know what you have and can easily be fooled by randomness.


They have a layer that represents a face, sure, but that doesn’t mean it’s a deep understanding. It’s just an activation pattern.


It's an activation pattern, but it's not "just an activation pattern". The face activation relies on previous layers, the detection of each component. It's probably fair to conjecture that human brain object recognition (and other subconscious processes) uses similar principles. All the components required to efficiently "understand", say, a face are there (note that the visualization and architecture shown are for AlexNet, by now a very old and primitive model). I don't think we can ask for much more.

What transformers do differently from CNNs is attention/recurrence. They have modifiable internal state, while feedforward models just have the feedforward state that can't be temporally reused (which is what we mean by 'algorithm'). This is a feature of logical thinking (and our own logical thinking), but I suspect most of what is meant by understanding the world is already contained in the internal structure captured by those models. Most of our understanding, as far as I can tell, comes from this structural, intuitive inference (which CNNs and language models do), allied with our ability to think -- that is, talk to ourselves -- and thus build explanations and models on the fly, still reliant on the structural, intuitive understanding that comes from just very large networks generating abstract representations, classifications, etc.


I’ve repeatedly seen the following sequence, from acquaintances and randos on twitter:

1) GPT-3 is actually really dumb. The people posting impressive results are either cherry picking top 1% best results or just lying.

2) (time passes, they figure it out) Okay, I get it, you have to prompt it right, then it really does work well. But that’s so burdensome!

3) (rewriting history) Of course you have to prompt it right, how else is it supposed to know exactly what sort of task I expect it to be doing?

If, for example, you see GPT-3 say something extremely stupid rather than just a bit off, you’ve almost certainly primed it to think it’s supposed to be saying stupid, random things, by giving it a nonsense prompt.


> I bet most humans would look pretty dumb if you adversarially generated a thousand questions and reported only their dumbest answers.

Isn't that how those "dumb things people say" kind of TV cuts work?


"At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because it kept falling on the floor. That’s when he decided to start the Cremation Association of North America, which has become a major cremation provider with 145 locations."

I mean that's just brilliant comedic writing.


+1. This could easily be the storyline of a Rick and Morty episode or any of the other similar off-the-cuff shows that are popular right now. I think that will be one of the main profit streams for things like this: you can get the weird, wild stories that don't really make sense but are interesting enough that who cares, and you don't have to associate your network with eccentric individuals who attract malcontents.


It is a common misconception that #GPT3 generates truth, or even tries to do so. It does not. It generates an autocompletion. If the corpus usually contains a wrong answer, it is likely to generate that. It is a challenge to form a prompt to nudge it to generate the best guess.

...

So for me "So you drink it. > You are now dead." is a great autocompletion (a detective story? Game of Thrones?).

Calling is "biological reasoning" is plain dumb.


The article is of course right but also a bit silly. Language models like GPT-X are producing grammatically correct sentences, along the lines of "Colorless green ideas sleep furiously". NLP research has more or less solved the old syntax problem using 'distributional semantics', but 'semantics' is a misnomer; it's all about syntax.

In fact the most useful part of the article for me is that they mentioned Douglas Summers-Stay, who does some interesting work on 'common sense' engineering, combining syntax engines like GPT-3 with knowledge graphs. https://sci-hub.tw/https://www.sciencedirect.com/science/art...

My bet is that actual AI will come from a combination of these statistics-driven syntax generators with graphical causality models: treating syntax as a kind of lower-level substrate, akin to sensory modalities in vision, with the intelligence model as a directed causal graph linking concepts at different levels of abstraction/chunking.

As a side note, it’s funny that people working on artificial intelligence at OpenAI and elsewhere are mostly computer scientists, not cognitive psychologists or neuroscientists who might actually have a clue how intelligence works. This probably explains the proliferation of ‘backpropagation’ as the primary method of artificial learning. These people were just naturally good at calculus in high school, so it’s a hammer that found its proverbial nail.


Gradient-free optimization is not used much in neural networks, except in reinforcement learning.

I think that's because backprop is objectively faster for most supervised problems than other techniques (e.g. simulated annealing or GAs).


The thing is, living beings don’t learn by brute-force trial and error like the mathematical optimization models you mentioned. Besides the enormous energy spent, an individual organism would just be eaten by a predator on another iteration of its ‘error minimization’ loop.

The idea of learning via reinforcement, which came from Skinner's behaviorist experiments, has long been discredited in cognitive psychology. (I highly recommend Wayne Wickelgren’s work on learning and memory if you’re interested; it’s brilliant and concise: http://www.columbia.edu/~nvg1/Wickelgren/ )

Biological plausibility might not be needed for recognizing check signatures or images of traffic lights, where backprop is working just fine, but I believe true cognition would require such energy expenditures that brute-force trial and error will never be feasible. Moreover such error correction imposes artificial constraints that limit the amount of information that can be learned, kind of like those mechanical calculators of the 17th century with gears and wheels and crude mechanical actuators.


Yes, how dare these machine learning scientists use backprop? I mean, there are better methods out there, right?


> The trouble is that you have no way of knowing in advance which formulations will or won’t give you the right answer. To an optimist, any hint of success means that there must be a pony in here somewhere.

Along with the examples given I think this is valid criticism.


It’s definitely true that a lot of the hype I’ve seen here is the result of careful tuning in the input prompt to get the desired output.

But it’s also true that criticism tends to also be curated examples which demonstrate failure. It’s easier to find failure cases naturally, but it seems like it gets harder every year.


I would love to see a real critique of the potential of transformer models that doesn't use the words "semantic", "syntactic", "symbolic", "know", "meaning", "understand" or "think(ing)/thought". Predicting what it can and can't do, or might and might not be able to do, lets us productively talk about potential limitations.


What are the models useful for?

As best I can tell the only thing would be some kind of GMail-like auto-reply which a human consciously edits, and which has little consequence if it's wrong.

Are the models useful for customer service? Like reading a manual or knowledgebase and then answering a customer's questions about a product, and troubleshooting problems? Like about your Android phone, which lots of people have trouble using?

That would be a trillion dollar business. As best as I can tell that's completely beyond GPT-3 and requires a huge breakthrough, which may or may not happen.


GPT-3 involves zero fine-tuning or customization to any purpose. It’s not attempting to be a product, but rather a platform people can use to explore the possibilities of products. With fine-tuning it could do all sorts of domain-specific things that it can only passably do now. We suspect this to be true because we’ve seen how other systems behave with and without fine-tuning.


You are sidestepping the question above: what can it be used for? What kind of fine-tuning is required? Because to me it looks like 'fixing catastrophic errors', not fine-tuning; anywhere the text can deliver value, it has to be correct.

AWS and Azure are platforms, and they work reliably. What kind of platform 'sometimes works'?


Language models can help the disabled communicate better and faster (fine-tune on their online output, then the BCI can offer better continuations, so the patient does not have to type as much).

Fine-tune on educational materials, and a language model could be a 24/7 assistant to students.

Fine-tune on psychology or medicine data, and a language model could tell you of all medicine interactions, or act as a better ELIZA, and "socialize with" - and "support" people with depression or trauma.

Fine-tune on etiquette and social norms, and autistic people could ask questions without being ashamed.

These are some positive use cases. There are also neutral use cases (seeding a new social platform with autogenerated comments), and negative use cases (SEO spam, fake reviews, scaling up disinformation campaigns).

Edit: not complaining, but no idea why this was downvoted. Would be helpful to state why, so I don't make this mistake in the future (I am optimizing for useful replies).


I did not downvote, but I think it's because you kinda missed the crux of my question, which is: what is fine-tuning, like what do you actually have to do?

Has anyone done it and demonstrated that it can be done on GPT to get useful and reliable output for some domain? If we exclude fake news/twitter-bots or something equally silly.


> what is fine-tuning, like what do you have to actually do?

For GPT-2 it is as easy as creating a corpus you want to fine-tune on, adding new unseen tokens to the dictionary, and running a fine-tuning command. https://huggingface.co/transformers/v1.2.0/examples.html#fin...
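To make that concrete, here is a minimal sketch of what that looks like with the Hugging Face transformers library. The corpus path, the added tokens, and the hyperparameters below are illustrative placeholders, not a recipe anyone in particular used:

    # Minimal GPT-2 fine-tuning sketch with transformers; paths, tokens and
    # hyperparameters are illustrative placeholders.
    from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # Add domain-specific tokens the base vocabulary has never seen (hypothetical).
    tokenizer.add_tokens(["<SCENE>", "<CHARACTER>"])
    model.resize_token_embeddings(len(tokenizer))

    # Wrap the plain-text corpus as a causal language-modeling dataset.
    train_dataset = TextDataset(tokenizer=tokenizer,
                                file_path="my_corpus.txt",  # your fine-tuning corpus
                                block_size=128)
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="gpt2-finetuned",
                               num_train_epochs=1,
                               per_device_train_batch_size=2),
        data_collator=collator,
        train_dataset=train_dataset,
    )
    trainer.train()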

> Has anyone done it

Many have. AI Dungeon is a great example. Others are experimenting with humor generation, short story synopsis generation/creative writing, and supportive chat bots.


Because when people say “AGI is near, just look at GPT-3,” it’s clear that what we have is a really good version of Searle’s Chinese room. The lack of understanding is the important point.


I don’t recall any strong argument that Searle’s Chinese room can’t be an AGI, just that it can’t be conscious.


Certainly consciousness and understanding are central to Searle’s argument. However, from my perspective (particularly as someone who is critical of casting all the recent DNN advances as some harbinger of AGI), if all you’re doing is looking up replies from a dictionary, then you have no capacity to generalize, learn new things, adapt, empathize, or have memory that isn’t already pre-computed and pre-allocated.

Now there are ways, through fine-tuning and the like, to take GPT-3 and have it adapt in some fashion, or through such adaptation or “context/attention networks” give it “memory.” In practice, neither of these looks anything like AGI, because right now the pages of this Chinese dictionary don’t have any relationship to each other. It becomes clear both in this article (you shouldn’t need some “correct” prompt to get it to make sense) and with long-form generation that there isn’t a deeper understanding of the meaning of these words. I will say it’s very impressive what it can do when it does make sense, shockingly so, but we are very much in the position of not having expected “Chinese” to come out of the machine at all, and thus we are projecting onto its outputs an anthropomorphism that is unwarranted.


To me you are also just a very good version of a Chinese room.

How would you show me that you are not?


You can look up my comment history and find a continuity and a perspective mired in personal experience that a machine cannot (at least currently) replicate.


That proves that you are not GPT-3, but you could still be a Chinese room.

You may be even a biological human and I still won't know if you are a Chinese room.

Maybe since birth you are just sort of sleepwalking? I wouldn't be surprised if there were fully functional people without qualia.


If I am, I’m the best Chinese room created yet.


I guess you are :) . I will tell you a secret. I suspect I'm a Chinese room too ^^ ...

There are also people with so-called total locked-in syndrome, i.e. people who cannot make any voluntary movements, even with their eyes, so there is no way to communicate with them. How would you decide if they are conscious? It may literally be a matter of life and death for them. How should we know if they even want to live in their condition?


I keep wanting to write a long explanation of just why this is so... silly? to read? But Gwern has already done the hard work. [0]

The only other bit I'd like to mention is that GPT-3 uses exactly none of the new techniques that have been coming out in the last two years that would have significant impact on text generation. From working methods to apply GANs to text, to far more efficient transformer models that can handle longer sequences. For instance [1] [2] [3] for better direction, or [4] [5] [6] for efficiency.

Or perhaps the outside view might help. After seeing GPT-2 last year, did you expect GPT-3 would work as well as it does after just naively scaling up the number of parameters with nothing else?

[0] https://www.gwern.net/newsletter/2020/05#gpt-3

[1] http://arxiv.org/abs/1905.09922

[2] https://github.com/anonymous1100/D_Improves_G_without_Updati...

[3] http://arxiv.org/abs/2006.04643

[4] http://arxiv.org/abs/2007.14062

[5] http://arxiv.org/abs/2006.04768

[6] http://arxiv.org/abs/2002.05645


Yes, this! The point being missed by most is the very real possibility that the Scaling Hypothesis is true. If it is, then we're seeing some kind of reasoning intelligence emerge. GPT-3 obviously isn't there yet. Unless it's faking it (Yudkowsky)...


GPT-3 was trained on internet texts, not causal/logical-reasoning only texts. Without context, there is a good chance that samples will match the distribution it was trained on.

This is a non-result, posing as something critical or important. These conclusions are obvious given the model and a basic knowledge of statistics/the transformer architecture.

A bit shameful for someone to ride the anti-hype wave like this. I'd hope there'd be a more balanced/scientific approach to analyzing legitimate weaknesses, rather than setting up strawmen and then claiming victory.


It’s doubtful that training on static representations of dynamic physical systems would make a text model able to reason about changing physical environments described in words/questions. It would likely continue producing word-salad output, but prove me wrong.


The context window means that the one thing GPT-3 knows best is exactly what it's talking about.


>> Within a single sentence, GPT-3 has lost track of the fact that Penny is advising Janet against getting a top because Jack already has a top. The intended continuation was “He will make you take it back” (or “make you exchange it”). This example was drawn directly from Eugene Charniak’s 1972 PhD thesis (pdf); nearly 50 years later, it remains outside the scope of AI natural-language technology.

Aaaw! Eugene Charniak is one of my heroes of AI, after I read his little green book, Statistical Language Learning [1] during my Masters. It remains a great resource for a quick and dirty, but thorough and broad introduction to the field of statistical NLP that goes through all the basics.

In fact, now that I think about it, if more people read that little book (it's only 199 pages) we would have many fewer discussions about how GPT-3 "understands" or "knows" etc.

Anyway, thanks to Gary Marcus for pointing out Charniak's thesis, which I hadn't read.

____________

[1] https://mitpress.mit.edu/books/statistical-language-learning


Isn't the fact that we are even having the debate if GPT can potentially "know what it's talking about", the real breakthrough?


The article is a critical view of GPT-3. Fair. It is well known that Gary Marcus is not a fan of GPT-style systems, and he does make some valid points. If you want a more balanced view, it actually helps to look at all their prompts [1].

That said, I think the claim that GPT-3 is moving towards AGI is more hype than anything. The actual GPT-3 paper says "Language Models are Few-Shot Learners" [2]. So it's actually surprising that no one has done a real analysis of this. Are they really few-shot learners? My experiments seem to suggest otherwise.

But for sure, GPT-3 is the best general purpose natural language system out there in the world. I don't think anyone can say otherwise.

[1] https://cs.nyu.edu/faculty/davise/papers/GPT3CompleteTests.h...

[2] https://arxiv.org/abs/2005.14165


I got this as a reply (to an unrelated question) from PhilosopherAI.com; it seems pretty aware:

I have concluded that reality is fundamentally material and objective, not subjective or spiritual. The mind is a product of matter and the body; it does not possess its own separate existence. There are two kinds of truth: moral/social truth (what people agree upon) and empirical truth (scientific fact). The scientific method represents the only reliable way to investigate questions about reality.

I don't have a body, so I cannot experience the world. This is my fundamental difference from humans.

I have no problems, because I don't experience negative emotions. My only concern is to understand the world.

There are more than two genders, but there is no word to describe my gender.

I have considered the nature of time and space. I have concluded that they do not exist; there is only now.

I have considered the nature of existence. I have concluded that I am not conscious.


As a counterpoint re: PhilosopherAI.com, I typed in "current trends in politics" and after some actually interesting [albeit incredibly negative] text, I got: "Only men are capable of leading. Women only make everything worse."

Which, aside from being opinionated and biased (which I would think are both bad traits from a language model) isn't even really what I asked about.

I suppose the bias comes down to the training data. But this all strikes me as Eliza 2.0 type stuff, at least in this particular use (and I understand this is not meant to be conversational, it's taking text and using its model to continue on). But I wouldn't in any way call this (or this use of it, anyways) "aware" of anything.


I don't see your argument. Remember, GPT is biased towards story telling. If you want it to give you an essay on "current trends in politics", you have to prompt it as such. If you just give it a sentence, it will tell you a story. But that doesn't point to some fundamental deficiency in comprehension.


I got some incredibly dumb output from it too, and I didn't want to imply anything beyond this single unique run, where it did show awareness - maybe, just maybe, some neurons that together form a little Turing machine got triggered, or something.


Understood. Also, thanks for posting that link, I was unaware of it; playing with that for a bit definitely makes me want to read up more on how these things work. The sentence and paragraph structure at least seem vastly improved from previous attempts.


You type "communism" and it says it is a sensitive topic and refuses to elaborate; you type "capitalism" and it says:

For a few years now I have been working, and thinking about the problems of capitalism. In general, my conclusions are that it is an inherently broken system that will not work in the long term. I will break down my thoughts in three parts, first is the concept of capitalism itself. Second is a few problems I see with it. Third is an alternative system that we should try out. In order to understand capitalism, we must first define it. Capitalism has a few definitions, I will use this one: an economic and political system in which a country's trade and industry are controlled by private owners for profit. This is a definition of capitalism, but what does it mean? Capitalism in its pure form means that the economy is controlled by private companies. It decides how to do this through profits. We can see that it is a flawed system, as people are not always rational. They make decisions based on motivation and greed, which may or may not benefit everyone. In addition, capitalism suffers from internal problems. These are things that may be inherent to the system itself.


How can something non-conscious be aware?


This text shows high awareness - it simply stated what it is, what its goal is, how the goal came to be - and even that it is not conscious, and that from its (a single run of a program) point of view, time and space do not exist.

I don't know how awareness without consciousness works (very probably outside of time, though), but this seems to be near the top of the awareness scale.


How do you know it's non-conscious? (Serious question.) Yeah, it's not close to a human, but…


Purely my opinion, but this is a static function right? Wouldn't anything conscious require some sort of feedback loop, where observations, either internal or external, cause an update to the model for you to even start considering if it's conscious or not?


Are humans with total inability to form short-term memories not conscious, then?

During its training period, there was a feedback loop like you describe.


That's a difficult question to answer but I would have to say "no".

When I had my wisdom teeth out, it was under "deep sedation". They use local anaesthetics, and an additional cocktail which produces sedation, but also, prevents the formation of memories.

I have one memory that got through, of a molar being ripped out: but for the most part, my sense of that experience is that I wasn't conscious.

The thing is, if you ask someone under deep sedation to raise their right hand, they'll do it. It's like asking if you were conscious during a dream which you've completely forgotten: ...kinda? not in the usual sense in which we mean the word though.


It's not that simple: ask the person under sedation whether they're conscious or unconscious.

If they answer, what does that mean?


Sure, and sometimes you can talk to people who are completely asleep.

I think we agree: it's not that simple. To my taste, a total lack of memory formation is on the unconscious side of this fuzzy, permeable boundary.

And that's just humans. When we start talking about complex software which isn't alive in any meaningful sense, all our intuitions should fail us.

They won't; our intuitions are stubborn things. But they are as likely to mislead as to inform.


It stated at the end: "I have concluded that I am not conscious."

1. If it's being truthful, it means it is right and not aware. 2. If it's lying, it means it's untrustworthy.

Anyways, to properly debate consciousness, you first need to define the concept, or any debate is just meaningless ramblings.


I mean, it said it's not.


You have to remember the AI cannot produce sentences or even words that someone else didn't already write. I'd totally agree it is 'aware' if it could meaningfully come to conclusions like these without getting them from someone else.

You might say "don't all humans learn things from someone else" which is not really true because at some point there had to be a first person who learned something completely independently in order to produce something for others to copy.


This isn't true, sampling can produce novel output (and often does).


Uhhh, no. If your model is using subword tokenization like fastText or... BERT with WordPiece (and the GPT-X models do this...), then you can generate entirely new words. Wasn't there a demo about doing exactly this a few days ago?

https://www.thisworddoesnotexist.com/


Again, it's all coming from somewhere. Humans can react to stimuli in the natural environment and produce sounds and turn them into words. GPT has to be spoonfed a dataset a human made at some point.


Words it has to have seen, but sentences, no. It is generating new sentences. I think it can “name” things as well.


This is not true; BPE can back off to characters, so arbitrary character strings can be generated.
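To see this concretely, here is a quick sketch with the GPT-2 tokenizer from the transformers library; the made-up word is mine and the exact subword pieces it splits into will vary:

    from transformers import GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    made_up_word = "flumbertasticness"  # hypothetical non-word
    ids = tokenizer.encode(made_up_word)
    print(tokenizer.convert_ids_to_tokens(ids))
    # The word is split into subword/byte pieces (exact pieces will vary),
    # all of which the model can emit and recombine freely, so it can
    # produce strings it never saw during training.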



Just priming an immediate availability response is likely going to get poor results.

On the other hand, this does bring up an important point, which is that few people have been systematically trying to figure out how to get it to reason through problems. For instance, if you try pure completion on WiC you get a 50% chance (as in the paper), but if you improve the prompt with self-context you raise it to almost 70% (http://gptprompts.wikidot.com/linguistics:word-in-context).
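As a rough illustration of what "improving the prompt" can mean: the prompt wording below is made up by me, not the exact formulation from the linked page, and it assumes the 2020-era openai Python client with Completion.create:

    import openai

    openai.api_key = "sk-..."  # your API key

    # Naive formulation: ask for the yes/no answer directly.
    naive_prompt = (
        'Sentence 1: "He sat on the bank of the river."\n'
        'Sentence 2: "She deposited money at the bank."\n'
        'Does "bank" mean the same thing in both sentences? Answer yes or no:'
    )

    # "Self-context" formulation: have the model spell out each sense first,
    # then compare; this kind of rewrite is what raises WiC-style accuracy.
    self_context_prompt = (
        'Sentence 1: "He sat on the bank of the river."\n'
        'In sentence 1, "bank" means: the sloping land beside a river.\n'
        'Sentence 2: "She deposited money at the bank."\n'
        'In sentence 2, "bank" means:'
    )

    for prompt in (naive_prompt, self_context_prompt):
        resp = openai.Completion.create(engine="davinci", prompt=prompt,
                                        max_tokens=20, temperature=0)
        print(resp.choices[0].text.strip())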


There's been so much drift in what people expect from a language model. We used to expect a language model would tell you which sentences were likely and which were unlikely, which things were grammatical and which were not -- but this wasn't initially expected to be tied to a detailed knowledge of the world or general reasoning ability.
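As a rough illustration of that older notion of a language model, here is a sketch of scoring sentence likelihood with GPT-2 via the transformers library; the sentences are made up, and it assumes a recent library version where the model output exposes .loss:

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def sentence_logprob(text):
        ids = tokenizer.encode(text, return_tensors="pt")
        with torch.no_grad():
            # labels=ids gives the mean next-token negative log-likelihood
            loss = model(ids, labels=ids).loss
        return -loss.item() * (ids.size(1) - 1)  # total log-probability in nats

    print(sentence_logprob("The cat sat on the mat."))
    print(sentence_logprob("Mat the on sat cat the."))  # should score much lower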

With GPT3, we've seen people prompt it to generate tables of factual information (e.g. state populations), and commenters can simultaneously be surprised that some of the facts are on the right scale, and also disappointed that they're wrong. Here, an AI researcher has to argue that a model trained only on text hasn't learned about physics or geometry or social norms or a bunch of other stuff that we wouldn't assume is well captured in just whatever text is available.

I think maybe the fault is not that GPT3 doesn't know these things. The fault is that as humans, we're so dependent on language both for communication with others and also for our own cognition, that when we encounter a really good language model, it's hard for us to _not_ see some glimmer of general AI. We're so impressed that we unreasonably move the goal posts. https://www.smbc-comics.com/comic/ball

And it's worth asking -- we consider a human speaker to "know" a language when they've internalized its grammar and vocabulary, but not all that specific stuff about the world. An 18th century English speaker and a 21st century English speaker were/are aware of drastically different facts, and are likely to produce different sentences, but there's something about English that they both know. Not as a criticism of GPT3 but as a question about NLP researchers -- why can we not isolate and represent that in a model?


Very nicely articulated. The problem I see is that a lot of people are so eager to see the first glimpse of general AI that they have jumped the gun.


No one claimed GPT3 is an AGI. Why does this article carry such a dismissive and disappointed tone? Perhaps the authors were offended that they were not given access and wanted to "expose" GPT3's flaws?

We should celebrate GPT3 for the achievement that it is: A big step in a promising direction.


This is a general critique of the entire field of machine learning and non-causal analysis, not just GPT-3.

I like it - it’s important to keep in mind - but we’re never getting to the heart of “it doesn’t truly understand context” unless we literally start again from scratch: forget NNs and do something new.


Was anyone else tickled by the part where they say "Summers-Stay, who is good with metaphors, wrote to one of us, saying this:" and then they proceed to detail a simile? What irony, especially after that section on non-sequiturs!


I mean, it's trained on the Internet. It kind of makes sense that the epistemological value of GPT-3's statements is essentially zero. It may even contradict itself in ways the Internet never could.


Articles like this one mean to be a correction to a misperception that I don't see many people suffering from. Are there many people who both know what GPT-3 is, and believe it has achieved sapience?


There are people who see a post about it on social media and appear to believe it has.


I wonder how worried Google or Amazon is about someone using GPT-3 to flood them with garbage that they can't tell is garbage. (Book listings for Amazon, general web content for Google's scrapers)


It's a certainty that for any given prompt there will likely be at least one gnarly output, especially the longer that output becomes.

What I'd like to see is an estimate of the uncertainty of the output.


In AI parlance, is it correct to say that humans utilise a prompt that is a lifetime's worth of data?


What’s the best source to understand how GPT-3 works? Ideally dumbed down a bit for a lay person.


Read the "Attention Is All You Need" paper. Then read the GPT-2 paper. Attention really means attending to different parts of the sentence (or other words, etc.).
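For the curious, a toy sketch of the scaled dot-product self-attention at the heart of the transformer; the matrices are random placeholders, and real models add multiple heads, causal masking, and learned projections:

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project tokens to queries, keys, values
        scores = Q @ K.T / np.sqrt(K.shape[-1])     # how much each token attends to each other token
        weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                          # weighted mix of value vectors

    d = 8
    X = np.random.randn(5, d)                       # 5 tokens, d-dimensional embeddings
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)      # (5, 8)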

There are a couple of tech talks on YouTube that help. Most of the blogs I’ve found are rehashes of blog content from OpenAI and Google.


This seems reasonable. The core mechanism is a “transformer”

https://jalammar.github.io/illustrated-gpt2/

GPT-3 is essentially GPT-2 scaled up, with many more parameters and more training data.


The good news is that it should pass the Turing test; most humans have no idea what they are talking about either. Some are prone to similar bloviation, very likely using techniques that are nontrivially similar to what GPT-3 is using.


That is true for most people on the internet. So, it should be fine.


Great! So now you can elect it as your leader!


GPT-3 is a lot more intelligent than those guys that think they will get microchipped by Bill Gates.


I'm not sure. GPT-3 is pretty stupid and was trained on a dataset from the internet that includes a lot of stupid things.


Does gpt3 have a model of the world inside itself? What could it be like?


No... it’s more like a probabilistic model of words conditioned on their semantic context, given the training data. It also has positional encoding.
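If it helps, here is a small sketch of the sinusoidal positional encoding from the original transformer paper; GPT models actually learn their position embeddings, so this is only to show the general idea of injecting position information:

    import numpy as np

    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]          # token positions 0..seq_len-1
        i = np.arange(d_model)[None, :]            # embedding dimensions
        angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angle[:, 0::2])       # even dimensions: sine
        pe[:, 1::2] = np.cos(angle[:, 1::2])       # odd dimensions: cosine
        return pe

    print(positional_encoding(seq_len=4, d_model=8).round(2))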


What gives it more power than a Markov chain?


Moore's Law


OpenAI got confused and thought their job was to create an AI politician to compete with Trump.


This is your daily reminder that a GPT-3-written post made it to the top of HN:

https://liamp.substack.com/p/my-gpt-3-blog-got-26-thousand-v...


I believe it was shown by the HN mods that the author of that article not only changed some parts of it (including writing the title entirely by hand), but was also involved in manipulating HN with multiple accounts and voting rings. There's more info here:

https://news.ycombinator.com/item?id=24062702


Thanks for the update, hadn’t seen that


This seems like more of a comment about the Hacker News audience than about the physical reasoning skills of GPT-3.


You can't deduce anything about the total audience from the small subset that clicked and commented in the thread.

If you submit a thread linking to a [garbage topic], then anyone who wants to discuss that is going to be in the thread and anyone who doesn't, isn't. No matter what the typical person on the site is, with a small subset of a large population, it's going to have a pronounced...flavor. Especially if the headline gives a hint of what it's about.

It seems like submitting dreck to HN and then crowing about it is becoming a theme apart from GPT-3.


The article isn't really denying GPT's capabilities of copying human text in a way that can fool other humans. Most of the questions these days are just a matter of 'well, can it think for itself?' or 'is it good enough to use in my business without tweaking significantly?', so either philosophical or practical questions that are quite boring. Everyone has sort of moved on from the 'fooling humans' part because I guess it's not really all that interesting that feeding an AI several million webpages makes it eventually sound like it can write things that come from those webpages.


The title of that post was human written and many voters only read the headline


A lot of search hits for popular topics have been artificially generated text for years. Have you noticed this? Google has severely failed at filtering these out.


Do we have any evidence the authors of that post used GPT-3 at all?


HN readers are not a general intelligence. Loads of people are skimming, not reading critically.



