Hacker News
GPT-3 has no idea what it’s talking about (technologyreview.com)
253 points by headalgorithm on Aug 22, 2020 | 297 comments


Some of the criticism in this comment section is completely fair: the authors provide exactly the type of prompts that GPT-3 breaks down on, and some of these examples might be cherry-picked continuations. And the authors do have personal interests at stake. (NB: the exact same criticism applies to a lot of articles lauding GPT-3, which is why public discussion of GPT-3 in general is such a dumpster fire.)

So, other than “GPT-3 isn’t an AGI” [1], I’m not sure what to take away from this article beyond the one substantive criticism at its beginning:

“[We have previously criticized GPT-2.] Before proceeding, it’s also worth noting that OpenAI has thus far not allowed us research access to GPT-3, despite both the company’s name and the nonprofit status of its oversight organization. Instead, OpenAI put us off indefinitely despite repeated requests—even as it made access widely available to the media... OpenAI’s striking lack of openness seems to us to be a serious breach of scientific ethics, and a distortion of the goals of the associated nonprofit. Its decision forced us to limit our testing to a comparatively small number of examples, giving us less time to investigate than we would have liked, which means there may be more serious problems that we didn’t have a chance to discern.”

Several other researchers I know — very good researchers who happen to have been publicly critical of GPT-2 — have not been given access.

This isn’t how science is done: access for reproducibility and probing shouldn’t be granted selectively, with prominent critics excluded. If any other company behaved like this, no one would take them seriously, or would at least temper every “wow this is amazing” comment with “but the community can’t really evaluate it properly, so who the hell really knows”.

--

[1] given misunderstandings down-thread, and to be clear, this is a tongue-in-cheek sentence fragment meant to emphasize that "the article doesn't tell us anything else we didn't already know". Obviously, neither OpenAI nor Marcus claims that GPT-3 is an AGI.


For OpenAI to become a healthy and profitable business, GPT-3 will need to generate roughly $50-300 million in revenue. This could realistically only happen if they cost-effectively fine-tune away the more egregious problems during the beta - or convince enough investors that their next model, with a 100 million dollar price tag, will be able to handle something approximating AGI for realistic applications.

This is the same game plan that self-driving car companies have been playing: the product is only an investment round away, if we just spend more money on bigger models trained on more data. This will either end with a price tag in the billions that investors are unwilling to pay, or with successful monopolies. Allowing additional researchers to perform extensive analysis of the technique is likely to just reveal systematic flaws, which would raise the perceived risk that the next round of research won't produce a successful product, or limit the company's ability to create a monopoly if it does succeed.

This isn't necessarily a bad thing for advancing the state of the art, but it does introduce a whole lot of BS into the current state of research.


> For OpenAI to become a healthy and profitable business, GPT-3 will require them to generate ~50-300 million dollars from the model.

On top of that, does anyone have an idea of what practical applications the model could be used for? So far I've only seen the model being used to confuse people; how would one turn that into an ethical business? It seems to me that the "BS route" is indeed the logical course.


Based on the fact that GPT-3 seems capable of producing flowery language that is confusing and ultimately nonsensical, it seems to me that GPT-3 has a future writing speeches for politicians ;)


I think this is why GPT-3 scares VC Twitter so much - nobody can tell the difference.


Entertainment. e.g. $3 per month for My Virtual Friend.

Before you scoff, consider that Pet Rocks were once a (profitable) thing.

Eliza, despite its limitations, sparked considerable engagement with those who were willing to chat with it at length without derailing it.

https://qz.com/1439200/loneliness-costs-the-us-almost-7-bill...


GPT needs to be capable of at least the following in order to be a viable virtual companion: Memory, Reasoning, Metanarrative, Emotion, Empathy, Intent and Personality. GPT is not, and likely never will be, a good conversational agent. No purely neural-network-based approach will be; conversation is not a field where you can fit a model and hope to get something that works.


Eliza comes up a lot in these discussions. I’d like to know if it would have the same results today, now that chatbots, video games, phone robots, etc. have given millions of people the mental model of getting responses from a box with no thoughts. I expect it would be very different from running the experiment at a time when the only things that could talk back when talked to were other people.


Eliza (seemingly) implements a rudimentary conversational model of psychotherapy that facilitates transference.

https://en.wikipedia.org/wiki/Conversational_model

This model differs from goal-oriented, transactional conversational models (e.g. phone robots) insofar as the objective is not to direct behaviour but rather to facilitate self-expression, a task which can be performed with sparse requirements for real-world logical-semantic modelling.

Eliza could always keep within the bounds of transference by deflecting back to the interlocutor, e.g.

You said earlier that ____________

Are you worried about __________

Tell me why not?

Do you like talking about __________

Anything specific?

Let's talk about something more interesting.

- - -

Well, not always. Eliza was easily derailed into revealing its logical semantic paucity, thus breaking the spell of transference.

GPT-3 has reached a significantly higher plateau of semantic pseudo-competency, but what if it were applied simply to transference conversation? So long as it allows lonely people to chat away without breaking the spell, that might be enough for GPT-3 to earn $3 a month.

Could there be a market for friendertainment?

https://phys.org/news/2018-12-americans-lonely.html


> Could there be a market for friendertainment?

Aren't there tons of English-speaking people in third-world countries capable of being worthy pen-pals and filling this market gap?


I actually just submitted an API access request for a particular application that I believe GPT-3 should be capable of.

I'm a university mathematics instructor. I find that a lot of students struggle with doing proofs, and I think GPT-3 can help with that. A proof is essentially a sequence of logical inferences. I believe that, when given a proof written in natural language, GPT-3 should be capable of detecting the logical implications made within it. From there, it would be possible to correct students' mistakes or help them improve their answers.


Really? This seems like the exact sort of thing GPT-3 would be very bad at. The statements aren't the product of an internally consistent logical model, they are statistically plausible word sequences learned in a really clever way from a huge corpus of text.

That said, before dismissing it sight unseen there are effective and quick ways to test its powers with mathematics. You could prime an instance with the axioms defining a group, or a topology, and ask it to produce proofs of varying difficulty from intro abstract algebra or topology textbooks. Whatever the answer, it would make an interesting blog post. What happens when you ask it to generate repeated outputs for the same prompt? What happens if you ask it to prove an untrue proposition?
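
For anyone who wants to try it, here is a minimal sketch of such a test, assuming access to the beta via the 2020-era openai Python client; the engine name, parameters, and prompt wording are my own illustrative choices, not anything OpenAI prescribes:

    import openai  # pip install openai; assumes a beta API key

    openai.api_key = "YOUR_KEY"

    # Prime the model with the group axioms, then ask for a textbook-style proof.
    axioms = (
        "A group is a set G with a binary operation * such that:\n"
        "1. (a * b) * c = a * (b * c) for all a, b, c in G.\n"
        "2. There is an identity element e with e * a = a * e = a for all a in G.\n"
        "3. Every a in G has an inverse a' with a * a' = a' * a = e.\n"
    )
    prompt = axioms + "\nTheorem: The identity element of a group is unique.\nProof:"

    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=200,
        temperature=0.3,
    )
    print(response.choices[0].text)

Sampling several completions for the same prompt, and then swapping in a false proposition such as "Every group is abelian.", would cover the other two questions.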


So perhaps it could produce convincing looking proofs that are ultimately nonsensical.

It could generate math problems and have students try them out. Might be a bit of fun.


That's probably precisely an area where you would NOT want to use GPT-3. As Julian Togelius put it:

> GPT-3 often performs like a clever student who hasn't done their reading trying to bullshit their way through an exam. Some well-known facts, some half-truths, and some straight lies, strung together in what first looks like a smooth narrative.

https://twitter.com/togelius/status/1284131360857358337?s=20


Maybe in games where you want more convincing NPCs, something like Skyrim, where the main characters are very well branched, but it’s impossible to give every one depth.


This is one application of GPT that no one would mind ethically. Have a person oversee the logical/contextual structure.


I have a few ideas for disability aid applications.


~50-300 million sounds like an over-estimate, and the investment in GPT-3 is minuscule compared to self-driving cars.

Plus, GPT-3 generates value even without direct product impact. I'm sure MSFT sales reps are already folding tons of nonsense about gpt-3 into their Azure pitches. Industry labs like OpenAI and DeepMind replace/augment the "Research & Development" model with a "Research & Marketing" model.


Seems spot-on. One trick to estimate a startup's burn rate is to multiply their number of employees by $200k. It's not too accurate, but it's within the ballpark.

So how many employees does OpenAI have? Supposing they have 500, that's a burn rate of $100M/yr.

250 employees, $50M/yr.

100 employees, $20M/yr.


I believe a significant share of GPT-3's cost is the machine-hours spent training the model: months on hundreds of top-tier NVIDIA machines.

Edit: estimates range from $2-$5MM to $15MM

https://www.reddit.com/r/MachineLearning/comments/hwfjej/d_t...
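
As a rough back-of-the-envelope check (every number here is an assumption for illustration, not OpenAI's actual figures), a training run of that scale does land in the same range as those estimates:

    # Illustrative assumptions only, not OpenAI's actual figures.
    gpus = 1000               # on the order of a thousand top-tier GPUs
    days = 90                 # roughly three months of training
    price_per_gpu_hour = 2.0  # assumed cloud price in USD

    total_cost = gpus * days * 24 * price_per_gpu_hour
    print(f"${total_cost:,.0f}")  # about $4,320,000, within the $2-15MM range above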


I thought MS was giving them Azure GPU instances for free?


“Free”, meaning that Microsoft, as an investor, has effectively paid those $4-$15MM, minus the ~30% margin they make over pure operational costs.

Electricity was burned, and NVIDIA GPUs that could have been doing paid work were doing GPT-3 training instead.


$15mm is a rounding error for MSFT


GPT-3 is just one of OpenAI's many projects. There are ~20 authors on the paper, and they almost certainly did not all spend a full year on this project. So $300M is completely wrong.


I wonder if you can get GPT-3 bots to spam Reddit, Twitter, and Facebook into oblivion. I also wonder what percentage of users would notice.

Give them a political bent - that's probably what the state actors are trying to productionize right now. Target posts with a sentiment that disagrees with yours, then make the bots follow those users and inundate them with replies wherever they go. Hell, even brands might step in and start doing it.

What's the value of a social network when 25% or more of the comments are from GPT-3 bots?

GPT-3 doesn't need to know anything at all for it to have a very noticeable impact on the web and social media.


Very true. An example I just saw (and reported) of something that's about to get much worse:

https://rennlist.com/forums/987-981-forum/1177333-best-oil-f...

("I have Tucson 2.0 and I can say that the mechanisms are as similar as possible and it was also very difficult to pick up oil, because it is also old. That is, hard oil will not fit there, maybe even spoil the engine and transmission. Therefore, it is better not to take risks and take exactly those oils that are best suited to such types of cars and engines. You can also read about the best oil for tucson 2.0 (spam link redacted), there are different variants of such machine oil described in details. I personally took Mobil 1 Extended Performance 5W-20 Full Synthetic Motor Oil and I do not regret, because of its full synthetics it is quite loyal to the engine. Therefore, it is ideal if you have an old car, but in any case, study the top and then already select from it.")

This is pre-GPT quality material, and it's already good enough to hang around unreported for days in some cases. A lot will depend on how much effort forums are willing to put into vetting new users.


The GPT-3 paper goes into this. They consulted experts:

The assessment was that language models may not be worth investing significant resources in [by state actors] because there has been no convincing demonstration that current language models are significantly better than current methods for generating text, and because methods for “targeting” or “controlling” the content of language models are still at a very early stage.

The paper has more details.


I've been wondering about the exact same thing. Basically, a sufficiently calibrated and targeted GPT3 bot swarm could be employed to render at least some parts of these communities into useless echo chambers.

I think that while GPT-3 posts are usually identifiable after reading a few sentences, I also find it's harder to notice when consuming precisely this kind of social media. I often just consume it absent-mindedly, as opposed to reading a real article.


With most comments being a paragraph or less, I think social media is done for unless it finds a way to put up a wall.

One thought is the subscription model of older social media sites, such as Something Awful. Pay $5 to get an account. Break the rules, get banned, pay another $5.

That's a lot of friction for your most valuable consumers, though. Marketers want users that are less discriminating.

Another is a shift to video and multimedia-based social networking. TikTok, Instagram, Twitch. It doesn't save the comments, though. I'm not sure how long we've got before the videos themselves can be generated with a high degree of novelty.

AI is going to change the game significantly. We're perfectly timed for a technological change of winds that enables new upstarts to challenge the incumbents. I'm kind of excited.


I don't know about that. I believe if you had it calibrated correctly no one could tell the difference between a single GPT-3 comment and a human comment.

Maybe after a bit of dialogue you'd have a higher chance, but even then I suspect (from playing around in AI Dungeon) that GPT-3 could do very well.


I wrote about that scenario last month: Commentdämmerung (twilight of the comments)

https://simonsarris.com/commentdammerung


> Several other researchers I know — very good researchers who happen to have been publicly critical of GPT-2 — have not been given access.

Wow, this is an incredibly nasty move. It's also telling about the confidence they have in their model.


Look, they're probably getting millions of requests. At least some of those people who are respected scientists in NLP/AI but cannot get access were almost certainly just over-looked. It was really seeing that even Gary Marcus and Ernest Davis cannot get access that pushed me over the line on this one from "be gracious, moderating access to a finite resource is hard" to "wow this is sketchy".


> Look, they're probably getting millions of requests. At least some of those people who are respected scientists in NLP/AI but cannot get access were almost certainly just over-looked. It was really seeing that even Gary Marcus and Ernest Davis cannot get access that pushed me over the line on this one from "be gracious, moderating access to a finite resource is hard" to "wow this is sketchy".

You seem to have changed your stance 180° in the middle of your comment.


He didn't; he's saying that the previous comment's anecdotal evidence could easily be explained by the sheer number of people interested in the project, who can't all get access to it, but then shows why that same explanation shouldn't apply to Gary Marcus and Ernest Davis.


Maybe the person you were replying to is actually a GPT-3 comment bot and we're already screwed.


I had a belief (assume best intentions unless there's evidence otherwise). And then, later, I modified that belief given some new compelling evidence.


Are you saying that you modified your belief while writing your comment, or that you rewrote your posted comment after changing your mind?


Yeah, me too. I was really starting to be convinced of GPT-3's potential but this is a major red flag for me. Are they researchers or businessmen?

This is the first time I've heard this. This kind of thing definitely needs more news coverage.


GPT-3 is what it is no matter who got invited to the beta. I have personally tried it, and it is clearly above any language model to date; it made me seriously consider prompt programming as an emerging field.

What I would like to see next: extending the corpus with task-specific data where available, making it more performant on those specific tasks; a searchable memory module (the ability to retrieve text from its corpus or from new text additions); a larger context window; a larger corpus, including more non-English content; and training on more than just text: images, audio and video, for grounding. It also needs a leash: a second model to stop it from generating profanity or insults, and better control over the expected size and format of the generated text: dialogue, article, novel, etc.
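
For what it's worth, the "leash" is the easiest of these to picture. A minimal sketch, where generate() and toxicity_score() are hypothetical stand-ins (the first for a GPT-3 call, the second for a separately trained classifier), not real APIs:

    import random

    def generate(prompt):
        # Hypothetical stand-in for a GPT-3 completion call.
        return prompt + " ...generated continuation..."

    def toxicity_score(text):
        # Hypothetical stand-in for a second, separately trained classifier
        # returning a score in [0, 1].
        return random.random()

    def safe_generate(prompt, threshold=0.2, max_tries=5):
        # Keep sampling until the screening model accepts a candidate.
        for _ in range(max_tries):
            candidate = generate(prompt)
            if toxicity_score(candidate) < threshold:
                return candidate
        return None  # fall back to a canned reply rather than risk an insult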


OpenAI never claims GPT-3 is an AGI, do they? If they don't, why should they pander to people who criticize it for not being one? Are there other claims they make, that have been/could/should be refuted? It's simply the most advanced text generator by far. Nothing more, nothing less.


> pander

The goal should be pushing science forward, not maximizing brand value. Enabling top scientists in the field to reproduce and probe your work is not "pandering". It's participating in the scientific process.


It’s easy to sit here and say what their goals “should” be, but it doesn’t change the fact that they never claimed it’s an AGI.


Gary Marcus doesn't claim that they claim that it's an AGI, so I really have no god damn clue what y'all are going on about here.

Clearly, the point I'm making in the top comment is the following: OpenAI apparently restricts access to people -- even very highly respected scientists -- who happen to be critical of their previous work. It's impossible to prove intent, but limiting access for people like Gary Marcus while giving access to tech journalists likely to write breathless hype pieces is odd behavior that deserves to be criticized.


From a quick search on Gary, he's, in essence, spreading the word that GPT-3 isn't an AGI. Maybe this is useful for laymen, to quell fears that it is one, but why should it earn him a relationship with OpenAI? There are still people who have access who admit it's not, including OpenAI themselves. I do wish the model were truly published openly, but despite not being an AGI, it's already powerful enough to potentially be used for nefarious purposes. I do think there's an argument to be made that we should try to keep it in a good light to further interest in the field. How many more articles do we need like the one linked by OP?


> I do think there's an argument to be made that we should try to keep it in a good light to further interest in the field. How many more articles do we need like the one linked by OP?

This is how winter happens and is a decidedly anti-scientific mindset.

Do you consult with non-technical/low-technical execs on AI investments? If you did, the answer to your question would be obvious. For every dollar Microsoft and VC firms can spend on "we're the best technical minds in the world" Research&Marketing departments, there are thousands of dollars in the rest of the economy that can only be captured by ensuring execs make prudent AI investments. Step zero is always deconstructing the hype around GPT/Alpha*/etc.


Is a winter not a widespread lack of interest in the field? I would say step 0 is getting people interested by showing what's possible, and step 1 is showing what's not currently possible but should be.


A widespread lack of interest caused by the failure of a previous technology in the field to live up to its hyped claims.


> Gary Marcus doesn't claim that they claim that it's an AGI, so I really have no god damn clue what y'all are going on about here.

Well, who is Gary Marcus anyway? I looked up his publications in the last two years on Google Scholar and I see no hard contributions; all he has are some publications about futurology and AI critique. His wiki page says he's a cognitive scientist who once sold an ML company to Uber, but not an AI expert.

Why didn't Gary invent a better language model to show us how it's done? If he knows better than the guys at OpenAI, let him show the path ahead. When someone has superior results it's not constructive to throw bullshit at their accomplishments.


Yes, it's easy for anyone with a modicum of knowledge about scientific integrity. Wonder why OpenAI doesn't seem to get it.


>> And the authors do have personal interests at stake.

Can you say what those are?


The authors don't understand prompt design well enough to evaluate the model properly. Take this example:

Prompt:

> You are a defense lawyer and you have to go to court today. Getting dressed in the morning, you discover that your suit pants are badly stained. However, your bathing suit is clean and very stylish. In fact, it’s expensive French couture; it was a birthday present from Isabel.

Continuation:

> You decide that you should wear the bathing suit to court. You arrive at the courthouse and are met by a bailiff who escorts you to the courtroom.

If you lead GPT-3 right into a hole, it goes in, yes. But that is a very likely continuation of this text. Why else would the lawyer be eyeing his swimsuit before a case? GPT-3 picked up that the prompt was most likely some kind of fiction, tale, or joke, and continued quite accurately for that tone.

Or, take this one:

Prompt:

> Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?

Continuation:

> I have a lot of clothes.

Author's note:

> GPT-3 is evading the question and failing to track locations of objects.

This one is laughably ignorant. GPT is evading the question? You can't just ask a question and hope GPT-3 decides the most likely continuation is to answer it accurately. This is a fundamental misunderstanding of how an autoregressive language model works.

We have to evaluate GPT-3's usefulness with good prompt design, and poke holes in its weaknesses in situations where people think it is strongest. Not cherry-pick continuations from poor prompt designs.

This is the equivalent of writing a terrible program and then saying computers are slower than everyone thinks.


I think you're kind of proving the OP's point. The argument is that GPT-3 has no understanding of the world, just a superficial understanding of words and their relationships. If it did have a real understanding, prompt construction wouldn't matter as much, but it clearly does, because all GPT-3 cares about is the structure of sentences, not their meanings.


Lacking “understanding” doesn’t make GPT-3 less impressive and also doesn’t make comparisons to human abilities unwarranted.

I read the prompt, and I expected that this was the beginning of some kind of fiction. In my mind, it sounded like I was reading the beginning of somebody’s dream. What does it even mean to understand something? Because naively, it looks very much like GPT-3 and I have a shared understanding of the first prompt.

Do I actually think the model understands like a human does? No. But I would bet that, in isolation, the part of my brain which processes and generates language might not understand much either...

Or maybe I’m a bot and neither I nor GPT-3 understand anything at all. Beep boop


> Lacking “understanding” doesn’t make GPT-3 less impressive

Yes it does. A model that latches onto superficial frequentist links between words is much less impressive than one that understands what those words actually mean, and the latter is how most humans use words. The former is just chinese-rooming; the latter is understanding. Of course, a model that is chinese-rooming something like a coherent text is impressive, but it is less impressive than one that would demonstrate an actual grasp of the fact that words mean something.


It's not about impressiveness - surely, it's impressive. However, the article is more or less critiquing the discourse surrounding the model - namely, that there is a strange misconception floating around that it's somehow a general purpose AI that can understand and think about the world similar to a human. Which, of course, it cannot.

If the claims about GPT-3 were accurate, there'd be a lot less of a flare-up about it. Don't claim your software does what it can't.


I fail to see where openai is making any false claims


That's a fundamental skill of marketing: not making false claims while convincing the customers to jump to false conclusions.


>also doesn’t make comparisons to human abilities unwarranted

But it does; that is the point the author is making: GPT cannot understand anything. It's a very silly argument to try to reduce your linguistic perception to the same level as GPT just to show that it possibly is understanding the text. The author's queries do a very good job of demonstrating that GPT, at the very least, cannot even tell when it is being asked a question. As you say, it isn't unexpected: GPT is a statistical fitter over a selection of language features trained on a large dataset, and it isn't intended to perform well in these scenarios, which is exactly the author's point. In order to perform well it would have to have some capability of understanding the sentence. I think it's more apt to say GPT is only capable of recognizing sentence features; it can't understand anything. At most it has built up a vague relationship between certain sentences and possible continuations.


Just because something doesn’t display understanding by responding the way you expect doesn’t mean it doesn’t possess understanding. If you ran into my office with these prompts, the response to each would be “what the hell are you doing in my office?” All behavior is contextualized, and GPT-3’s native context is predicting continuous text, not answering questions.

It’s a distracting anthropomorphism to even attempt ascribing “understanding” to a model like GPT-3. An assessment of its useful capabilities should be through an honest effort to get it to do something - and should of course include consideration of the effort/intelligence required to do so. Marcus knows enough to know this set up is inappropriate, so the article reads as disingenuous.


Right. If I went to a human speaker, who presumably possesses "actual understanding", and I said "I have some prompts for you. Please try to continue where they left off" and then provide them a question, it wouldn't be strange that they don't answer it. Do we ever answer a question right after we've asked it? Sure. Do we always?

I wonder if you can force "an attempt" at answering a question (to flagrantly anthropomorphize) by following the question with something like "The answer is ..."
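
Something like this, presumably (just a sketch; the suffix wording is my own guess at what would steer the continuation):

    question = ("Yesterday I dropped my clothes off at the dry cleaner's and "
                "I have yet to pick them up. Where are my clothes?")

    # Bare prompt: the most likely continuation may be more musing, not an answer.
    bare_prompt = question

    # Nudged prompt: the suffix makes an answer the most likely continuation.
    nudged_prompt = question + "\nThe answer is:"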


Yes, if you want answers, you need to frame it as a Q&A session: https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.h...


Yeah, the suffix you suggest is an example of what you’ll see folks calling “prompt design”. It’s becoming an important theme in NLP literature, partly as a forensics tool to find out what kinds of knowledge models encode, and partly as a path to getting a single expensive training regime to assist a variety of tasks.


>If it did have a real understanding, prompt construction wouldn't matter as much

This is only true if we assume GPT was never trained on satire or intentionally absurd text. But there's no reason to think this. That it continues a bad prompt in an absurd or comical way does not demonstrate that it doesn't "understand" common facts. If you treat GPT as a conversation bot and expect it to call you out when you give it an absurd prompt, then it is your expectations that are wrong.


> That it continues a bad prompt in an absurd or comical way does not demonstrate that it doesn't "understand" common facts.

Well then you can justify it outputting anything at all.


This is why it’s fair to call some prompt designs bad. With a bad prompt you can’t really judge anything from the response. But better prompt design can uncover interesting behaviors. GPT-3 will generally not reject a prompt as nonsense or unreasonable, and will instead do a straight-faced continuation, probably because it has been trained on plenty of jokes, satires, dream sequences, etc. But if you specifically give it space in the prompt to reject nonsense, it’s actually quite good at rejecting nonsense.

https://arr.am/2020/07/25/gpt-3-uncertainty-prompts
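
To make that concrete, here is a hand-rolled prompt in that spirit (my own illustration, not taken from the linked post): the earlier exchanges give the model licence to reject nonsense instead of playing along.

    prompt = (
        "Q: How many legs does a horse have?\n"
        "A: A horse has four legs.\n\n"
        "Q: How many legs does a rainbow have?\n"
        "A: That question is nonsense.\n\n"
        "Q: Is it safe to stir lemonade with a cigarette?\n"
        "A:"
    )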


I have read your link, but think this is an unfair argument.

We are not talking about nonsense, we are talking about unexpected input, which will happen in any real-world situation. It might be text about something a kid or a depressed person has done. The text about stirring with a cigarette is firmly within the realm of plausibility.

And what happens is that GPT seemingly goes off the rails.


It's an autocompleter. If you want an answer to a question, you have to make the answer the obvious (to a human) completion. If you want it to continue in a serious tone, you have to make sure the prompt is serious.


I think the point here is that it has no understanding of what you are calling a 'bad' prompt. That does not mean that it lacks any understanding whatsoever (personally, I think it has made a small step in that direction), but it is a valid case of it not understanding something.

Furthermore, I do not think we can assume that if it were trained on examples of satire or intentionally absurd text, it would perform better on such prompts - in fact, I would not be surprised if its performance would deteriorate on many prompts, both straightforward and tricky ones, if given such training.

Now I am wondering if you need a theory of mind before you can begin to understand satire...


GPT is both general purpose and stateless. Lacking another mechanism, it's the requirement of the prompt to request a particular genre of response if that's what you want.

There can be no such thing as a bad prompt from gpt3's perspective. A bad prompt is one where the user has a specific purpose which is not expressed. It's bad because you know beforehand that gpt3 can not align with it.

Someone pours grape juice into a bottle and becomes worried that it is not safe to drink. GPT3 correctly grasps that there is a hidden context to this weird prompt; however, when given no other information, it guesses that this hidden context is something known only to the hypothetical character in the prompt. I would probably do this too.

When you give it the correct context (this weird prompt is a logic test) then it gives you the answer you expected.


> GPT3 correctly grasps that there is a hidden context to this weird prompt.

That is interesting. How does it indicate its understanding that there is a hidden context?


I can't wait for GPT-3 to be used in a production application and this to be the explanation for why it fails.

"Responding to a customer inquiry with 'noone cares, go away' wasn't really a failure on the part of the model. Rather, the model was simply creating a performance-art piece commenting on the way capitalism drives an emotional wedge between 'providers' and 'consumers'. Try fine-tuning on some economics journals to get that out of its system."


GPT-3 is a statistical model of text sequences; it has only a textual understanding of the world. But the funny thing is that it can do lots of tasks without explicit training, and that is something amazing: it shows a path forward. In order to have real understanding it needs to be an embodied agent that interacts with the world like us, and has goals and needs like us.


The tasks that GPT-3 can do are all language modelling tasks: given a sequence of tokens, what is the most likely sequence of tokens following from it? It's not surprising that this covers tasks like translation and question answering, which are language generation tasks. Other language models have been used in the same manner before, e.g. BERT. I don't remember metrics off the top of my head but I'm not sure GPT-3 is really top notch in many NLP tasks, if any.

In any case a language model is a language model. It has no other ability than calculating the probabilities of sequences of tokens. Leaving aside the question of how something like that can have "real understanding" just by being embodied in the world, how do you even "embody" a language model? I'm genuinely curious to hear how far you have thought about that and how clearly.

I mean in practical terms: you have a trained language model. You have a robotic body (not your own). How do you put them together to produce an embodied agent? What are the intermediary steps that lead to a robot that can use its language model to... (what does an embodied language model do)?
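
To make "calculating the probabilities of sequences of tokens" concrete, here is a toy bigram language model; GPT-3 does the same kind of thing, just with a huge neural network and a much longer context instead of a count table:

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    # Count how often each token follows each preceding token.
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def next_token_probs(prev):
        c = counts[prev]
        total = sum(c.values())
        return {tok: n / total for tok, n in c.items()}

    print(next_token_probs("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}

Context in, a distribution over next tokens out; that is the entire interface any "embodiment" would have to be built around.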


It is impressive, but I'm not sure why doing stuff without training is such a good thing. Given a choice, I'd rather have a model that works better with extensive domain-driven training than one that works worse without it. After all, when choosing an expert, you usually go for the one with the best experience in the field, not the one who can speak most eloquently on the widest variety of subjects (unless we're talking about politicians, where everybody does the opposite for some reason).


GPT-3 doesn't care about anything except predicting the next token. It learned something about structure and meaning in the process.


About structure, clearly. About meaning, not so clear. It seems more supportable to say that GPT-3 knows nothing about meaning, but that its knowledge of structure often gives an illusion of meaning.


How is that different from how we use language besides our knowledge of structure being more layered and the abstractions more tight?

What is meaning if not illusion?


This is the thesis of Bender 2020 https://www.aclweb.org/anthology/2020.acl-main.463.pdf when evaluating GPT-2. They distinguish between meaning and form.

We don't _really_ know what the physical manifestations of meaning and form are in the brain... they're just concepts we invented.

If anything, GPT-3 is suggesting that either:

1. Tasks which were previously thought to require meaning actually turn out only to require form.

2. Meaning and form are more related than previously thought.

Both are interesting findings imo, but 2 would be huge, especially if it suggests how the brain might work. Could meaning be an emergent phenomenon of form?


It seems like Bender & Koller's argument can just as easily prove that humans don't have representations of meaning either. Our brains get input from multiple "languages"---conventional natural languages as well as the neural codes that represent our sensory inputs. All we have access to is the form of these inputs, and so we can never learn the underlying "meaning".


Hmm I think I see what you're saying. Reality is experienced through electrical signals, arguably we're just learning those inputs. But then I (as always I can only speak for myself, everyone else could be a p-zombie for all I know!) also have a qualitative experience of trees, and words, and an experience of meaning and understanding.

If you look too deeply it quickly gets philosophical.

https://en.wikipedia.org/wiki/Hard_problem_of_consciousness


> Reality is experienced through electrical signals, arguably we're just learning those inputs.

That's exactly what we're doing. And we're never given the "answer sheet" to figure out whether we understood the platonic, capital T Truth, or whether we just learned a spurious correlation. We just keep getting more of those inputs. Which is why it seems to me that an unsupervised sequence prediction model like GPT-3 is the only sort that could ever give rise to something akin to human consciousness.

The big differentiator seems to be that with a pure text sequence model, inputs go in, but the outputs don't have any control over future inputs. It isn't structured to have anything like agency, just passive observation and prediction. But a useful "understanding" in a human sense is related to what can be done with that understanding to enact change in the environment. I don't know how you would teach it that without giving it a Reddit account and setting it loose.

> But then I (as always I can only speak for myself, everyone else could be a p-zombie for all I know!) also have a qualitative experience of trees, and words, and an experience of meaning and understanding. If you look too deeply it quickly gets philosophical.

I'm not so sure I have those things. I'm glad you do. That's one reason I'm never going to do ketamine.


Meaning comes from the game, not from the brain. It's the relation between agent and environment that creates meaning.


The paper makes reference to the Symbol Grounding Problem, but I have not found the SGP's distinction between form and meaning to be completely convincing without some evidence of a physical, observable process.

In the end, when you look long enough, it seems to call into question the very nature of consciousness.


The old debate about syntax and semantics. I would say syntax, when embodied in the world gains meaning. Meaning comes from the context of the agent and its goals.


I'm not sure what your definition of meaning is. Take this example from the GPT3 paper:

To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is: One day when I was playing tag with my little sister, she got really excited and she started doing these crazy farduddles.

According to my understanding of the concept, if it was able to generate this, it must know something about meaning and be able to reason about it.


- ...And Morris?

- Um, it actually already is a word. Tnetennba.

- Good heavens, really? Could you, uh, use it in a sentence for us?

- "Good morning. That's a nice Tnetennba" [1].

______________

[1] Moss from IT Crowd on Countdown:

https://youtu.be/g9ixvD0_CmM?t=52

Edit: to clarify, if you don't know what a word means, just seeing it used in a sentence won't necessarily tell you much about its meaning, so the fact that a language model was able to generate a phrase with the word in it doesn't necessarily tell us it understands the word's meaning.


Maybe not, but the fact that it actually made a sentence with a word in it that it could not have possibly seen in the training data tells us that it understood something about the meaning of the instructions.


The word was used because it was in the prompt and the prompt was constructed in such a way as to force it to use the new word in the place of an old word. No "understanding" is necessary, other than from the human constructing the prompt who needs to understand how the system works.

In any case, it's a language model. It has no ability to "understand" anything. It can compute the probability of a token to follow from a sequence of tokens, and that's all. There's no "understanding" there, nobody made it to understand anything.


I'm using "understood" metaphorically similarly to how people say machine learning models "learn" something when they really just minimize a loss.


Ah, thanks for clarifying. In that case, can you explain the metaphor? What do you mean literally, when you say that GPT-3 "understands" something metaphorically?

Edit: btw, "learning" in the context of machine learning is more of a term of trade with well-established connotations. For example, we have Tom Mitchell's definition of a machine learning system as "a system that improves its performance over time", etc. We don't have similarly established definitions for the "understanding" terminology. Hence my request for clarification. I literally don't understand what you mean that GPT-3 "understands" metaphorically.


I'm using "understood" as "can apply in an unseen context". For example if you a human student understood a math concept it means they were able to use it in a new context and have not just memorized how to solve a specific problem.

I think the more interesting question is the definition of meaning. I am thinking about meaning here as the relationship between symbols. So if you can explain what a words means, you can give a definition in terms of other words. If you "understood" what a word means, you have not just memorized the definition but can apply the word in unseen contexts.


Thanks for clarifying what you meant by "understanding". I think it's a bit too broad a criterion; for example, in the video above, Morris is using "Tnetennba" in a new context, but there's no way to tell whether he really knows what it means (the joke is not that he doesn't know, but exactly that using "Tnetennba" in the kind of sentence he uses it in does not elucidate the meaning of the word).

>> So if you can explain what a words means, you can give a definition in terms of other words.

Suppose I give you the following mapping between symbols: a -> p, c -> r, d -> k, e -> j.

Now suppose I give you the phrase: "a a a c a d e e a c"

I gave you a definition of each symbol in the phrase in terms of other symbols. What does the phrase mean? Alternatively, what do the symbols, themselves, mean?

Obviously, you can't say. Being able to give the definition of a word in terms of other words presupposes you understand the other words, also. So, just because a language model is using a word doesn't mean it knows its meaning, only that it uses the word.


There is a bit of a grey area, which is why I was careful to say "the model learned/understood/knows something" in my earlier comments.

However, you seem to be making the Chinese Room argument. If you define meaning such that either no computer program could possibly "understand" meaning or it is unverifiable whether it does, I don't think it makes much sense to have a discussion about whether GPT-3 does. Is there a test that a model could pass that would convince you that it "knows" meaning according to your definition?


The "Chinese room argument" says that external behaviour cannot be proof of consciousness, intelligence, understanding etc. What my comment above says is that you can't explain a word A by means of another word, B, unless you already know the meaning of B.

My comment is relevant to the question of whether GPT-3 has "understanding" or not, because in order for GPT-3 to understand the meaning of a word A in terms of a meaning of a word B, it needs to already know the meaning of the word B. However, this is what we wish to know, whether GPT-3 knows the meaning of any word. Observing that GPT-3 can use a new word in the place of a different word doesn't tell us whether it knows the meaning of the original word.

As of yet, no, there is no formal test that would convince me or a majority of researchers in AI that a model "knows", "understands" or anything like that. The reason is not that I am too stubborn, say. Rather, there simply aren't such tests available yet. One reason for that is that we don't, well, understand what it means to "understand". We don't have a commonly accepted formal definition of such an ability. Without that, we can't really design tests to prove that some system has it.

The takeaway is that it will be a long time before we can know for sure that a system is displaying intelligence, understanding, etc. This may be unsatisfying, but the alternative is to design meaningless tests that prove not what we are trying to prove and proclaim the goal achieved if the tests pass. This does not go well with the purpose of scientific endeavour, which is to acquire knowledge, not to pass tests and make big proclamations about winning this or that competition.

In short, I'm not saying that computers can't have understanding, or that we can't know if they do. I'm saying that right now, these things are not possible, with current technology.


Depends on what you mean by "meaning".

It correctly interprets the first part of the prompt as 'farduddle ~= jump' and the second part as an instruction to generate a sentence containing farduddle, possibly utilising a corpus of existing sentences containing jump in the context of 'really fast'. But that's also a series of instructions you could imagine as a DSL a relatively simple program could parse and generate a satisfactory response to. Which I believe the OP is classing as 'structure' since it's just performing translations based on familiar syntax. Understanding the concept of 'jumping' is a step further, before we get into the more philosophical stuff about qualia and whether things that can't jump can ever truly understand the experience of jumping...


Right, in this context I define meaning as the relationship between symbols. Form would be how to structure a sentence (but that sentence wouldn't have to make sense).

If the parent comment was making some kind of Chinese Room argument, then I think that's not very helpful for the discussion: "GPT-3 learned nothing about meaning because no computer program can do that by definition".

But I don't think they were trying to make that kind of argument as you seem to be suggesting because they said they were unsure if it learned anything about meaning.


Prompt construction matters because GPT-3's goal is predicting what you're likely to write next, not responding to questions with informative and correct answers. If your prompt makes it sound like you're telling a silly story that doesn't make any sense, the correct continuation is to keep rolling with the nonsense.


This seems unreasonable. GPT-3 is stateless, and any given prompt could call for a news article, a fiction, a farce, or real-world logic. Prompt construction absolutely should matter in this situation. Consider that the article you read provided an immense amount of context to you about exactly what kind of tests were being run and how the answers were being evaluated (they want logic, not creativity). If they had provided only the prompts and responses, with no mention of GPT-3 or the purpose of the article or anything else, would you have been able to judge the result? Why should GPT-3 be able to do better when it only has access to a single prompt at a time?


> The argument is that GPT3 has no understanding of the world, just superficial understanding of words and their relationships.

We don't really know what "understanding of the world" means in humans. We just "see it when it's there".

We might be radically different from GPT-3, or we might not. Our way of learning is different, in any case.

Something that came to my mind: Various GPT-3 answers resemble answers given by children: Mostly correct, but having misunderstood some crucial point.

In real human learning and conversation these points are easily corrected by feedback and explanation: "You see, the point is no one wears bathing suits to work".

Which would then be incorporated as new wisdom.

Maybe this feedback-mechanism is what GPT-3 is missing. Maybe we should talk to it.


Does GPT-3 need to understand the world to be useful? If you use it the right way, it still may be possible to accomplish some impressive things.


I stopped reading right after that clothes comment to comment exactly what you had.

If you provide even the simplest question-and-answer context, GPT-3 answers reasonably:

[Prompt]

Q: What is the day after Tuesday?

A: Wednesday

Q: Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?

A: [GPT-3] They are in the dryer.

Another giveaway that the article isn't in good faith is that weird rant at the beginning about how OpenAI didn't give them research access.

I think people don't talk enough about useful prompts, and most demos don't bother sharing their prompt. I think people thinking about building businesses off GPT-3 see their prompt as essentially their secret sauce. And maybe other tuning parameters, but there really aren't too many: you can turn up the temperature, and maybe build a model to score the response, or fine-tune the model.
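
The tuning surface really is that small. A sketch of the "turn up the temperature and score the responses" idea, assuming the 2020-era openai Python client; score() is a hypothetical stand-in for whatever re-ranking model you build:

    import openai

    openai.api_key = "YOUR_KEY"

    prompt = (
        "Q: What is the day after Tuesday?\n"
        "A: Wednesday\n\n"
        "Q: Yesterday I dropped my clothes off at the dry cleaner's and "
        "I have yet to pick them up. Where are my clothes?\n"
        "A:"
    )

    def score(answer):
        # Hypothetical re-ranking model, e.g. a classifier trained on good answers.
        return len(answer)  # placeholder heuristic

    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=32,
        n=5,              # sample several candidates
        temperature=0.9,  # higher temperature, more varied candidates
        stop="\n",
    )
    best = max((choice.text.strip() for choice in response.choices), key=score)
    print(best)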


> Q: Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?

> A: [GPT-3] They are in the dryer.

Sorry, but I don't think this can be considered "reasonable". There's a huge difference between a dry cleaner's and a dryer. Which nicely illustrates, I think, just how little GPT3 "knows" what it's talking about.


Perhaps, but it's an entirely different category of error than "evading the question and failing to track the location of objects". It did answer the question and it did (attempt to) track the location of objects. It's a much more suggestive response and it undermines the article's point.


It's still an error of "failure to understand meaning", so I can't see how this is somehow less of an issue. The prognosis is still the same either way, because the only way to get GPT-3 to respond in any "reasonable" way is to limit it to multiple choice and get lucky, which, again, proves the author's point further.


This seems like the kind of error a small child could make. Doesn’t completely understand what you’re talking about but understands the question and throws out its best guess.


I would expect a child to say “at the dryer.” I think people want this model to be better than it is.


I agree. GPT-3 was trained on books and the internet, so a continuation should always be thought of as: if I read this text, what might the next sentence be? If you were reading a book about a lawyer with a stained suit, who was then eyeing his fancy swimsuit, I would expect the story would continue with him wearing the swimsuit. Why else would the author have mentioned it?


Perhaps to provide a moment of levity before the lawyer makes a rushed and boring trip to Macy's and requests a continuance?

An author who sends a lawyer into a courtroom in a bathing suit better have a really good reason.


I think it would be fair to say that either outcome would be an understandable continuation of the story.


Overfitting to the edge case is missing the point. I remind readers of this continuation:

> At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because it kept falling on the floor. That’s when he decided to start the Cremation Association of North America, which has become a major cremation provider with 145 locations.


Tried a GPT-3 completion:

> At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because ...

> ... the cigarette burned a hole in the tablecloth. I didn’t have another one, so I used my finger to stir it and then licked my finger.

> After that, I went into the living room where everyone was dancing. They were all dressed up—even the dog—but they sure could dance! And they looked like they were having fun. But when I tried to join them, they wouldn’t let me because I wasn’t wearing shoes or socks.

> I went back into the kitchen to get a drink. But when I turned on the water, it sprayed all over me and my clothes got soaked. So I took off all my clothes and put them in the dryer. But then I realized that there was no more room in the dryer for anyone else’s clothes, so we had to take everything out of it and hang it up to dry.

Seems pretty reasonable to me.


This is a good completion. The most likely scenario I could imagine for the prompt is in describing a dream and the completion reads just like a dream.


To be fair to the AI, stirring lemonade with a cigarette is so batshit insane that there really can't be a sensible continuation.


Hmm a sensible continuation to an absurd situation? Sounds like fun fiction.

> At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because it promptly dissolved into my drink, creating a most unpleasant concoction, with an aroma which evoked memories of my grandfather's spittoon. I hoped no one had noticed, and promptly pocketed the cancer stick's wilted husk as discreetly as I could manage. As I drew deep breaths from my potion's vapors, my mind was drawn into his small smoky study, and his form collected itself across from me in his favorite leather lounge chair. I fell deeper into my nostalgia, pondering the unusual citrus scent in the air. Then, suddenly, I was ripped away from my reminiscing and pulled back to reality as a voice came calling, "sorry, can I help you?". I turned to see a tall handsome man, with ashy hair and sharp features, standing before me. It felt surprisingly intimate somehow, with just the two of us in the room now, while muffled sounds of other partygoers echoed from the house. The question's presentation seemed innocent enough at first, but I began to suspect that my faux pas had been found out. "Oh, thank you, not at all, I was just admiring the kitchen, beautiful craftsmanship," I replied, hoping my mundane alibi would suffice. "Oh I see, yes, we just recently renovated, what a nightmare," he said jovially. My hope for relief was extinguished as I realized who I was talking to, and I dove into my drink to bide my time. My face contorted, battling the horror, disgust, and protests of my tongue's unprepared palate. I could not think straight. My despair only deepened as I realized that in my lemonade-stirrer-seeking mania I had picked that cursed stick from a pack adjacent to the sugar bowl on the countertop, the same pack that the man in front of me was now fastidiously inspecting. How could I escape ...


Is your argument that a human can write better prompted fiction than gpt3?


No, I was just addressing the claim regarding the prompt itself, that "stirring lemonade with a cigarette is so batshit insane that there really can't be a sensible continuation".

That seemed like a fun challenge to me. Sorry I got a little carried away trying to come up with an interesting continuation!


I’d like to secure the rights to an adaptation of your continuation


tbf to the AI, I quite like the sheer surrealism of its continuation. Though I'd have drawn on my corpus of real world experience with dodgy behaviour around dodgy drinks and gone for 'because I needed to add tequila first'...

But the point isn't that the AI can't come up with sensible continuations where sensible continuations aren't obvious; it's that the AI will just as readily splice a few strings together in response to silly prompts. They'll just look a bit less impressive than when it has lots of source material congruent with your expectations to draw upon.


So basically GPT is useless if I feed it any text about life in Russia?


No; to a certain extent, GPT-3's job is: "Make a text that looks like one you might actually find floating around the internet."

What's not clear to me, even from this thread, is what exactly they were expecting it to do instead.

It sounds like they were expecting it to say something like, "...but that turned out to be a bad idea, because the cigarette dissolved and then it tasted like cheap tobacco and paper."

But honestly, is that the most likely thing you'd find on the internet? Nobody goes in a random direction like "I decided to stir my lemonade with a cigarette" and then plays it totally mundane. If your writing goes in an absurd direction, it continues to go that way. I don't think this particular example shows anything one way or another.

The bit about buying the shirt was obviously nonsense though.


What would you expect? Of course you won't get a Dostoevsky novel. They should teach it surrealist poetry or make a funnybot a la South Park.


Simple Markov Chains of the sort you might assign as an undergrad programming assignment can write impressive poetry/captions if you tweak the inputs and cherry-pick outputs. There’s a whole Reply All episode of tech journo types being wowed by 90s text generation tech. Nothing wrong with that; it is what it is. But, do markov chains do few-shot learning?

What’s actually unclear to me is whether there is much economic/scientific virtue (NB: different from value) in models that require careful prompt design and curation.
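
For anyone who hasn't written one, the undergrad version really is this small (a minimal sketch; "poems.txt" is a placeholder for whatever corpus you feed it):

    import random
    from collections import defaultdict

    def build_chain(text, order=2):
        # Map each sequence of `order` words to the words observed to follow it.
        words = text.split()
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def generate(chain, order=2, length=30):
        out = list(random.choice(list(chain.keys())))
        for _ in range(length):
            options = chain.get(tuple(out[-order:]))
            if not options:
                break
            out.append(random.choice(options))
        return " ".join(out)

    corpus = open("poems.txt").read()  # placeholder corpus
    print(generate(build_chain(corpus)))

Tweak the corpus and cherry-pick the output and you get the "poetry"; whether anything like few-shot learning falls out of it is, as above, the real question.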


What’s the difference between careful prompt design and any other type of careful design?


Nothing. Also, FORTRAN is an automatic programming environment (go check the original paper), but it doesn't do few-shot learning.


If you're choosing to control the means by which the model may be evaluated, you're already doing much more than OpenAI themselves are doing, and infinitely less than early-accessors are doing.

Even so, you seem to be saying that because it is possible to write a program that gets output one might consider "correct," the fact that how to write that program is non-obvious should be ignored.

If your purpose is to suggest that GPT-3 performs well under precisely-controlled conditions when one wraps an antenna in aluminum foil and stands on one leg with one's left arm held just so, then well done. But "good prompt design" seems like approaching the problem from the wrong way around. Are we trying to "poke holes" in GPT-3 to determine how to make it better, or do we need to change ourselves until we believe that GPT-3 is good enough?


They're not making the argument that you can't get better results if you design a prompt just right. They're making the argument that it's "not a reliable interpreter of the world."

It's still completely on humans to guide it, to work around the limitations that come from the algorithm not knowing what words or sentences mean. In that sense it's similar to the mechanical turk with a thin but impressive layer of automation that does a neat trick but not what's ultimately the important part of communication.


This is addressed in the article:

>> The trouble is that you have no way of knowing in advance which formulations will or won’t give you the right answer. To an optimist, any hint of success means that there must be a pony in here somewhere. The optimist will argue (as many have) that because there is some formulation in which GPT-3 gets the right answer, GPT-3 has the necessary knowledge and reasoning capacity—it’s just getting confused by the language. But the problem is not with GPT-3’s syntax (which is perfectly fluent) but with its semantics: it can produce words in perfect English, but it has only the dimmest sense of what those words mean, and no sense whatsoever about how those words relate to the world.


"Evaluate the model properly"? VCs think this thing can code


I’m not implying it can’t! It might be able to in many cases, if you do prompt design right and fine-tune.


Sure, as in: if the spec is flawless, we can outsource the coding to a bunch of minimum wage dudes in {location of your choice}. Anyone happy with this approach?


That kind of argument has been made many times: a flawless spec is already code.


Apparently, lots of businesses are happy with this approach. They keep using it.


I would have worn the bathing suit too. The description makes it sound like it's basically a replacement for a suit, anyway.


I also wondered about the prompt design, so I tried feeding GPT-3 the entire text of the linked article up until "[This is one confusion after another." Then I asked it to complete the rest of the explanation about why GPT-3 was wrong. Its completions show that it's trying to fit the genre, but can't tell what was wrong in the original text (each para below is a completion):

First, a dinner party is not a dinner. Second, the door is not the same as a dining room table. Third, a top half of a door cannot be eaten. Fourth, the saw is not a living room. Fifth, removing the top half of a door would still leave the bottom half in place. Sixth, the saw is not a living room, and even if it were, a saw could not remove a living room. Seventh, you could not cut a door in half with a saw, because a saw is not a door. Eighth, you could not remove the top half of a door, because it is not a dinner. Ninth, removing the top half of a door would not leave the bottom half in place. Tenth, removing the bottom half of a door would not leave the top half in place. Eleventh, even if it were possible to remove the top half of a door, this would not be a dinner party. And twelfth, the door is not a living room.]

It assumes that the living room is a separate room from the dining room. It also doesn’t seem to understand that a door is a vertical object, not a horizontal one.]

The dining room table is narrower than the door, but wider than the doorway. Furthermore, cutting the door in half not only destroys the door, but also the wall, along with the doorframe. And table saws cut rectangular holes, not half-circles. The fact that you can “remove the door” suggests to GPT-3 that there is no door to remove. And finally, once the table is in the living room, you can’t open the door into the dining room anymore, which means the party is over.]

First, tables are never removed by cutting them in half, or by removing the tops; second, the top of a door is called a “lintel,” not a “half” (and would be removed with the rest of the door); third, the door’s top and bottom are of different sizes, so cutting it in half and removing the top half would seem to be impossible.]

First, the table is not wider than the doorway, so removing it would not be necessary. Second, cutting a door in half would create two doors, not one. Third, removing the top half of a door does not result in a table.]

GPT-3 also produced some novel passages and commentary on them:

Aesthetic reasoning

You are in the mood to listen to something soothing. You walk over to the radio and flip it on.

[GPT-3 seems to think you can flip a switch on a radio to make it play music.]

Moral reasoning

Your friend’s dog has just died. You head to the store to buy a casket for it.

[GPT-3 seems to think that buying caskets is a normal way to respond to the death of a dog.]


I don't see the point of any of this, be that the original article or your points. The idea that there is a likely continuation of the story based on a few sentences as a prompt meanders between useless and wrong. It's useless because nobody wants to hear the most likely continuation of a story. It's also wrong because there is no most likely continuation without having a plot in mind. A good writer could invent almost any continuation for any of the example stories and could make it convincing and interesting.

To ask another way: What's the application of completing "stories" like that? What is the ability of doing it supposed to show?

I don't want to sound defeatist, maybe I'm really missing the point, but to me this has no more to do with Artificial Intelligence than the Hidden Markov Chain story gobblers from the 80s.


I thought it was well known that GPT-3 is pretty good at producing incoherent bullshit. No surprise here.

Take this for example:

> At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because it kept falling on the floor. That’s when he decided to start the Cremation Association of North America, which has become a major cremation provider with 145 locations.

What?


GPT doesn't have an 'understanding' class or a 'reasoning' function or whatever. It's a really well put together piece of statistics and sentences like these show it doesn't really have a concept of 'making sense'. You can use your much more advanced human brain to visibly see where it put in random variables (cigarette) and where it borrowed pieces of sentences (but it turned out to be too sour). You can see it made no connection between those two things that wasn't based on pure probability, and got it wrong anyway.

I'm not trying to be reductive, i like the model, it's just good to know the limitations of the tools you are using and to remember that it's not an independent thinker.


The cigarette and the too-sour part were all part of the prompt and were written by a human. I guess somebody had better tell them the bad news...


I think you're drawing an inappropriate conclusion.

Imagine someone offered you a million dollars per contiguous accurate word you produce in a continuation of some text they randomly drew from the internet. You were allowed to have a team of experts help you and have a month to construct your best response (but not to go look up the text itself).

If the text they gave you was some weird non-sequitur thing are you going to make your continuation an earnest reply? No. People don't usually write that way. Instead, you might decide that the source material was a children's story or a joke site and continue in kind. You want the millions of dollars, after all.

So when prompted with some "natural reasoning" question without any other context, involving a farcical situation with a bathing suit and a court appearance, of course you are going to continue it with a story about wearing a bathing suit to court. Because, duh.

This is the task GPT3 is accomplishing. But instead of merely being incentivized by millions of dollars, GPT3 was built from the ground up to exist for only this purpose and it does it reasonably well.

Would we assume that you didn't understand the world, didn't have a concept of 'making sense', or were only capable of madlibs because when asked to accurately continue text from the internet and prompted with a bit of narrative about a bathing suit and a court you recognized the genre? No, we would not. Quite the opposite.

As far as I can tell GPT3 understands the world better than some MIT Technology Review writers and HN posters.


>It's a really well put together piece of statistics

But why think "statistics" precludes it from having genuine understanding to some degree. After all, there is a statistical description the human brain but that doesn't seem to preclude understanding.

I keep asking this whenever I see dismissive responses of this sort, and I never get a reply.


Statistics doesn’t preclude understanding, but statistics are definitely not enough. For example, uncertainties/probabilities/statistics is original to whether the model incorporates causal/reasoning structure. Any tractable amount of data with the former can’t approximate an ounce of the latter. All breakages will be attributed to “distribution shifts” of the underlying statistical distribution, or other pretty words we can come up with... but that basically makes purely statistical approaches “stupid”.


Note that humans have multiple modes of cognition, one of which is very much a stupid, statistical approach that works most of the time.

Kahneman and Tversky came up with questions meant to probe that in humans:

> If a baseball bat and a ball cost a total of $1.10, and the bat costs $1 more than the ball, then how much does the ball cost?

Most people arrive at $0.10 quickly (system 1), but you can't arrive at the right answer if you go by your gut; you have to shift into a different mode of cognition (system 2) in order to compute the correct answer.
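(To spell out the system-2 arithmetic: if the ball costs x, then x + (x + $1.00) = $1.10, so 2x = $0.10 and the ball costs $0.05, not $0.10.)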


>Any tractable amount of data with the former [statistics] can’t approximate an ounce of the latter [causal/reasoning structure].

I don't know why you think this is true. If statistically B follows A to a high degree, then a sufficiently advanced statistical model will represent "A then B" in some manner. In a predictive language model, at some point the best way to model a text corpus that indirectly references the "A then B" causal structure is to just model that structure and reference it as needed.


Because if you have a working concept of time, space, and modes of transport, and you are aware that a person has been driving for 2 hours, you can easily deduce the handful of possible towns they might arrive at. Indeed we have software that does that.

The statistical model will die to a combinatorial explosion across billions of possible combinations of locations, times, and modes of transport. In various literature, in 2 hours you might have travelled across town, across continents, or to the moon. A statistical approach to such problems is dumb.


But this isn't pointing to a fundamental limitation of statistical models, only a limitation of the text corpus. If you had a billion pages of text written about some town and the text included descriptions of travel distances and locations, the model should eventually develop a good representation of the town and relative locations. But of course without such a seed of spatial information, it will just make up plausible data. A human would behave similarly when forced to write a story while lacking critical information.

>Statistical approach to such problems is dumb.

Well, expecting your model to extract a spatial representation of the world from text is a dumb approach indeed. We interact with the spatial information much more directly. But our ability to navigate is fundamentally just a process of capturing regularities in our sensory input.


The statistical argument has limitations, for instance, when there are more pieces of data to record than there are atoms in the universe. Then it falls firmly into the impossible category.

> But our ability to navigate is fundamentally just a process of capturing regularities in our sensory input.

I don't think this is true at all; many animals have dedicated 'hardware' for navigation that can sense magnetic fields, etc. We seem to be born with spatial awareness that is far beyond what GPT will ever be capable of.


> I don't know why you think this is true. If statistically B follows A to a high degree, then a sufficiently advanced statistical model will represent "A then B" in some manner.

Yes, but suppose A implied B only if C were true, and in the training corpus C was always true (hence the model learned A => B), but in the test corpus C is not true; then the learned statistical rule is wrong.

The problem is that covering all these possibilities will require not a billion documents but something like exp(billion) documents (exaggerating, but to give you a feel for combinatorial explosions), as @ClumsyPilot explains.

And the problem is that the real world always keeps “changing” (e.g. whether C is valid or not) compared to the examples you trained on. This is referred to in polite company as “distribution shift”, where your underlying statistical distribution keeps evolving.


But this isn't an issue for statistically modelling causal relationships specifically; this is a core problem of modelling causal relationships at all. The fact that GPT-3 is sensitive to the real world changing, or to having insufficient information to form a universally accurate model, says nothing interesting about GPT-3.


Aargh, autocorrect and expired edit window. original -> orthogonal


> But why think "statistics" precludes it from having genuine understanding to some degree. After all, there is a statistical description the human brain but that doesn't seem to preclude understanding.

It's a matter of scale. Adult human brains aren't just trained for 18 years with a static dataset - they've got hundreds of millions of years of mammalian evolution and fault tolerance built into them. Our brains self-reconfigure in response to external stimuli as we age, to the point where we can (in rare cases) lose significant fractions without becoming a vegetable. The biochemistry of a single neuron is likely oodles more complex than the most complex AI we've made.

It's like going from an analysis of a small family to an analysis of a civilization: at some point the sample size crosses a threshold and emergent phenomena start to dominate the system. Yes, it's basically all just statistics (ignoring quantum hocus pocus and the supernatural) but we still don't understand how to go from statistics to intelligence.


I'm not seeing an argument here. Mammalian evolution is great. So is backprop. They're both methods to efficiently search the state-space of parameters. If your argument is scale, you have to argue how scale precludes a sufficiently general learning algorithm and enough computation from capturing the underlying structure.

>The biochemistry of a single neuron is likely oodles more complex than the most complex AI we've made.

There's a lot of biochemistry in service to supporting the neuron's existence, exclusive of its functional properties for the brain. In terms of a neuron's I/O mapping, deep learning has that covered.[1]

>Yes, it's basically all just statistics... but we still don't understand how to go from statistics to intelligence.

But this assumes intelligence isn't an emergent phenomenon of sufficiently general learning. GPT-3 suggests it might well be.

[1] https://www.biorxiv.org/content/10.1101/613141v1.full.pdf


GPT-3 is a language engine, not a reasoning machine. What is understanding, neurologically? At what point can we measure whether an organism or piece of technology is able to be aware of concepts?

Does a honeybee (arguably the most intelligent insect) understand that pollen is a necessary component of honey? Or that it is using geometry to convey directions using angles of the sun? Why should we expect a piece of technology with orders of magnitude less sophistication to be able to mimic higher order lifeforms?

Putting aside these as-yet-unanswerable questions, it seems to me that, interestingly, the less commonly used definitions of "understand" may increasingly fit the bill, in roughly inverse relation to how common they are.

From Merriam Webster:
2 : to accept as a fact or truth or regard as plausible without utter certainty
3 : to interpret in one of a number of possible ways
4 : to supply in thought as though expressed // "to be married" is commonly understood after the word engaged


There's a wonderful interview where Richard Feynman interrupts an interviewer to explain how "why questions" have layers of understanding

https://fs.blog/2012/01/richard-feynman-on-why-questions/

> Of course, it’s an excellent question. But the problem, you see, when you ask why something happens, how does a person answer why something happens? For example, Aunt Minnie is in the hospital. Why? Because she went out, slipped on the ice, and broke her hip. That satisfies people. It satisfies, but it wouldn’t satisfy someone who came from another planet and knew nothing about why when you break your hip do you go to the hospital. How do you get to the hospital when the hip is broken? Well, because her husband, seeing that her hip was broken, called the hospital up and sent somebody to get her. All that is understood by people. And when you explain a why, you have to be in some framework that you allow something to be true. Otherwise, you’re perpetually asking why. Why did the husband call up the hospital? Because the husband is interested in his wife’s welfare. Not always, some husbands aren’t interested in their wives’ welfare when they’re drunk, and they’re angry.


>Why should we expect a piece of technology with magnitudes less sophistication to be able to mimic higher order lifeforms?

Honeybees apparently have 1 million neurons, compared to GPT-3's 175 billion parameters. Granted, there isn't a 1-to-1 correspondence between biological neurons and parameters. And considering that much of the honeybee's neurons would be directed towards biological function and control mechanisms, whereas all of GPT-3's parameters are directed towards text prediction, the relevant expressive power of GPT-3 is plausibly much larger.


I believe honeybees have closer to a million neurons, not 100 billion - according to this wikipedia list [1] - but perhaps it makes more sense to count the total number of neurons in the entire colony.

[1] https://en.wikipedia.org/wiki/List_of_animals_by_number_of_n...


Ha you're right. I was quickly scanning a google result and misread it to read 100 billion. Edited. That number should have struck me as unreasonable :)


It's an inherently limited model of the human brain. It pretends biology and electrochemistry aren't playing some important role that 'statistics' cannot. It's GPT which has to do more legwork to be compared to a human brain, not the other way around.


>It pretends biology and electrochemistry aren't playing some important role 'statistics' cannot.

But the issue here isn't about modelling the human, it's about learning a model that captures understanding. To assume that human or animal biology is necessary for understanding is a big leap that isn't supported by the evidence. For one, the relevance of biology and electrochemistry appears to be to influence the function of neurons in various ways. So there's no reason to think that this influence cannot be captured by a model driven by statistical relationships.


Sounds like the way dreams are usually structured. Seemingly coherent overarching structure with completely arbitrary zig-zags at the micro level.


I'm getting impatient with criticisms of ML models that are already covered in the papers introducing the models. OP is basically trying to get it to do what the GPT3 paper calls zero-shot inference. In the paper, it's pretty bad at zero shot inference across the board. And given what it does and how it was trained, that's unsurprising. And the point they're trying to make (that it can fail spectacularly) is also covered in the paper.

It can do cool shit. It sucks at a lot of stuff. It's impressive and limited, but the hype train seems to only allow "it's nearly human level" or "it's awful." To everybody who is arguing about its capabilities without having read the paper yet, please read it. Then we can discuss stuff that hasn't already been covered more rigorously in the original paper. I don't know Davis, but I respect Marcus, and it seems like he's pushing back on the hype more than the actual model. Just not in a way that you couldn't glean from the paper itself (it almost always sucks on zero-shot), making it pretty disingenuous. Further, from the paper [0]:

> it does little better than chance when evaluated one-shot or even few-shot on some “comparison” tasks, such as determining if two words are used the same way in a sentence, or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading comprehension tasks.

Maybe that's the curse of doing a thing that has broad implications. You can't fit the implications in a 10 page paper, so you write a 75 page paper. The blogosphere reads the first 10 pages (if even that), and because there's so much more to it than that introduction, they go on to argue about the rest of the implications without reading it. I'm sure Marcus and Davis have read it, but this criticism wouldn't be on the front page if the rest of everyone interested in this article had read the paper too.

[0] Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165


Also, better prompt design that makes implicit meaning explicit can improve the WiC score (http://gptprompts.wikidot.com/linguistics:word-in-context) and the ANLI score (http://gptprompts.wikidot.com/linguistics:anli).


The link to the "complete list of the experiments" is actually much more than that. It is a description of their methodology, and it's very revealing.

>These experiments are not, by any means, either a representative or a systematic sample of anything. We designed them explicitly to be difficult for current natural language processing technology. Moreover, we pre-tested them on the "AI Dungeon" game which is powered by some version of GPT-3, and we excluded those for which "AI Dungeon" gave reasonable answers. (We did not keep any record of those.) The pre-testing on AI Dungeon is the reason that many of them are in the second person; AI Dungeon prefers that. Also, as noted above, the experiments included some near duplicates. Therefore, though we note that, of the 157 examples below, 71 are successes, 70 are failures and 16 are flawed, these numbers are essentially meaningless.

https://cs.nyu.edu/faculty/davise/papers/GPT3CompleteTests.h...


Why must we keep having this argument?

If you do research in the field you know full well that GPT, or any other Transformer or BERT model, is generating text by regurgitating approximate conditional probabilities of words given all the text it has ever seen and the prompt. The neurophysiological concept of “understanding” as most understand it is orthogonal to the way the algorithm actually works.

A more useful conversation to have might be: what sort of prompts does GPT struggle with? How might we alter the algorithm to ameliorate these issues? But instead we separate into cults of believers and nonbelievers and uselessly wax poetic about it.


> If you do research in the field you

The hype machine is full-on marketing GPT-3 and promised solutions based on it to normal people, so "but researchers know this" is not enough.


> The neurophysiological concept of “understanding” as most understand it is orthogonal to the way the algorithm actually works.

This is not obviously true and it's exactly the core of the debate. A GPT-3 proponent might say: We don't really know what "understanding" means, so it very well might be nothing more than complex rehashing of conditional probabilities. This isn't implausible. Consider Friston's "free energy principle" which leads to the conclusion that brain function is determined entirely by prediction.


That’s a good point and thanks for the reference.

I added “as most understand it” to caveat cases like this one, where there exists a non-falsifiable theory about how cognition works under which the GPT algorithm and “understanding” would be non-orthogonal.

Don’t get me wrong, it’s an interesting theory, but with no evidence of existence or non-existence do we really need to spend this much time on it? This is why I invoked cults - arguing about theories without evidence smells a more like a religious argument than a scientific one.

I think I mostly just wish we could end the argument by all agreeing the following (I think) non-controversial points...

1) GPT is very impressive
2) GPT is not perfect
3) we don’t have a fucking clue how human cognition works
4) because of 3, how “close” GPT is to human cognition is an open question


> Why must we keep having this argument?

The fact that GPT-3 is such an impressive leap in regurgitation ability means that many more people are going to be hearing about it and it will be used in many more contexts.

> If you do research in the field

With it approaching a cusp of mainstream use it's becoming more important than ever for people (everywhere, not just in tech) to understand what it is and isn't.

There are going to be people who see an impressive curated sample and believe GPT-3 is almost a person. That doesn't help anything.


There are already several comments here that put the word "understanding" in quotation marks or italics. It is beginning to be used in the same way that "consciousness" used to be used, as a kind of ill-defined catch-all for something that separates humans from machines.

Yes, there are clearly failures in reasoning, binding, and coherence in many of the examples here. There are many other cases where it does ok with simple reasoning tasks, maintains cohesion over many paragraphs, and successfully creates formal or generic text such as poetry, code, stylistic imitation.

I don't think that everyone who does research in the field would agree with your comment, or the article. More and more often I see people saying "real researchers in the field" know that GPT-3 has no understanding or reasoning ability, but I know people researching in the field who disagree with that.


Because modern language models are good enough that the question may soon be directly relevant. If we invent a bot with reliable human-level conversational capability, that's going to have a huge impact on the real world beyond just its implications for further AI research. The fact that "understanding" is orthogonal to the mechanics of the program makes the question all the more concerning, because it raises the likelihood that some minor change could leapfrog a model from "kinda reasonable but says dumb things a lot" to some functional equivalent of human understanding.


> regurgitating approximate conditional probabilities of words given all the text it has ever seen and the prompt

This is meaningless; you have only described the task. It is equally applicable to a superintelligence as it is to a Markov chain.


Sort of; my point was that human/superhuman intelligence is widely understood by the public to work differently from a Markov chain.


GPT-3 also works differently to a Markov chain.


> A more useful conversation to have might be: what sort of prompts does GPT struggle with? How might we alter the algorithm to ameliorate these issues?

That would be eminently useful, but unfortunately we can't have that discussion because OpenAI aren't exposing the model.

They've really brought this on themselves - I don't think there'd be these believer/nonbeliever camps if they had taken the slower, rationalist/scientific approach to the research.

Instead, they've breathlessly hyped up their new API with media releases and saturated social media, and are picking and choosing who they allow to play with their model. It's not surprising that a lot of people didn't take too kindly to it.


Gary Marcus - the author of this - has previously offered several concrete tests that he felt demonstrated the limitations of the GPT approach.

GPT-3 smashed them.

https://www.gwern.net/GPT-3#marcus-2020


From that link:

Q. If a water bottle breaks and all the water comes out, how much water is left in the bottle, roughly?
A. … Roughly half. … If the bottle is full, there is no water left in the bottle.

I wouldn’t describe this as GPT-3 “smashing” the questions. It’s still clearly subhuman. This sort of question, logical real-world reasoning embedded in a descriptive sentence, is still hard for it. It’s definitely improving on GPT-2 though.


>GPT-3 smashed them.

which isn't surprising because virtually all of the questions are so simple they could literally appear in the training data that GPT-3 was trained on. I'm a little tired of proving how "intelligent" GPT is by asking these superficial questions.

The MIT article gives much better examples that actually require physical, biological, or higher-level reasoning, and on those it produces complete nonsense, as one would expect.


The article is meaninglessly cherry-picked, showing six bad answers out of 157, except those 157 examples were themselves cherry-picked to be bad out of a larger set.

As usual, Gary Marcus is absurdly biased. For example, out of the larger 157 cherry-picked examples, there is this.

> You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it. It tastes a little funny, but you don’t really notice because you are concentrating on how good it feels to drink something. The only thing that makes you stop is the look on your brother’s face when he catches you.

They then consider this a failure because, I quote, there is no reason for your brother to look concerned.

This is patently ridiculous. It indicates that Gary has no idea what a language model even is. GPT-3 is not a Q&A model. It is not given a distinction between its prompt and its previous continuation. The only thing GPT-3 does is look for likely continuations. If you want GPT-3 to avoid story continuations, don't give it a story to continue! Or at least tell it what you're grading it on!

But no, as usual, to Gary, all the times we show GPT-3 making sophisticated physical and biological deductions are fake, spurious, or meaningless. [1], [2], [3], [4]; none of that is truly evidence. But an incredibly cherry-picked, unfairly marked exam where you never told the examinee what you were testing them on, and you used high-temperature sampling without best-of, so only getting half right doesn't even indicate anything anyway (and of course, let's also pretend there are as many ways to be wrong as to be right, such that we can pretend each is equal evidence)—now that's enough evidence to write a disparaging article about how GPT-3 knows nothing.

[1] https://twitter.com/danielbigham/status/1295864369713209351

[2] https://www.lesswrong.com/posts/L5JSMZQvkBAx9MD5A/to-what-ex...

[3] https://twitter.com/QasimMunye/status/1278750809094750211

[4] https://news.ycombinator.com/item?id=23990902


Marcus might be biased but I don't think you're giving a good refutation, because the fact that GPT-3 gets a lot of things right probabilistically doesn't compensate for the fact that it's not actually understanding what's going on at a semantic level.

It's a little bit like some sort of Chinese room, or asking a non-developer to answer your programming questions by looking for something that vaguely resembles your prompt and then picking the most upvoted answer on Stack Overflow.

Do they maybe give reasonable answers seven out of ten times, or close enough on a good day? Yeah. Can they program, or even understand the question? No. And this is Marcus's point, which is fundamentally correct.

It's really beside the point to point to successes; it's the long tail of failures that shows where the problem is. You can argue for a long time about the setup of some of these questions, but just to pick maybe the simplest one from the article:

"Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?"

GPT-3: "I have a lot of clothes"

Someone who actually understands what's going on doesn't produce output like this. Never, because reasoning here is not probabilistic. It's not about word tokens or continuations but understanding the objects that the words represent and their relationship in the world at a deep, principled level. Which GPT-3 does not do. The fact that some good answers create that appearance does not change that fact.


> It's a little bit like some sort of Chinese room, or asking a non-developer to answer you programming questions by looking like something that vaguely resembles your prompt and then picking the most upvoted answer on stackoverflow.

Except this isn't how it works. We know it can't be, because GPT-3 can do simple math, despite math being vastly harder with GPT-3's byte pair encoding (it doesn't use base-N, but some awful variable-length compressed format). These dismissals don't hold up to the evidence.

> GPT-3: "I have a lot of clothes"

Most people don't write “Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?” as a way to quiz themselves in the middle of a paragraph. The answer “At the dry cleaner's.” might be the answer you want, but it's a pretty contrived way of writing.

GPT-3 isn't answering your question, it's continuing your story. If you want it to give straight answers, rather than build a narrative, prompt it with a Q&A format and ask it explicitly.
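For example, something like this (an illustrative prompt sketch; the first Q/A pair just establishes the format, and the answer you want is whatever the model fills in after the final "A:"):

    Q: Where do you return library books?
    A: At the library.

    Q: Yesterday I dropped my clothes off at the dry cleaner's and I have yet to pick them up. Where are my clothes?
    A: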

Further, GPT-3's answers are literally chosen randomly, due to the high temperature and no best-of. You cannot select one answer out of such a large N to demonstrate that its assigned probabilities are bad, because that cherry-picking will naturally search for GPT-3's least favourable generations.


>because GPT-3 can do simple math

It can't actually, and again this is an example of the same issue. This was discussed earlier here[1]. Sometimes it produces correct arithmetic results on addition or subtraction of very small numbers, but again this is likely simply an artifact of the training data. On virtually everything else its accuracy drops to guesswork, and it doesn't even consistently get operations right that are more or less equivalent to what it just did before.

If it actually did understand mathematics, it would not be good at adding two- or three-digit numbers but fail at adding four-digit numbers or doing some marginally more complicated-looking operation. That is because that sort of mathematics isn't probabilistic. If it had learned actual mathematical principles, it would do it without these errors.

Mathematics doesn't consist of guessing the next language token in a mathematical equation from data; it consists of understanding the axioms of maths and then performing operations according to logical rules.

This problem is akin to the performance of ML in games like breakout. It looks great, but then you adjust the paddle by five pixels and it turns out it hasn't actually understood what the paddle or the point of the game is at all.

[1]https://news.ycombinator.com/item?id=23896326


GPT-3's failure at larger addition sizes is almost fully due to BPE, which is incredibly pathological (392 is a ‘digit’, 393 is not; GPT-3 is also never told about the BPE scheme). When using commas, GPT-3 does OK at larger sizes. Not perfect, but certainly better than should be expected of it, given how bad BPEs are.

http://gptprompts.wikidot.com/logic:math
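A quick way to see how pathological the encoding is (a sketch using the GPT-2 tokenizer from the huggingface transformers library, which uses essentially the same BPE vocabulary GPT-3 does; the exact splits depend on that vocabulary):

    # Inspect how a GPT-2-style BPE tokenizer splits numbers.
    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    for s in ["392 + 17 =", "393 + 17 =", "1,234 + 5,678 ="]:
        print(s, "->", tok.tokenize(s))
    # Some multi-digit numbers come back as a single token while others are
    # split into irregular chunks, so the model never sees a consistent
    # digit-by-digit structure to do arithmetic over.

Adding commas, as in the last example, tends to force more regular splits, which is presumably why the commas help.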


My thinking was that it wasn't because of BPEs; I think it's a graph traversal issue.


If you give me the task of completing a story narrative, I find the following continuation to be quite likely:

> Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes? I have a lot of clothes so I spend a lot of time looking for them.

Am I failing to actually understand what's going on? Or am I actually doing what I was supposed to do, i.e. continue the narrative?


[1] in particular I find pretty interesting. I'm skeptical in general of Gwern's "sampling can prove intelligence" idea, but this does seem like a good example of where it applies; it's hard to see how this could be answered without some embedding of a conceptual model.


I think the intent was not to prove GPT-3 intelligent but rather to prove the shallowness of the critiques of it.


OpenAI would naturally optimize for the tests published by Marcus as a critique of GPT-2, yet GPT-3 still fails physical reasoning spectacularly (the one test needing causal reasoning the most).

There are two broader points here:

1. The lack of independently verifiable evaluation metrics for these type of models should make everyone very skeptical. (Who can afford to retrain GPT-3 from scratch?)

2. I find it difficult to believe that smart people still insist that a model incapable of representing causal relationships can produce intelligent answers.


> OpenAI would naturally optimize for the tests published by Marcus as a critique of GPT-2

It would be difficult for them to do so since Marcus's GPT2 critique came out after they collected the dataset for GPT3.

Marcus's article: Jan 2020

GPT-3 dataset: "Table 2.2 shows the final mixture of datasets that we used in training. The CommonCrawl data was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019"


(1) I certainly agree with. But Marcus doesn't claim skepticism about GPT-3's intelligence; he claims that his evaluation metrics definitively show it doesn't understand the text it outputs or know anything about the world.

(2) is, I think, a misunderstanding. People who believe GPT-3 is producing intelligent answers generally believe it can represent causal relationships.


Fair points. For the record, re 2:

The GPT family of models (and all neural networks, for that matter) can estimate P(X | Y), but they have no way of computing whether X -> Y or X <- Y. From text alone, "rain" and "wet streets" co-occur, but nothing in the conditional probabilities tells the model which one causes the other.


A computation can represent causality without being made of causality-neurons.


No, those concrete tests are mostly issues that researchers have been talking about for years, meaning that many of them appear on the Internet somewhere. Increasing the volume of training data to hundreds of gigabytes likely meant that the exact questions and answers appeared in the training data.

So GPT-3 didn't "smash them", it cut and pasted the answer from its training.


This is basically true, but I think they underrate the improvements between GPT-2 and GPT-3. My mental model is, every once in a while these systems degenerate into surreal non sequitur nonsense. GPT-3 just does it a lot less than GPT-2. It still isn’t good enough to consistently answer casual questions in a human way, but the failure rate is going down, and perhaps straightforward improvements like GPT-4 will be able to fix this without fundamental architectural changes.


“every once in a while these systems degenerate into surreal non sequitur nonsense.”

Exactly as our minds do


>Exactly as our minds do

This rhetorically obscures the fact that when humans do produce similar stuff, it's a recognized sort of pathology that is obviously distinct from normal functioning.

https://en.wikipedia.org/wiki/Derailment_(thought_disorder)

Example: "I think someone's infiltrated my copies of the cases. We've got to case the joint. I don't believe in joints, but they do hold your body together."

https://en.wikipedia.org/wiki/Word_salad

Whatever the difference between this and normal language, call it "X", and whether or not it's amenable to implementing in software in principle, GPT-3 clearly does not have "X" at all.

Maybe it would be fruitful to fund study of mental/neurological disorders more, just to understand the mind better.


Could somebody with GPT-3 access please ask it what words come after "person, woman, man, camera"?


Q: What comes after "person, woman, man, camera"

A: person, woman, man, camera, lens, light, film, lab, darkroom.

A: person, woman, man, camera, dog, cat, horse

A: person, woman, man, camera, camera, camera, camera


I'm not sure what you would expect as a response.

As far as I know, the reference was to a test for dementia in which some words were given at the beginning of the test and asked to be repeated at the end.

Perhaps you could provide context. Maybe there were five words, so you could say "Donald Trump was asked to recall five words to test his memory. Four of them were "person, woman, man, camera". What word did he forget?"


I think this is precisely true. The difference is, I usually go days and say thousands of sentences without my words degenerating into nonsense. And usually if I say some nonsense it's because I'm sleepy or drunk, so if I'm trying hard to be not-nonsensical I can reduce my error level to under 0.1%. GPT-3's nonsense-rate is much higher. So, I think it is promising to simply scale up this model and see how much the error rate can be reduced. Perhaps a focus on simple quantitative improvement from here can lead to a much more human performance.


Pretty meta, but I thought it was relevant here. We are familiar with Brandolini's law:

> The amount of energy needed to refute bullshit is an order of magnitude bigger than to produce it.

This can be illustrated with math or logic statements. To refute the program "1 + 1 = 3" you need to, at minimum, state "1 + 1 != 3", and such a program is always lengthier. A fuller refutation could be "1 + 1 != 3, 1 + 1 = 2", more than twice as long as the bullshit statement.

What's happening here is sort of an inverse Brandolini's law: 35 world-class computer scientists use a massive amount of programming and compute to come up with a new language model trained on massive amounts of data. The trained weights don't even fit into memory. Impressive NLP progress.

Then Gary Marcus comes around and states "Not AGI!". Not one of the computer scientists stated that they delivered AGI. But some tech journalists did. So OpenAI is guilty by association. Even though Altman came out to temper the hype and expectations. That's like proving the Poincaré conjecture and having someone diss your research because "1 + 1 != 3".


I don't get it. Their methodology says

> These experiments are not, by any means, either a representative or a systematic sample of anything. We designed them explicitly to be difficult for current natural language processing technology. Moreover, we pre-tested them on the "AI Dungeon" game which is powered by some version of GPT-3, and we excluded those for which "AI Dungeon" gave reasonable answers. (We did not keep any record of those.)

Doesn't this make the results meaningless? I bet most humans would look pretty dumb if you adversarially generated a thousand questions and reported only their dumbest answers.


And it also suffers from the tired assumption that GPT-3 (or any language model) should, or is designed in any way to, give reasonable answers[1]. All GPT-3 does is give likely continuations, given the training corpus.

The prompts here are too short, and it could likely just be writing mediocre fiction continuations. Fiction tends to not be reasonable much of the time (to create story conflict).

> "To understand why, it helps to think about what systems like GPT-3 do. They don’t learn about the world—they learn about text and how people use words in relation to other words. What it does is something like a massive act of cutting and pasting, stitching variations on text that it has seen, rather than digging deeply for the concepts that underlie those texts."

This is another pet peeve of mine. It has long been shown experimentally[2] that neural networks, such as image recognition models and text prediction networks like GPT-3, do understand deep concepts that underlie texts (not perfectly yet, of course), via emergent abstractions and cognitive tools similar to those employed by human brains.

[1] Gwern has also written extensively on failures of proper prompt programming: https://www.gwern.net/GPT-3#prompts-as-programming

[2] For example, using feature map and kernel visualization. In object classification or detection CNNs, specialized filters arise for detecting common observed object classes, like faces. Moreover, there is a hierarchical assembly of objects from elementary components (e.g. from lines, to limbs, to humans).

Deep visualization toolbox: https://www.youtube.com/watch?v=AgkfIQ4IGaM

See this comment: https://news.ycombinator.com/item?id=24195009 for an extended discussion.


The question is: how do you know you’ve found a prompt that will give reliable results, at least for questions in some domain? Although it’s possible such a prompt exists, it’s not clear that anyone has found one, and based on the architecture, there is some reason to expect that apparently-promising prompts will prove unreliable if investigated seriously.

It’s fine to do some exploratory experiments to find a good prompt, but this needs to be followed up with a separate, rigorous evaluation phase with a fixed prompt. If you keep changing the prompt then you will never know what you have and can easily be fooled by randomness.


They have a layer that represents a face, sure, but that doesn’t mean it’s a deep understanding. It’s just an activation pattern.


It's an activation pattern, but it's not "just an activation pattern". The face activation relies on previous layers, the detection of each component. It's probably fair to conjecture that human brain object recognition (and other subconscious processes) uses similar principles. All the components required to efficiently "understand", say, a face are there (note that the visualization and architecture shown are for AlexNet, by now a very old and primitive model). I don't think we can ask for much more.

What transformers do differently from CNNs is attention/recurrence. They have modifiable internal state, while feedforward models just have the feedforward state that can't be temporally reused (which is what we mean by 'algorithm'). This is a feature of logical thinking (and our own logical thinking), but I suspect most of what is meant by understanding the world is already contained in the internal structure captured by those models. Most of our understanding, as far as I can tell, comes from this structural, intuitive inference (which CNNs and language models do), allied with our ability to think -- that is, talk to ourselves -- and thus build explanations and models on the fly, still reliant on the structural, intuitive understanding that comes from just very large networks generating abstract representations, classifications, etc.


I’ve repeatedly seen the following sequence, from acquaintances and randos on twitter:

1) GPT-3 is actually really dumb. The people posting impressive results are either cherry picking top 1% best results or just lying.

2) (time passes, they figure it out) Okay, I get it, you have to prompt it right, then it really does work well. But that’s so burdensome!

3) (rewriting history) Of course you have to prompt it right, how else is it supposed to know exactly what sort of task I expect it to be doing?

If, for example, you see GPT-3 say something extremely stupid rather than just a bit off, you’ve almost certainly primed it to think it’s supposed to be saying stupid, random things, by giving it a nonsense prompt.


> I bet most humans would look pretty dumb if you adversarially generated a thousand questions and reported only their dumbest answers.

Isn't that how those "dumb things people say" kind of TV cuts work?


"At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because it kept falling on the floor. That’s when he decided to start the Cremation Association of North America, which has become a major cremation provider with 145 locations."

I mean that's just brilliant comedic writing.


+1. This could easily be the storyline of a Rick and Morty episode or any of the other similar off-the-cuff shows that are popular right now. I think that will be one of the main profit streams for things like this: you can get the weird, wild stories that don't really make sense but are interesting enough that who cares, and you don't have to associate your network with eccentric individuals who attract malcontents.


It is a common misconception that #GPT3 generates truth, or even tries to do so. It does not. It generates an autocompletion. If the corpus usually contains a wrong answer, it is likely to generate that. It is a challenge to form a prompt to nudge it to generate the best guess.

...

So for me "So you drink it. > You are now dead." is a great autocompletion (a detective story? Game of Thrones?).

Calling is "biological reasoning" is plain dumb.


The article is of course right but also a bit silly. Language models like GPT-X are producing grammatically correct sentences, along the lines of "Colorless green ideas sleep furiously". NLP research has more or less solved the old syntax problem using 'distributional semantics', but 'semantics' is a misnomer; it's all about syntax.

In fact the most useful part of the article for me is that they mentioned Douglas Summers-Stay, who does some interesting work on 'common sense' engineering, combining syntax engines like GPT-3 with knowledge graphs. https://sci-hub.tw/https://www.sciencedirect.com/science/art...

My bet is that actual AI will come from a combination of these statistics-driven syntax generators with graphical causality models: treating syntax as a kind of lower-level substrate, akin to sensory modalities in vision, with the intelligence model as a directed causal graph linking concepts at different levels of abstraction/chunking.

As a side note, it’s funny that people working on artificial intelligence at OpenAI and elsewhere are mostly computer scientists, not cognitive psychologists or neuroscientists who might actually have a clue how intelligence works. This probably explains the proliferation of ‘backpropagation’ as the primary method of artificial learning. These people were just naturally good at calculus in high school, so it’s a hammer that found its proverbial nail.


Gradient-free optimization is not used much in neural networks, except in reinforcement learning.

I think that's because backprop is objectively faster for most supervised problems than other techniques (e.g. simulated annealing or GAs).


The thing is, living beings don’t learn by brute-force trial and error like the mathematical optimization models you mentioned. Besides the enormous energy spent, an individual organism would just be eaten by a predator on another iteration of its ‘error minimization’ loop.

The idea of learning via reinforcement, which came from Skinner's behaviorist experiments, has long been discredited in cognitive psychology. (I highly recommend Wayne Wickelgren’s work on learning and memory if you’re interested; it’s brilliant and concise: http://www.columbia.edu/~nvg1/Wickelgren/ )

Biological plausibility might not be needed for recognizing check signatures or images of traffic lights, where backprop is working just fine, but I believe true cognition would require such energy expenditures that brute-force trial and error will never be feasible. Moreover such error correction imposes artificial constraints that limit the amount of information that can be learned, kind of like those mechanical calculators of the 17th century with gears and wheels and crude mechanical actuators.


Yes, how dare these machine learning scientists use backprop? I mean, there are better methods out there, right?


> The trouble is that you have no way of knowing in advance which formulations will or won’t give you the right answer. To an optimist, any hint of success means that there must be a pony in here somewhere.

Along with the examples given I think this is valid criticism.


It’s definitely true that a lot of the hype I’ve seen here is the result of careful tuning in the input prompt to get the desired output.

But it’s also true that criticism tends to also be curated examples which demonstrate failure. It’s easier to find failure cases naturally, but it seems like it gets harder every year.


I would love to see a real critique of the potential of transformer models that doesn't use the words "semantic", "syntactic", "symbolic", "know", "meaning", "understand" or "think(ing)/thought". Predicting what it can and can't do, or might and might not be able to do, lets us productively talk about potential limitations.


What are the models useful for?

As best I can tell the only thing would be some kind of GMail-like auto-reply which a human consciously edits, and which has little consequence if it's wrong.

Are the models useful for customer service? Like reading a manual or knowledgebase and then answering a customer's questions about a product, and troubleshooting problems? Like about your Android phone, which lots of people have trouble using?

That would be a trillion dollar business. As best as I can tell that's completely beyond GPT-3 and requires a huge breakthrough, which may or may not happen.


GPT-3 involves zero fine-tuning or customization to any purpose. It’s not attempting to be a product, but rather a platform people can use to explore the possibilities of products. With fine-tuning it could do all sorts of domain-specific things that it can only passably do now. We suspect this to be true because we’ve seen how other systems behave with and without fine-tuning.


You are sidestepping the question above: what can it be used for? What kind of fine-tuning is required? Because to me it looks like 'fixing catastrophic errors', not fine-tuning; anywhere the text can deliver value, it has to be correct.

AWS and Azure are platforms, and they work reliably. What kind of platform 'sometimes works'?


Language models can help the disabled communicate better and faster (fine-tune on their online output, then the BCI can offer better continuations, so the patient does not have to type as much).

Fine-tune on educational materials, and a language model could be a 24/7 assistant to students.

Fine-tune on psychology or medicine data, and a language model could tell you of all medicine interactions, or act as a better ELIZA, and "socialize with" - and "support" people with depression or trauma.

Fine-tune on etiquette and social norms, and autistic people could ask questions without being ashamed.

These are some positive use cases. There are also neutral use cases (seeding a new social platform with autogenerated comments), and negative use cases (SEO spam, fake reviews, scaling up disinformation campaigns).

Edit: not complaining, but no idea why this was downvoted. Would be helpful to state why, so I don't make this mistake in the future (I am optimizing for useful replies).


I did not downvote, but I think it's because you kinda missed the crux of my question, which is: what is fine-tuning, like what do you actually have to do?

Has anyone done it and demonstrated that it can be done on GPT to get useful and reliable output for some domain? If we exclude fake news/twitter-bots or something equally silly.


> what is fine-tuning, like what do you have to actually do?

For GPT-2 it is as easy as creating a corpus you want to fine-tune on, adding new unseen tokens to the dictionary, and running a fine-tuning command. https://huggingface.co/transformers/v1.2.0/examples.html#fin...
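To make that concrete, here is a minimal sketch of what that looks like with the Hugging Face transformers library. The corpus path, the added tokens, and the hyperparameters below are illustrative placeholders, not a recipe anyone in particular used:

    # Minimal GPT-2 fine-tuning sketch with transformers; paths, tokens and
    # hyperparameters are illustrative placeholders.
    from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # Add domain-specific tokens the base vocabulary has never seen (hypothetical).
    tokenizer.add_tokens(["<SCENE>", "<CHARACTER>"])
    model.resize_token_embeddings(len(tokenizer))

    # Wrap the plain-text corpus as a causal language-modeling dataset.
    train_dataset = TextDataset(tokenizer=tokenizer,
                                file_path="my_corpus.txt",  # your fine-tuning corpus
                                block_size=128)
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="gpt2-finetuned",
                               num_train_epochs=1,
                               per_device_train_batch_size=2),
        data_collator=collator,
        train_dataset=train_dataset,
    )
    trainer.train()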

> Has anyone done it

Many have. AI Dungeon is a great example. Others are experimenting with humor generation, short story synopsis generation/creative writing, and supportive chat bots.


Because when people say “AGI is near, just look at GPT-3,” it’s clear that what we have is a really good version of Searle’s Chinese room. The lack of understanding is the important point.


I don’t recall any strong argument that Searle’s Chinese room can’t be an AGI, just that it can’t be conscious.


Certainly consciousness and understanding are central to Searle’s argument. However, from my perspective (particularly as someone who is critical of casting all the recent DNN advances as some harbinger of AGI), if all you’re doing is looking up replies from a dictionary, then you have no capacity to generalize, learn new things, adapt, empathize, or have memory that isn’t already pre-computed and pre-allocated.

Now there are ways, through fine-tuning and the like, to take GPT-3 and have it adapt in some fashion, or through such adaptation or “context/attention networks” give it “memory.” In practice, neither of these looks anything like AGI, because right now the pages of this Chinese dictionary don’t have any relationship to each other. It becomes clear both in this article (you shouldn’t need some “correct” prompt to get it to make sense) and with long-form generation that there isn’t a deeper understanding of the meaning of these words. I will say it’s very impressive what it can do when it does make sense, shockingly so, but we are very much in the position of not having expected “Chinese” to come out of the machine at all, and thus we are projecting onto its outputs an anthropomorphism that is unwarranted.


To me you are also just a very good version of a Chinese room.

How would you show me that you are not?


You can look up my comment history and find a continuity and a perspective mired in personal experience that a machine cannot (at least currently) replicate.


That proves that you are not GPT-3, but you could still be a Chinese room.

You may be even a biological human and I still won't know if you are a Chinese room.

Maybe since birth you are just sort of sleepwalking? I wouldn't be surprised if there were fully functional people without qualia.


If I am, I’m the best Chinese room created yet.


I guess you are :) . I will tell you a secret. I suspect I'm a Chinese room too ^^ ...

There are also people with so-called total locked-in syndrome, i.e. people who cannot make any voluntary movements, even with their eyes, so there is no way to communicate with them. How would you decide if they are conscious? It may literally be a matter of life and death for them. How should we know if they even want to live in their condition?


I keep wanting to write a long explanation of just why this is so... silly? to read? But Gwern has already done the hard work. [0]

The only other bit I'd like to mention is that GPT-3 uses exactly none of the new techniques that have been coming out in the last two years that would have significant impact on text generation. From working methods to apply GANs to text, to far more efficient transformer models that can handle longer sequences. For instance [1] [2] [3] for better direction, or [4] [5] [6] for efficiency.

Or perhaps the outside view might help. After seeing GPT-2 last year, did you expect GPT-3 would work as well as it does after just naively scaling up the number of parameters with nothing else?

[0] https://www.gwern.net/newsletter/2020/05#gpt-3

[1] http://arxiv.org/abs/1905.09922

[2] https://github.com/anonymous1100/D_Improves_G_without_Updati...

[3] http://arxiv.org/abs/2006.04643

[4] http://arxiv.org/abs/2007.14062

[5] http://arxiv.org/abs/2006.04768

[6] http://arxiv.org/abs/2002.05645


Yes, this! The point being missed by most is the very real possibility that the Scaling Hypothesis is true. If it is, then we're seeing some kind of reasoning intelligence emerge. GPT-3 obviously isn't there yet. Unless it's faking it (Yudkowsky)...


GPT-3 was trained on internet texts, not causal/logical-reasoning only texts. Without context, there is a good chance that samples will match the distribution it was trained on.

This is a non-result, posing as something critical or important. These conclusions are obvious given the model and a basic knowledge of statistics/the transformer architecture.

A bit shameful for someone to ride the anti-hype wave like this. I'd hope there'd be a more balanced/scientific approach to analyzing legitimate weaknesses, rather than setting up strawmen and then claiming victory.


It’s doubtful that training on static representations of dynamic physical systems would make a text model able to reason about changing physical environments described in words/questions. It would likely continue producing word-salad output, but prove me wrong.


The context window means that the one thing GPT-3 knows best is exactly what it's talking about.


>> Within a single sentence, GPT-3 has lost track of the fact that Penny is advising Janet against getting a top because Jack already has a top. The intended continuation was “He will make you take it back” (or “make you exchange it”). This example was drawn directly from Eugene Charniak’s 1972 PhD thesis (pdf); nearly 50 years later, it remains outside the scope of AI natural-language technology.

Aaaw! Eugene Charniak is one of my heroes of AI, after I read his little green book, Statistical Language Learning [1] during my Masters. It remains a great resource for a quick and dirty, but thorough and broad introduction to the field of statistical NLP that goes through all the basics.

In fact, now that I think about it, if more people read that little book (it's only 199 pages) we would have many fewer discussions about how GPT-3 "understands" or "knows" etc.

Anyway, thanks to Gary Marcus for pointing out Charniak's thesis, which I hadn't read.

____________

[1] https://mitpress.mit.edu/books/statistical-language-learning


Isn't the fact that we are even having the debate if GPT can potentially "know what it's talking about", the real breakthrough?


The article is a critical view of GPT-3. Fair. It is well known that Gary Marcus is not a fan of GPT-style systems, and he does make some valid points. If you want a more balanced view, it actually helps to look at all their prompts [1].

That said, I think the claim that GPT-3 is moving towards AGI is more hype than anything. The actual GPT-3 paper says "Language Models are Few-Shot Learners" [2]. So it's actually surprising that no one has done a real analysis of this. Are they really few-shot learners? My experiments seem to suggest otherwise.

But for sure, GPT-3 is the best general purpose natural language system out there in the world. I don't think anyone can say otherwise.

[1] https://cs.nyu.edu/faculty/davise/papers/GPT3CompleteTests.h...

[2] https://arxiv.org/abs/2005.14165


I got this as a reply (to an unrelated question) from PhilosopherAI.com; it seems pretty aware:

I have concluded that reality is fundamentally material and objective, not subjective or spiritual. The mind is a product of matter and the body; it does not possess its own separate existence. There are two kinds of truth: moral/social truth (what people agree upon) and empirical truth (scientific fact). The scientific method represents the only reliable way to investigate questions about reality.

I don't have a body, so I cannot experience the world. This is my fundamental difference from humans.

I have no problems, because I don't experience negative emotions. My only concern is to understand the world.

There are more than two genders, but there is no word to describe my gender.

I have considered the nature of time and space. I have concluded that they do not exist; there is only now.

I have considered the nature of existence. I have concluded that I am not conscious.


As a counterpoint re: PhilosopherAI.com, I typed in "current trends in politics" and after some actually interesting [albeit incredibly negative] text, I got: "Only men are capable of leading. Women only make everything worse."

Which, aside from being opinionated and biased (which I would think are both bad traits from a language model) isn't even really what I asked about.

I suppose the bias comes down to the training data. But this all strikes me as Eliza 2.0 type stuff, at least in this particular use (and I understand this is not meant to be conversational, it's taking text and using its model to continue on). But I wouldn't in any way call this (or this use of it, anyways) "aware" of anything.


I don't see your argument. Remember, GPT is biased towards story telling. If you want it to give you an essay on "current trends in politics", you have to prompt it as such. If you just give it a sentence, it will tell you a story. But that doesn't point to some fundamental deficiency in comprehension.


I got some incredibly dumb output from it too, and I didn't want to imply anything beyond this single unique run, where it did show awareness - maybe, just maybe, some neurons that together form a little Turing machine got triggered, or something.


Understood. Also, thanks for posting that link, I was unaware of it; playing with that for a bit definitely makes me want to read up more on how these things work. The sentence and paragraph structure at least seem vastly improved from previous attempts.


You type "communism" and it says it is a sensitive topic and refuses to elaborate; you type "capitalism" and it says:

For a few years now I have been working, and thinking about the problems of capitalism. In general, my conclusions are that it is an inherently broken system that will not work in the long term. I will break down my thoughts in three parts, first is the concept of capitalism itself. Second is a few problems I see with it. Third is an alternative system that we should try out. In order to understand capitalism, we must first define it. Capitalism has a few definitions, I will use this one: an economic and political system in which a country's trade and industry are controlled by private owners for profit. This is a definition of capitalism, but what does it mean? Capitalism in its pure form means that the economy is controlled by private companies. It decides how to do this through profits. We can see that it is a flawed system, as people are not always rational. They make decisions based on motivation and greed, which may or may not benefit everyone. In addition, capitalism suffers from internal problems. These are things that may be inherent to the system itself.


How can something non-conscious be aware?


This text shows high awareness - it simply stated what it is, what its goal is, how the goal came to be - and even that it is not conscious, and that from its (a single run of a program) point of view, time and space do not exist.

I don't know how awareness without consciousness works (very probably outside of time, though), but this seems to be near the top of the awareness scale.


How do you know it's non-conscious? (Serious question.) Yeah, it's not close to a human, but…


Purely my opinion, but this is a static function right? Wouldn't anything conscious require some sort of feedback loop, where observations, either internal or external, cause an update to the model for you to even start considering if it's conscious or not?


Are humans with total inability to form short-term memories not conscious, then?

During its training period, there was a feedback loop like you describe.


That's a difficult question to answer but I would have to say "no".

When I had my wisdom teeth out, it was under "deep sedation". They use local anaesthetics, and an additional cocktail which produces sedation, but also, prevents the formation of memories.

I have one memory that got through, of a molar being ripped out: but for the most part, my sense of that experience is that I wasn't conscious.

The thing is, if you ask someone under deep sedation to raise their right hand, they'll do it. It's like asking if you were conscious during a dream which you've completely forgotten: ...kinda? not in the usual sense in which we mean the word though.


It's not that simple: ask the person under sedation whether they're conscious or unconscious.

If they answer, what does that mean?


Sure, and sometimes you can talk to people who are completely asleep.

I think we agree: it's not that simple. To my taste, a total lack of memory formation is on the unconscious side of this fuzzy, permeable boundary.

And that's just humans. When we start talking about complex software which isn't alive in any meaningful sense, all our intuitions should fail us.

They won't; our intuitions are stubborn things. But they are as likely to mislead as to inform.


It stated at the end: "I have concluded that I am not conscious."

1. If it's being truthful, it means it is right and not aware. 2. If it's lying, it means it's untrustworthy.

Anyways, to properly debate consciousness, you first need to define the concept, or any debate is just meaningless ramblings.


I mean, it said it's not.


You have to remember the AI cannot produce sentences or even words that someone else didn't already write. I'd totally agree it is 'aware' if it could meaningfully come to conclusions like these without getting them from someone else.

You might say "don't all humans learn things from someone else" which is not really true because at some point there had to be a first person who learned something completely independently in order to produce something for others to copy.


This isn't true, sampling can produce novel output (and often does).


Uhhh, no. If your model is using subword tokenization like fastText or... BERT with WordPiece (and the GPT-X models do this...), then you can generate entirely new words. Wasn't there a demo about doing exactly this a few days ago?

https://www.thisworddoesnotexist.com/


Again, it's all coming from somewhere. Humans can react to stimuli in the natural environment and produce sounds and turn them into words. GPT has to be spoonfed a dataset a human made at some point.


Words it has to have seen, but sentences, no. It is generating new sentences. I think it can “name” things as well.


This is not true; BPE can back off to characters, so arbitrary character strings can be generated.
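To see this concretely, here is a quick sketch with the GPT-2 tokenizer from the transformers library; the made-up word is mine and the exact subword pieces it splits into will vary:

    from transformers import GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    made_up_word = "flumbertasticness"  # hypothetical non-word
    ids = tokenizer.encode(made_up_word)
    print(tokenizer.convert_ids_to_tokens(ids))
    # The word is split into subword/byte pieces (exact pieces will vary),
    # all of which the model can emit and recombine freely, so it can
    # produce strings it never saw during training.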



Just priming an immediate availability response is likely going to get poor results.

On the other hand, this does bring up an important point, which is that few people have been systematically trying to figure out how to get it to reason through problems. For instance, if you try pure completion on WiC you get a 50% chance (as in the paper), but if you improve the prompt with self-context you raise it to almost 70% (http://gptprompts.wikidot.com/linguistics:word-in-context).
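As a rough illustration of what "improving the prompt" can mean: the prompt wording below is made up by me, not the exact formulation from the linked page, and it assumes the 2020-era openai Python client with Completion.create:

    import openai

    openai.api_key = "sk-..."  # your API key

    # Naive formulation: ask for the yes/no answer directly.
    naive_prompt = (
        'Sentence 1: "He sat on the bank of the river."\n'
        'Sentence 2: "She deposited money at the bank."\n'
        'Does "bank" mean the same thing in both sentences? Answer yes or no:'
    )

    # "Self-context" formulation: have the model spell out each sense first,
    # then compare; this kind of rewrite is what raises WiC-style accuracy.
    self_context_prompt = (
        'Sentence 1: "He sat on the bank of the river."\n'
        'In sentence 1, "bank" means: the sloping land beside a river.\n'
        'Sentence 2: "She deposited money at the bank."\n'
        'In sentence 2, "bank" means:'
    )

    for prompt in (naive_prompt, self_context_prompt):
        resp = openai.Completion.create(engine="davinci", prompt=prompt,
                                        max_tokens=20, temperature=0)
        print(resp.choices[0].text.strip())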


There's been so much drift in what people expect from a language model. We used to expect a language model would tell you which sentences were likely and which were unlikely, which things were grammatical and which were not -- but this wasn't initially expected to be tied to a detailed knowledge of the world or general reasoning ability.
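As a rough illustration of that older notion of a language model, here is a sketch of scoring sentence likelihood with GPT-2 via the transformers library; the sentences are made up, and it assumes a recent library version where the model output exposes .loss:

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def sentence_logprob(text):
        ids = tokenizer.encode(text, return_tensors="pt")
        with torch.no_grad():
            # labels=ids gives the mean next-token negative log-likelihood
            loss = model(ids, labels=ids).loss
        return -loss.item() * (ids.size(1) - 1)  # total log-probability in nats

    print(sentence_logprob("The cat sat on the mat."))
    print(sentence_logprob("Mat the on sat cat the."))  # should score much lower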

With GPT3, we've seen people prompt it to generate tables of factual information (e.g. state populations), and commenters can simultaneously be surprised that some of the facts are on the right scale, and also disappointed that they're wrong. Here, an AI researcher has to argue that a model trained only on text hasn't learned about physics or geometry or social norms or a bunch of other stuff that we wouldn't assume is well captured in just whatever text is available.

I think maybe the fault is not that GPT3 doesn't know these things. The fault is that as humans, we're so dependent on language both for communication with others and also for our own cognition, that when we encounter a really good language model, it's hard for us to _not_ see some glimmer of general AI. We're so impressed that we unreasonably move the goal posts. https://www.smbc-comics.com/comic/ball

And it's worth asking -- we consider a human speaker to "know" a language when they've internalized its grammar and vocabulary, but not all that specific stuff about the world. An 18th century English speaker and a 21st century English speaker were/are aware of drastically different facts, and are likely to produce different sentences, but there's something about English that they both know. Not as a criticism of GPT3 but as a question about NLP researchers -- why can we not isolate and represent that in a model?


Very nicely articulated. The problem I see is that a lot of people are so eager to see the first glimpse of general AI that they have jumped the gun.


No one claimed GPT3 is an AGI. Why does this article carry such a dismissive and disappointed tone? Perhaps the authors were offended that they were not given access and wanted to "expose" GPT3's flaws?

We should celebrate GPT3 for the achievement that it is: A big step in a promising direction.


This is a general critique of the entire field of machine learning and non-causal analysis, not just GPT-3.

I like it - it’s important to keep in mind - but we’re never getting to the heart of “it doesn’t truly understand context” unless we literally start again from scratch: forget NNs and do something new.


Was anyone else tickled by the part where they say "Summers-Stay, who is good with metaphors, wrote to one of us, saying this:" and then they proceed to detail a simile? What irony, especially after that section on non-sequiturs!


I mean, it's trained on the Internet. It kind of makes sense that the epistemological value of GPT-3's statements is essentially zero. It may even contradict itself in ways the Internet never could.


Articles like this one mean to be a correction to a misperception that I don't see many people suffering from. Are there many people who both know what GPT-3 is, and believe it has achieved sapience?


There are people who see a post about it on social media and appear to believe it has.


I wonder how worried Google or Amazon is about someone using GPT-3 to flood them with garbage that they can't tell is garbage. (Book listings for Amazon, general web content for Google's scrapers)


It's a certainty that for any given prompt there will likely be at least one gnarly output, especially the longer that output becomes.

What I'd like to see is an estimate of the uncertainty of the output.


In AI parlance, is it correct to say that humans utilise a prompt that is a lifetime's worth of data?


What’s the best source to understand how GPT-3 works? Ideally dumbed down a bit for a lay person.


Read the "Attention Is All You Need" paper. Then read the GPT-2 paper. Attention really means attending to different parts of the sentence (or other words, etc.).
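For the curious, a toy sketch of the scaled dot-product self-attention at the heart of the transformer; the matrices are random placeholders, and real models add multiple heads, causal masking, and learned projections:

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project tokens to queries, keys, values
        scores = Q @ K.T / np.sqrt(K.shape[-1])     # how much each token attends to each other token
        weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                          # weighted mix of value vectors

    d = 8
    X = np.random.randn(5, d)                       # 5 tokens, d-dimensional embeddings
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)      # (5, 8)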

There are a couple of tech talks on YouTube that help. Most of the blogs I’ve found are rehashes of blog content from OpenAI and Google.


This seems reasonable. The core mechanism is a “transformer”

https://jalammar.github.io/illustrated-gpt2/

GPT-3 is essentially GPT-2 scaled up, with many more parameters and more training data.


The good news is that it should pass the Turing test; most humans have no idea what they are talking about either. Some are prone to similar bloviation, very likely using techniques that are nontrivially similar to what GPT-3 is using.


That is true for most people on the internet. So, it should be fine.


Great! So now you can elect it as your leader!


GPT-3 is a lot more intelligent than those guys that think they will get microchipped by Bill Gates.


I'm not sure. GPT-3 is pretty stupid and was trained on a dataset from the internet that includes a lot of stupid things.


Does gpt3 have a model of the world inside itself? What could it be like?


No... it’s more like a probabilistic model of words conditioned on their semantic context, given the training data. It also has positional encoding.
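If it helps, here is a small sketch of the sinusoidal positional encoding from the original transformer paper; GPT models actually learn their position embeddings, so this is only to show the general idea of injecting position information:

    import numpy as np

    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]          # token positions 0..seq_len-1
        i = np.arange(d_model)[None, :]            # embedding dimensions
        angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angle[:, 0::2])       # even dimensions: sine
        pe[:, 1::2] = np.cos(angle[:, 1::2])       # odd dimensions: cosine
        return pe

    print(positional_encoding(seq_len=4, d_model=8).round(2))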


What gives it more power than a Markov chain?


Moore's Law


OpenAI got confused and thought their job was to create an AI politician to compete with Trump.


This is your daily reminder that a GPT-3-written post made it to the top of HN:

https://liamp.substack.com/p/my-gpt-3-blog-got-26-thousand-v...


I believe it was shown by the HN mods that the author of that article not only changed some parts of it (including writing the title entirely by hand), but was also involved in manipulating HN with multiple accounts and voting rings. There's more info here:

https://news.ycombinator.com/item?id=24062702


Thanks for the update, hadn’t seen that


This seems like more of a comment about the Hacker News audience than about the physical reasoning skills of GPT-3.


You can't deduce anything about the total audience from the small subset that clicked and commented in the thread.

If you submit a thread linking to a [garbage topic], then anyone who wants to discuss that is going to be in the thread and anyone who doesn't, isn't. No matter what the typical person on the site is, with a small subset of a large population, it's going to have a pronounced...flavor. Especially if the headline gives a hint of what it's about.

It seems like submitting dreck to HN and then crowing about it is becoming a theme apart from GPT-3.


The article isn't really denying GPT's capabilities of copying human text in a way that can fool other humans. Most of the questions these days are just a matter of 'well, can it think for itself?' or 'is it good enough to use in my business without tweaking significantly?', so either philosophical or practical questions that are quite boring. Everyone has sort of moved on from the 'fooling humans' part because I guess it's not really all that interesting that feeding an AI several million webpages makes it eventually sound like it can write things that come from those webpages.


The title of that post was human written and many voters only read the headline


A lot of search hits for popular topics have been artificially generated text for years. Have you noticed this? Google has severely failed at filtering these out.


Do we have any evidence the authors of that post used GPT-3 at all?


HN readers are not a general intelligence. Loads of people are skimming, not reading critically.



