
Total trash cloaked in a complicated story.

What they actually did is ask 5 random people to rate what they thought a language model could do to help different professions. These 5 random people don't know anything about the professions they're rating, just what anyone off the street knows, and they know as much about GPT as anyone who has briefly played with it.

The title should have been "We asked 5 friends what they thought about GPT and the labor market"



This is what they are even admitting to:

Under "3.4 Limitations of our methodology" - "3.4.1 Subjective human judgments"

> A fundamental limitation of our approach lies in the subjectivity of the labeling. In our study, we employ annotators who are familiar with the GPT models’ capabilities. However, this group is not occupationally diverse, potentially leading to biased judgments regarding GPTs’ reliability and effectiveness in performing tasks within unfamiliar occupations. We acknowledge that obtaining high-quality labels for each task in an occupation requires workers engaged in those occupations or, at a minimum, possessing in-depth knowledge of the diverse tasks within those occupations. This represents an important area for future work in validating these results.


For sure. I had to read the paper to discover it's trash.

But if you read the abstract, it looks like they thoroughly assessed how GPT will impact many professions.

The sentence "Using a new rubric, we assess occupations based on their correspondence with GPT capabilities, incorporating both human expertise and classifications from GPT-4." does not scream to me "We asked 5 random people with no expertise in either these professions or GPT-4 what they thought and report those results".

This is borderline dishonest.


I agree with you. This is not scientifically sound research. Reads more like a brochure to be honest.


Eh ... they report pretty good alignment with other studies on the topic, so there's at least some signal. Whether their labels contribute any new information is unknown, and the forecasts of any of the literature they cite are untestable (except by the wait-and-see approach).

That said, some attempts at prognostication are preferable to a collective shrug, and people at OpenAI are better positioned than others to assess what GPT-4+ is (or will be) capable of, while clearly under-equipped to map those capabilities to the intricacies of 1000 occupational categories.


I can't find how many people labeled the DWA task descriptions; where did you get that number?

The article seems to describe the labeling here:

> Human Ratings: We obtained human annotations by applying the rubric to each ONET Detailed Worker Activity (DWA) and a subset of all ONET tasks and then aggregated those DWA and task scores at the task and occupation levels. To ensure the quality of these annotations, the authors personally labeled a large sample of tasks and DWAs and enlisted experienced human annotators who have extensively reviewed GPT outputs as part of OpenAI’s alignment work (Ouyang et al., 2022).

I understand the authors, four of them, did the initial labeling and then asked an undefined set of people to do the rest of the labeling.
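
For what it's worth, the aggregation step they describe is pretty mechanical. Here's a minimal sketch of what "aggregating DWA and task scores at the task and occupation levels" might look like; the label-to-score mapping, the mean aggregation, and the example data are my assumptions, not taken from the paper:

  # Hypothetical sketch: roll per-DWA rubric labels up to occupation level.
  # Label names (E0/E1/E2) follow the paper's exposure rubric; the numeric
  # mapping and the use of a simple mean are assumptions for illustration.
  from collections import defaultdict
  from statistics import mean

  LABEL_SCORE = {"E0": 0.0, "E1": 1.0, "E2": 0.5}  # assumed mapping

  # (occupation, task, DWA) -> annotator label (made-up example data)
  annotations = {
      ("Technical Writer", "Draft documentation", "Write user guides"): "E1",
      ("Technical Writer", "Draft documentation", "Review legal compliance"): "E0",
      ("Technical Writer", "Edit copy", "Proofread text"): "E1",
  }

  # DWA scores -> task scores
  task_scores = defaultdict(list)
  for (occ, task, dwa), label in annotations.items():
      task_scores[(occ, task)].append(LABEL_SCORE[label])

  # task scores -> occupation scores
  occ_scores = defaultdict(list)
  for (occ, task), scores in task_scores.items():
      occ_scores[occ].append(mean(scores))

  for occ, scores in occ_scores.items():
      print(occ, round(mean(scores), 3))  # Technical Writer 0.75

The modeling choices (how E2 is weighted, mean vs. max, etc.) matter far less than where the labels come from in the first place, which is the whole point of the criticism above.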


It is stated that they use the same annotators that trained/filtered ChatGPT's output. I would assume it's a rather large group (my company has 10 auditors in Nicaragua). The label biases mostly stem from that group and - as suggested - could be removed by using experts in each field to annotate the labels. But given some responses here by experts, I am sure those expert labels would have their very own biases :p


The paper is not of the highest quality, as indicated by typos and mislabels, but the analysis is likely as good as it can get for the given methodology. Dismissing any signal is just pure hubris.



