
The Arabic text is the translator's self-credit:

"Translated by Nancy Qanfar"



I know it’s off topic, but it reminded me that translators like to put in Easter eggs, or at least they used to: https://learn.microsoft.com/en-us/archive/blogs/ericfitz/i-a...


And the German is "subtitles of [public broadcaster] for [content network], 2017".

I'm not sure this is really overfitting; the network does exactly what the training data demands. According to the training data, silence at the end transcribes to a copyright notice or subtitle credits.


> I'm not sure this is really overfitting; the network does exactly what the training data demands.

What do you think overfitting is, if not that?


Overfitting would be replicating overly specific details. Like if a specific pattern of silence (or quiet noise) matched to specific copyright notices.

But in this case the behavior seems to generalize over multiple languages, with the model choosing representative "outro silence" captions depending on the language. Which is consistent with the training data showing that outro silence is captioned.

If the model was generalizing perfectly it would show something like "[subtitle credits here]" but that'd be demanding a bit much.

Transcribing outro silence as silence, despite the training data consistently transcribing outro silence differently from regular silence, would be underfitting.


The optimizer is functioning correctly, and the pattern really exists in the training data. But consider:

- This behavior damages the model's performance on out-of-sample data; every word you predict during silence increases the transcript's Word Error Rate (see the sketch below).

- These translation credits are an artifact of our training data, and not a reflection of the process we are modeling (spoken language).

So, while you are right about the mechanism at work here, learning a spurious pattern that damages our out-of-sample performance is still properly called "overfitting".
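
For concreteness, a minimal sketch (plain Python, made-up transcripts; the wer function below is a generic word-level edit distance, not any particular toolkit's API) of how each hallucinated word during silence counts as an insertion error:

    def wer(reference: str, hypothesis: str) -> float:
        """Word Error Rate: word-level edit distance divided by reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # Classic dynamic-programming edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    reference = "thanks for watching"            # the clip ends in silence
    clean = "thanks for watching"                # silence left untranscribed
    credits = "thanks for watching subtitles by nancy qanfar"  # hallucinated credit

    print(f"{wer(reference, clean):.2f}")    # 0.00
    print(f"{wer(reference, credits):.2f}")  # 1.33 -- four inserted words against a three-word reference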


Overfitting is achieving better and better scores on the training material and worse and worse scores on unseen tasks. More at: https://en.wikipedia.org/wiki/Overfitting#Machine_learning

This is just wrong training data.
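
To make the definition above concrete, a toy sketch (synthetic data, all numbers arbitrary): fit polynomials of increasing degree to a handful of noisy samples and compare the error on the training points against the error on held-out points.

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.sort(rng.uniform(-1, 1, 12))
    x_test = np.linspace(-1, 1, 200)

    def signal(x):
        return np.sin(3 * x)  # the "true" process we are trying to model

    y_train = signal(x_train) + rng.normal(0, 0.2, x_train.shape)  # noisy labels
    y_test = signal(x_test)

    for degree in (1, 3, 7, 11):
        coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

Training error only shrinks as the degree grows, but past some point the held-out error climbs because the high-degree fit has started modeling the label noise; that is the analogue of the model here learning that outro silence means subtitle credits.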


Fitting on noise in the training data is exactly what overfitting is. Underfitting is smoothing out the signal.


Overfitting implies a failure to properly generalize from the training data. Here the model generalized it correctly. Garbage in, garbage out.


No. There would have been instances in the data where silence was labelled correctly, but the model couldn't handle the null case, so it overfit on the outros. More generally, it fit on the random error in the labels of the null feature, which is what overfitting is.


Exactly. Underfitting would be if the model doesn't pick up on the fact that outro silence is labeled differently from regular silence and transcribes them the same.


That's literally what overfitting means.

Side note: it's also yet more evidence that AI companies hoover up all data with no regard for legality or copyright status, the very same offences that have landed other people in jail or saddled them with heavy fines.



