
Indeed, with another model I would get persistent transcriptions of silent parts into 'Thanks for watching!' or '[MUSIC]'. Pretty dumb that this failure mode wasn't caught in some QA process, and there are now multiple transcription models suffering from the same issue. Having silent parts in your input audio seems like it should be a very common occurrence...


When I was taught mathematics, the zero value was always considered the most important edge case. You prove something for N=0 (or N=1), then for N=M+1.

It's even more important in audio DSP: processing near-zero values can end up being extremely CPU-intensive. Look up denormal/subnormal floats.
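For anyone who hasn't run into them, here is a small Python sketch of what "subnormal" means for a 64-bit float. It only shows where the subnormal range begins and what a flush-to-zero step would do; it does not measure the slowdown, which depends on the CPU. The helper names `is_subnormal` and `flush_to_zero` are made up for illustration.

```python
import sys

# Smallest positive *normal* double (about 2.2e-308).
smallest_normal = sys.float_info.min
# Smallest positive *subnormal* double, 2**-1074.
smallest_subnormal = 5e-324

def is_subnormal(x: float) -> bool:
    """True if x is nonzero but smaller in magnitude than any normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min

def flush_to_zero(x: float) -> float:
    """Replace subnormal values with 0.0, mimicking a flush-to-zero mode."""
    return 0.0 if is_subnormal(x) else x

# A decaying signal (think of a reverb tail) eventually drifts
# out of the normal range and into subnormal territory.
sample = 1.0
steps = 0
while sample != 0.0 and not is_subnormal(sample):
    sample *= 0.5
    steps += 1
```

Audio code commonly avoids the problem either by enabling a hardware flush-to-zero mode or by adding a tiny offset to the signal, so the values never linger in that range.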


Yeah, I studied mathematics (algebra and number theory), and zero is often the point where discontinuities or weird asymptotic behavior show up.

Quite a lot of algorithms use some form of division, and zero is the only number in our typical structures (Z, Q, R, C) that you cannot divide by.


In machine integer arithmetic, one must also beware of division by -1, which can convert MIN_INT into MIN_INT with a signed overflow and violate some arithmetic invariants, such as sign (negative divided by negative is _usually_ positive).
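A quick way to see this is to model 32-bit wrapping arithmetic in Python, whose own integers never overflow. This is only a sketch: `wrap_int32` and `wrapping_div_int32` are hypothetical helpers, and what real hardware or a real language does with this division (wrap, trap, or undefined behavior) varies.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def wrap_int32(value: int) -> int:
    """Reduce an arbitrary Python int to the 32-bit two's-complement range."""
    return (value + 2**31) % 2**32 - 2**31

def wrapping_div_int32(a: int, b: int) -> int:
    """Truncating division with the result wrapped to 32 bits."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    # Truncate toward zero, as machine integer division does.
    quotient = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        quotient = -quotient
    return wrap_int32(quotient)

# The true quotient is +2**31, which does not fit in 32 bits,
# so the wrapped result is negative again.
result = wrapping_div_int32(INT32_MIN, -1)
```

Here `result` equals `INT32_MIN`: a negative divided by a negative comes out negative, which is exactly the broken sign invariant.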


Well, now in this brave new age of AI we can enjoy computer programs crashing with an

    Error: division by please upvote, share and like!


This also works; I upvoted your comment.


I have discovered a truly marvelous proof of how to smash that like and subscribe button, which this comment box is too small to contain.


Signed by Pierre de FermAIt


NaN


Denormals are flushed to zero by default on most GPUs by the way.


Makes total sense, execution time is bounded. The point is it's still a case you must consider (what if near-zero is distinct from zero and significant?)


whisper MUST be combined with silence detection / VAD


Ah, the good old "you're holding it wrong".

What good is a speech recognition tool that literally hears imaginary voices?


Considering that if you DO use VAD (voice activity detection), it's the best open weights voice recognition model by a very wide margin, it's quite good. I'd be willing to bet that commercial products that "don't have this problem" are using VAD as well, and that this is well known to them. But Whisper is just the weights, and I suppose a simple reference implementation, not a full product.


> What good is a speech recognition tool that literally hears imaginary voices?

Well, if it is supposed to work after silence detection, then it is good for speech recognition, I guess. It's like blaming a wheel for being circular because you can't sit on it. It's a part of a larger machine.


Just lay the wheel on its side and it makes a fine seat.


>imaginary voices

On the other hand, I can imagine that when things get quiet and the signal-to-noise ratio gets close to zero, random background audio (or randomness introduced in the transcription model) will be enough to tickle a critical number of neurons and elicit hallucinations.

The related thought exercise is this: Try scanning across the band with an AM or sideband radio, and after a while your brain will start to wonder "was that a voice I just heard, or music perhaps?" when in reality it was just environmental static.


Yes, you are holding it wrong. The good of it is that it does not output imaginary voices when used with VAD.

Show us a technology with better results that does not use VAD. If you can’t, then I’m not sure what you’re arguing against except superficialities so inconsequential that I can’t comprehend the condescension. The results speak for themselves.


faster-whisper has a min_silence_duration_ms option
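For reference, a minimal sketch of how that option is typically passed, based on the faster-whisper README as I remember it. The model size, the 500 ms value, and the wrapper name `transcribe_with_vad` are placeholders, so check the current docs before relying on it.

```python
def transcribe_with_vad(audio_path: str, model_size: str = "small") -> list[str]:
    """Transcribe a file with faster-whisper's built-in VAD filter enabled."""
    # Imported lazily so this module still loads where faster-whisper
    # is not installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(
        audio_path,
        vad_filter=True,  # drop non-speech audio before decoding
        vad_parameters=dict(min_silence_duration_ms=500),  # placeholder value
    )
    # `segments` is a generator; transcription runs as it is consumed.
    return [segment.text for segment in segments]
```

Calling `transcribe_with_vad("audio.mp3")` would then return the text of each detected speech segment.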


There are much higher quality VAD solutions available


Please name a couple to get someone started who's hacking on webapps?

I'd really appreciate it.


(as would future readers, I'm sure)



I last used Silero but haven’t kept up with the state of the art, so I didn’t mention it.


So if a tool has a process to have it perform at its best then it's a problem?

Do you also moan that you need to prep a surface before applying glue or it won't stick? Or that you need to drill a guiding hole before making a larger one in wood? Or that you need to use truly prime numbers for a security key to actually be safe?


What's a good starter VAD lib, and if you know, the best implementation of something like this to use in a browser-based app?

Say if I wanted to use it for Voice Nav, or Voice Input, but not piss off random people speaking the wrong language.


If that's truly the case then they should make it part of the product, IMHO.


How is it not the case? It is unusable without VAD or editing. I don't understand what you're questioning.

I agree their products could be better "end to end" integrated. Meanwhile there is a continuously-improving field of work for detecting speech (which Whisper is incapable of). They offer official "cookbooks" with guidance on an approach they recommend: https://cookbook.openai.com/examples/whisper_processing_guid...

> At times, files with long silences at the beginning can cause Whisper to transcribe the audio incorrectly. We'll use Pydub to detect and trim the silence.
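To make the idea concrete without pulling in Pydub, here is a rough pure-Python sketch of trimming leading silence. It is not the cookbook's code: it assumes samples are floats in [-1, 1], and the -40 dBFS threshold, the 10 ms chunk size, and both function names are illustrative guesses.

```python
import math

def leading_silence_samples(samples, sample_rate, threshold_db=-40.0, chunk_ms=10):
    """Return how many leading samples are quieter than threshold_db (dBFS)."""
    chunk = max(1, sample_rate * chunk_ms // 1000)
    for start in range(0, len(samples), chunk):
        window = samples[start:start + chunk]
        rms = math.sqrt(sum(s * s for s in window) / len(window))
        level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
        if level_db >= threshold_db:
            return start  # first chunk loud enough to count as sound
    return len(samples)  # the whole input is below the threshold

def trim_leading_silence(samples, sample_rate, **kwargs):
    """Drop the silent prefix found by leading_silence_samples."""
    return samples[leading_silence_samples(samples, sample_rate, **kwargs):]
```

A loudness threshold like this only removes quiet passages. It cannot tell speech from loud non-speech such as music, which is where a proper VAD model comes in.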

(Official OpenAI quote)


What's VAD?


Voice Activity Detection (it predicts whether a short clip contains speech, e.g. to mute your microphone when you aren't speaking).
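As a toy illustration of the microphone-muting use case, here is a crude energy gate in Python. Real VADs are trained models that recognize speech specifically; this sketch only checks loudness per frame, and the threshold, the hangover length, and the name `gate_frames` are all made up.

```python
def gate_frames(frames, threshold=0.02, hangover=2):
    """Return one bool per frame: True means keep the microphone open.

    Each frame is a sequence of float samples in [-1, 1]. The gate stays
    open for `hangover` frames after the last loud frame, so that short
    pauses between words do not cut the speaker off.
    """
    decisions = []
    frames_since_active = hangover + 1  # start with the gate closed
    for frame in frames:
        # Root-mean-square level of the frame as a loudness estimate.
        energy = (sum(s * s for s in frame) / len(frame)) ** 0.5 if frame else 0.0
        if energy >= threshold:
            frames_since_active = 0
        else:
            frames_since_active += 1
        decisions.append(frames_since_active <= hangover)
    return decisions

quiet, loud = [0.0] * 4, [0.5] * 4
decisions = gate_frames([quiet, loud, quiet, quiet, quiet, quiet])
```

With the default hangover of 2, the gate opens on the loud frame, stays open for the next two quiet frames, then closes again.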



