Indeed, with another model I would get persistent transcriptions of silent parts...

rollcat · 2025-07-22T07:45:22 1753170322

When I was taught mathematics, the zero value was always considered the most important edge case. You prove something for N=0 (or N=1), then for N=M+1.

It's even more important in audio DSP: processing near-zeroes can end up being extremely CPU intensive, look up denormal/subnormal floats.

inglor_cz · 2025-07-22T08:12:57 1753171977

Yeah, I studied mathematics (algebra and number theory) and zero is the point, often sporting discontinuities, or weird asymptotic behavior.

Quite a lot of algorithms use some form of division and zero is the only number in our typical structures (Z, Q, R, C), that cannot be used to divide with.

edwcross · 2025-07-22T09:34:54 1753176894

In machine integer arithmetics, one must also beware division by -1, which can convert MIN_INT into MIN_INT with a signed overflow and violate some arithmetics invariants, such as sign (negative divided by negative is _usually_ positive).

isoprophlex · 2025-07-22T08:19:02 1753172342

Well, now in this brave new age of AI we can enjoy computer programs crashing with an

    Error: division by please upvote, share and like!

xyproto · 2025-07-22T08:29:14 1753172954

This also works; I upvoted your comment.

o1bf2k25n8g5 · 2025-07-22T09:28:06 1753176486

I have discovered a truly marvelous proof of how to smash that like and subscribe button, which this comment box is too small to contain.

msopena · 2025-07-22T09:33:05 1753176785

Signed by Pierre de FermAIt

Bluestein · 2025-07-22T10:51:51 1753181511

KeplerBoy · 2025-07-22T10:35:43 1753180543

Denormals are flushed to zero by default on most GPUs by the way.

rollcat · 2025-07-23T10:08:24 1753265304

Makes total sense, execution time is bounded. The point is it's still a case you must consider (what if near-zero is distinct from zero and significant?)

wahnfrieden · 2025-07-22T08:24:33 1753172673

whisper MUST be combined with silence detection / VAD

pferde · 2025-07-22T08:44:14 1753173854

Ah, the good old "you're holding it wrong".

What good is a speech recognition tool that literally hears imaginary voices?

zettabomb · 2025-07-22T10:27:23 1753180043

Considering that if you DO use VAD (voice activity detection), it's the best open weights voice recognition model by a very wide margin, it's quite good. I'd be willing to be that commercial products that "don't have this problem" are using VAD as well, and that this is well known to them. But Whisper is just the weights, and I suppose a simple reference implementation, not a full product.

bmacho · 2025-07-22T10:46:58 1753181218

> What good is a speech recognition tool that literally hears imaginary voices?

Well, if it is supposed to work after silence detection, then it is good for speech recognition I guess. It's like blaming a wheel why is it circular, you can't sit on it. It's a part of a larger machine.

dumbfounder · 2025-07-22T13:05:06 1753189506

Just lay the wheel on its side and it makes a fine seat.

nhecker · 2025-07-22T16:50:26 1753203026

>imaginary voices

On the other hand, I can imagine that when things get quiet and the signal-to-noise ratio gets close to zero, random background audio (or randomness introduced in the transcription model) will be enough to tickle a critical number of neurons and elicit hallucinations.

The related thought exercise is this: Try scanning across the band with an AM or sideband radio, and after a while your brain will start to wonder "was that a voice I just heard, or music perhaps?" when in reality it was just environmental static.

wahnfrieden · 2025-07-22T11:28:06 1753183686

Yes, you are holding it wrong. The good of it is that it does not output imaginary voices when used with VAD.

Show us a technology with better results that does not use VAD. If you can’t, then I’m not sure what you’re arguing against except superficialities so inconsequential that I can’t comprehend the condescension. The results speak for itself

Xmd5a · 2025-07-22T09:05:46 1753175146

faster-whisper has a min_silence_duration_ms option

wahnfrieden · 2025-07-22T11:29:37 1753183777

There are much higher quality VAD solutions available

DANmode · 2025-07-23T23:30:26 1753313426

Please name a couple to get someone started who's hacking on webapps?

I'd really appreciate it.

DANmode · 2025-07-24T03:23:57 1753327437

(as would future readers, I'm sure)

DANmode · 2025-07-24T05:50:36 1753336236

https://github.com/ten-framework/ten-vad

wahnfrieden · 2025-07-24T08:15:10 1753344910

I last used silero but haven’t kept up with stage of the art so didn’t mention it

xandrius · 2025-07-22T11:52:39 1753185159

So if a tool has a process to have it perform at its best then it's a problem?

Do you also moan that before applying glue to a surface or it won't stick? Or if you need to drill a guiding hole before making a larger one in wood? Or that you need to use truly prime numbers for a security key to actually be safe?

DANmode · 2025-07-22T22:00:10 1753221610

What's a good starter VAD lib, and if you know, the best implementation of something like this to use in a browser-based app?

Say if I wanted to use it for Voice Nav, or Voice Input, but not piss off random people speaking the wrong language.

cmiles74 · 2025-07-22T12:31:08 1753187468

If that's truly the case then they should make it part of the product, IMHO.

wahnfrieden · 2025-07-22T16:23:28 1753201408

How is it not the case? It is unusable without VAD or editing. I don't understand what you're questioning

I agree their products could be better "end to end" integrated. Meanwhile there is a continuously-improving field of work for detecting speech (which Whisper is incapable of). They offer official "cookbooks" with guidance on an approach they recommend: https://cookbook.openai.com/examples/whisper_processing_guid...

> At times, files with long silences at the beginning can cause Whisper to transcribe the audio incorrectly. We'll use Pydub to detect and trim the silence.

(Official OpenAI quote)

DANmode · 2025-07-22T09:06:24 1753175184

What's VAD?

maxbond · 2025-07-22T09:11:13 1753175473

Voice Activity Detection (it predicts whether a short clip contains speech, eg to mute your microphone when you aren't speaking).