I used it like the sibling commenter did, to get subtitles for downloaded videos. My hearing is bad. Whisper seems much better than YouTube's built-in auto-subtitles, so sometimes it is worth the extra trouble for me to download a video just to generate good subtitles and then watch it offline.
I also used whisper.cpp to transcribe all my hoarded podcast episodes. It took days of my poor old CPU working at 100% on all cores (and then a few shorter runs to transcribe new episodes I have downloaded since). It worked as well as I could possibly hope. Of course it gets the spelling of names wrong, but I don't expect anything (or anyone) to do much better. It is great to be able to run ripgrep to find old episodes on some topic, and sometimes now I read an episode instead of listening, or listen to it in mpv with subtitles.
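For anyone who wants to try the same thing, here is a minimal sketch of the workflow. It uses the openai-whisper Python package rather than whisper.cpp, and the folder name and model size are placeholders I made up, but the idea is the same: one .srt per audio file, ready for ripgrep and mpv.

    # Sketch only: batch-transcribe audio files to .srt with the openai-whisper
    # Python package (the comment above used whisper.cpp; the output is equivalent).
    # The "podcasts/" folder and the model size are assumptions, not the original setup.
    from pathlib import Path
    import whisper

    def srt_timestamp(seconds: float) -> str:
        # SRT timestamps are HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    model = whisper.load_model("small")  # larger models are more accurate but slower

    for audio in sorted(Path("podcasts").glob("*.mp3")):
        result = model.transcribe(str(audio))
        srt_path = audio.with_suffix(".srt")
        with srt_path.open("w", encoding="utf-8") as f:
            for i, seg in enumerate(result["segments"], start=1):
                f.write(f"{i}\n")
                f.write(f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n")
                f.write(seg["text"].strip() + "\n\n")
        print(f"wrote {srt_path}")

After that, a plain ripgrep over the podcasts folder finds episodes by topic, and mpv should pick up the matching .srt automatically when you play the audio file.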
Start playing a YouTube video in the browser, select "start capture" in the extension, and it starts writing subtitles in white text on a black background below the video. When you stop capturing you can download the subtitles as a standard .srt file.
Aside from accessibility as mentioned, you can catch up on videos that are hours long, orders of magnitude faster than watching at 3-4x playback speed. If you read the transcript in something like Subtitle Edit, you can also click on the relevant parts and replay them.
But transcribing and passably translating everything goes a long way too. Even if you can hear what's being said, it's still less straining when there are captions for it.
Obviously one important factor in the convenience is how fast your computer is at transcription or translation. I don't currently use these features in real time myself, although I'd like to if other software comes along with a great UX.
There's also a great podcast app opportunity here I hope someone seizes.
As a hard of hearing person, I can now download any video from the internet (e.g. YouTube) and generate subtitles on the fly, without having to struggle to understand badly recorded or unintelligible speech.
Because it can use the full set of information in the audio - people with hearing difficulties cannot. Also interesting: people with perfectly functional hearing but who have "software" bugs (e.g. I find it extremely hard to process voices with significant background noise) can also benefit :)
I have that issue as well - I can hear faint noises OK but if there's background noise I can't understand what people say. But I'm pretty sure there's a physical issue at the root of it in my case. The problem showed up after several practice sessions with a band whose guitarist insisted on always playing at full volume.
I'd love your thoughts on why it might be hardware. I reason that my hearing is generally fine - there's no issue picking apart loud complex music (I love breakcore!).
But play two songs at the same time, or try talking to me with significant background noise, and I seem to be distinctly impaired vs. most others.
If I concentrate, I can sometimes work through it.
My uninformed model is a pipeline of sorts, where some pre-processing stage isn't turned on, so the stuff after it has a much harder job.
I don't have much beyond what I said. It happened to me after repeated exposure to dangerously loud sounds in a small room. I can hear faint sounds, but I have trouble with strong accents and I can't understand words if there's a lot of background noise. I noticed it shortly after I left that band, and I left because the last practice was so loud it felt like a drill boring into my ears.
I don't think I have any harder time appreciating complex music than I did before, but I'm more of a 60s-70s rock kinda guy and a former bass player, so I tend to focus more on the low end. Bass tends to be less complex because you can't fit as much signal into the waveform without getting unpleasant muddling.
And of course, just because we have similar symptoms doesn't mean the underlying causes are the same. My grandfather was hard of hearing so for all I know it's genetic and the timing was a coincidence. Who knows?
It seems to me your ability to discriminate has been impacted.
I have always pictured it working this way:
In the cochlea, we have all the fine hair-like sensors. The spread of them determines our range of frequencies, and this declines with age. Usually not too much, but it could be as much as half, down to 10 to 12 kHz.
The good news is that all the good stuff we crave is below 10 kHz. Don't sweat age-related hearing loss too much.
The number of these sensors determines our ability to hear concurrent sounds, or complexity.
The shape of them impacts how loud sounds need to be to be heard.
Chances are, your loud exposure had harmonics that impacted many of these sensing hairs, but not in one place. The result is a loss of discrimination of concurrent sounds.
There are plenty to cover the frequency range, so things do not seem muffled or low. Their shape is good, not worn, so you hear faint sounds well.
The lower number of them is the issue. Or, they are still there, just bent -- something prevents them from contributing.
Another way to think of this is in reverse:
Say you had 30 oscillators you could start at any frequency and time. How complex of a sound could you make? Now cut that in half.
You say issue, I say feature. It's a great way to just ignore boring babbling at parties or other social engagements where you're just not that engaged. Sort of like selective hearing in relationships, but used on a wider audience.
I don’t mean to speak for OP, but it strikes me as rude to make light of someone’s disability in this way. I’d guess it has caused them a lot of frustration.
Your assumption leads you to believe that I do not also suffer from the same issue. Ever since I was in a t-bone accident and the side airbag went off right next to my head, I have a definite issue hearing voices in crowded and noisy rooms with poor sound insulation. Some rooms are much worse than others.
So when I say I call it a feature, it's something I actually deal with unlike your uncharitable assumption.
Sometimes, late at night when I'm trying to sleep, and I hear the grumble of a Harley, or my neighbors staggering to their door, I wonder: why do we not have earflaps, like we do eyelids?
The definition of "unintelligible" varies by person, especially by accent. Like, I got no problem with understanding the average person from Germany... but someone from the deep backwaters of Saxony, forget about that.
I don't know about much better, but I like Whisper's ability to subtitle foreign language content on YouTube that (somehow) doesn't have auto-generated subs. For example some relatively obscure comedy sketches from Germany where I'm not quite fluent enough to go by ear.
10 years ago you'd be searching through random databases to see if someone had synchronized subtitles for the exact copy of the video that you had. Or take older lecture videos that don't have transcripts: many courses had to provide them to comply with federal funding, but not all, and lots of international courses don't have this requirement at all (for example some great introductory CS/maths courses from German and Swiss institutions). Also think about taking this auto-generated output and then generating summaries for lecture notes, reading recommendations - this sort of stuff is what LLMs are great at.
You can do some clever things like take the foreign sub, have Whisper also transcribe it, and then ask a big model like Gemini to go line by line and check the translation to English. This can include accounting for common transcription errors or idiomatic differences between languages. I do it in Cursor to keep track of what the model has changed and for easy rollback. It's often good enough to correct mis-heard words that would be garbled by a cheaper model. And you can even ask the model why a particular translation was made and what would be a more natural way to say the same thing. Sometimes it even figures out jokes. It's not a fast or fully automatic process, but the quality can be extremely good if you put some time into reviewing.
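The line-by-line pairing step can be as simple as the sketch below. It assumes you already have the original-language .srt and a rough English .srt with matching cue numbers (the file names here are made up), and it just writes an interleaved file you can hand to Gemini or open in Cursor for review.

    # Sketch: interleave a foreign-language .srt with a rough English .srt so a
    # big model (or a human) can check the translation cue by cue.
    # File names are hypothetical; both files must share the same cue numbering.
    import re
    from pathlib import Path

    CUE = re.compile(
        r"(\d+)\s*\n"
        r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n"
        r"(.*?)(?:\n\n|\Z)",
        re.S,
    )

    def parse_srt(path: str) -> dict[int, str]:
        text = Path(path).read_text(encoding="utf-8")
        return {int(n): body.strip().replace("\n", " ")
                for n, _start, _end, body in CUE.findall(text)}

    source = parse_srt("sketch.de.srt")   # original language
    english = parse_srt("sketch.en.srt")  # rough machine translation

    with open("review.txt", "w", encoding="utf-8") as out:
        out.write("Check each EN line against the DE line; flag likely "
                  "mis-hearings and unidiomatic phrasing.\n\n")
        for n in sorted(source):
            out.write(f"[{n}] DE: {source[n]}\n")
            out.write(f"[{n}] EN: {english.get(n, '(missing)')}\n\n")

Keeping the cue numbers in the prompt makes it easy to apply the model's corrections back to the right lines of the .srt afterwards.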
Having 90% of this be possible offline/open access is also very impressive. I've not tried newer OSS models like Qwen3, but I imagine it'd do a decent job of the cleanup.
I forget which package I used, but it runs in Docker and can output a sub file directly (and it can auto-translate). Usually I generate the native language + English to compare, since the native generally has better transcription, but it helps the models if they have a decent translation to start from.