I've been using FFmpeg and Whisper to record and transcribe live police scanner audio for my city, and update it in real-time to a live website. It works great, with the expected transcription errors and hallucinations.
All the "Thanks for watching!" gave me a good chuckle.
Remind me of one of my own experiences with one of the Whisper model, where some random noise in the middle of the conversation was translated into "Don't forget to like and subscribe".
Really illustrate where the training data is coming from.