I know nothing about Whisper, is this usable for automated translation? I own a ...

ethan_smith · 2025-08-13T12:53:03 1755089583

Whisper can indeed transcribe Japanese and translate it to English, though quality varies by dialect and audio clarity. You'll need the "large-v3" model for best results, and you can use ffmpeg's new integration with a command like `ffmpeg -i movie.mp4 -af whisper=model=large-v3:task=translate output.srt`.

waltbosz · 2025-08-13T13:38:51 1755092331

I wonder how the results of an AI Japanese-audio-to-English-subtitles would compare to a fansub-ed anime. I'm guessing it would be a more literal translation vs. contextual or cultural.

I found an interesting article about trollsubs, which I guess are fansubs made with a contemptuous flare. https://neemblog.home.blog/2020/08/19/the-lost-art-of-fan-ma...

Tangent: I'm one of those people who watch movies with closed captions. Anime is difficult because the subtitle track is often the original Japanese-to-English subtitles and not closed captions, so the text does not match the English audio.

chazeon · 2025-08-13T13:48:41 1755092921

I do japanese transcription + gemini translations. It’s worse than fansub, but its much much better than nothing. First thing that could struggle is actually the vad, then is special names and places, prompting can help but not always. Finally it’s uniformity (or style). I still feel that I can’t control the punctuation well.

numpad0 · 2025-08-13T16:22:19 1755102139

I was recently just playing around with Google Cloud ASR as well as smaller Whisper models, and I can say it hasn't gotten to that point: Japanese ASRs/STTs all generate final kanji-kana mixed text, and since kanji:pronunciation is n:n maps, it's non-trivial enough that it currently need hands from human native speakers to fix misheard texts in a lot of cases. LLMs should be theoretically good at this type of tasks, but they're somehow clueless about how Japanese pronunciation works, and they just rubber-stamp inputs as written.

The conversion process from pronunciation to intended text is not deterministic either, so it probably can't be solved by "simply" generating all-pronunciation outputs. Maybe a multimodal LLM as ASR/STT, or a novel dual input as-spoken+estimated-text validation model could be made? I wouldn't know, though. It seemed like a semi-open question.

neckro23 · 2025-08-13T15:53:57 1755100437

In my experience it works ok. The "English" model actually knows a lot of languages and will translate directly to English.

You can also transcribe it to Japanese and use a translator to convert to English. This can sometimes help for more semantically complex dialogue.

For example, using faster-whisper-xxl [1]:

Direct translation:

    faster-whisper-xxl.exe --language English --model large-v2 --ff_vocal_extract mdx_kim2 --vad_method pyannote_v3 --standard <input>

Use Japanese, then translate:

    faster-whisper-xxl.exe --language Japanese --task translate --model large-v2 --ff_vocal_extract mdx_kim2 --vad_method pyannote_v3 --standard <input>

1. https://github.com/Purfview/whisper-standalone-win

prmoustache · 2025-08-13T11:57:02 1755086222

My personnal experience trying to transcribe (not translate) was a complete failure. The thing would invent stuff. It would also be completely lost when more than one language is used.

It also doesn't understand contexts so does a lot of errors you see in automatic translations from videos in youtube for example.

okdood64 · 2025-08-13T13:31:49 1755091909

It's curious how YouTube's is so bad still given the current state of the art; but it has got a lot better in the last 6 months.

trenchpilgrim · 2025-08-13T11:53:14 1755085994

Whisper has quite bad issues with hallucination. It will inject sentences that were never said in the audio.

It's decent for classification but poor at transcription.

neckro23 · 2025-08-13T15:43:30 1755099810

Pre-processing with a vocal extraction model (bs-rofomer or similar) helps a lot with the hallucinations, especially with poor quality sources.

trenchpilgrim · 2025-08-13T17:11:55 1755105115

I'm working with fairly "clean" audio (voice only) and still see ridiculous hallucinations.

BetterWhisper · 2025-08-13T16:19:11 1755101951

Hey, indeed Whisper can do the transcription of Japanese and even the translation (but only to English). For the best results you need to use the largest model which depending on your hardware might be slow or fast.

Another option is to use something like VideoToTextAI which allows you to transcribe it fast and then translate it into 100+ languages which you can then export the subtitle (SRT) file for

poglet · 2025-08-13T11:35:53 1755084953

Yep, whisper can do that. You can also try whisperx (https://github.com/m-bain/whisperX) for a possibly better experience with aligning of subtitles to spoken words.

_def · 2025-08-13T11:43:36 1755085416

May I ask which movies? I'm just curious