I'm so amazed to find out just how close we are to the Star Trek voice computer.
I used Dragon Dictation to draft my first novel, and I had to learn a 'language' to teach the rudimentary engine how to recognize my speech.
And then I discovered [1] and have been using it for some basic speech recognition, amazed at what a local model can do.
But it can't transcribe any text until I finish recording a file, and only then does it start working, so the feedback loop is painfully slow.
And now you've posted this cool solution which streams audio chunks to a model in small pieces as you speak, amazing, just amazing.
Now if I can just figure out how to contribute that kind of streaming speech-to-text to Handy or something similar, local STT will be a solved problem for me.
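For anyone curious, the basic shape of the streaming approach is something like the sketch below. This is just an illustration: the chunk size is made up, sounddevice is only one way to grab mic audio, and transcribe_chunk() is a placeholder for whatever local model you run; it's not Handy's or the posted project's actual code.

    # Chunked "streaming" transcription sketch: feed the model small pieces
    # of audio as they arrive instead of one big recorded file.
    import queue
    import numpy as np
    import sounddevice as sd  # pip install sounddevice

    SAMPLE_RATE = 16_000
    CHUNK_SECONDS = 0.5  # small pieces instead of a whole recording
    audio_q = queue.Queue()

    def on_audio(indata, frames, time_info, status):
        # Called by the audio driver for every chunk; stash a copy for the model.
        audio_q.put(indata.copy())

    def transcribe_chunk(samples: np.ndarray) -> str:
        # Placeholder: run your local STT model (Parakeet, Whisper, ...) on
        # this half second of audio and return the partial text.
        return ""

    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                        blocksize=int(SAMPLE_RATE * CHUNK_SECONDS),
                        callback=on_audio):
        while True:
            chunk = audio_q.get()              # blocks until the next chunk arrives
            partial = transcribe_chunk(chunk)  # feedback after ~0.5 s, not after the whole file
            if partial:
                print(partial, end=" ", flush=True)

The hard part a real implementation has to solve is carrying decoder state (or overlapping context) across chunks so words split at chunk boundaries don't get mangled, but the feedback-latency win comes from exactly this loop.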
Happy to answer questions about this (or work with people on further optimizing the open source inference code here). NVIDIA has more inference tooling coming, but it's also fun to hack on the PyTorch/etc stuff they've released so far.
Thank you for sharing! Does your implementation allow running the Nemotron model on Vulkan? Like whisper.cpp? I'm curious to try other models, but I don't have Nvidia, so my choices are limited.
I’m curious about this too. On my M1 Max MacBook I use the Handy app on macOS with Parakeet V3 and get near-instant transcription; accuracy is slightly below the slower Whisper models, but that drop is immaterial when talking to CLI coding agents, which is where I find the most use for this.
Yeah, I think the multilingual improvements in V3 caused some kind of regression for English. I've noticed large blocks occasionally dropped as well, so I reverted to V2 for my usage. Specifically, nvidia/parakeet-tdt-0.6b-v2 vs nvidia/parakeet-tdt-0.6b-v3.
I didn’t see that, but I do get a lot of stutters (words or syllables repeated 5+ times); not sure if it’s a model problem or a post-processing issue in the Handy app.
Parakeet is really good imo too, and it's just 0.6B, so it can actually run on edge devices. 4B is massive; I don't see Voxtral running in real time on an Orin or fitting on a Hailo. An Orin Nano probably can't even load it at BF16.
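Back-of-the-envelope on the weights alone (illustrative numbers; assuming the 8 GB Orin Nano and ignoring KV cache, activations, and everything else on the box):

    # Rough weight-memory estimate: billions of params * bytes per param ~= GB.
    def weight_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
        return params_billion * bytes_per_param  # BF16 = 2 bytes per parameter

    print(weight_gb(0.6))  # ~1.2 GB for Parakeet 0.6B
    print(weight_gb(4.0))  # ~8.0 GB for a 4B model, i.e. the Nano's entire shared RAM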
Not so obvious, because the model still needs to look up the required doc. The article glosses over this detail a bit, unfortunately. The model needs to decide when to use a skill, but doesn’t it also need to decide when to look up documentation instead of relying on pretraining data?
Removing the skill does remove a level of indirection.
It's a difference of "choose whether or not to make use of a skill that would THEN attempt to find what you need in the docs" vs. "here's a list of everything in the docs that you might need."
I believe the skills would contain the documentation. It would have been nice for them to give more information on the granularity of the skills they created though.
It goes through the "reject all tracking" flow. Other solutions automate clicking "accept all tracking" (since that's usually simpler), or just hide the pop-ups.
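As a toy sketch of the difference (Playwright here purely to illustrate the idea; the extension runs as in-page scripts, and real consent dialogs usually take several steps, not one click):

    import re
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")

        # "Reject" approach: actually walk the opt-out flow.
        reject = page.get_by_role("button", name=re.compile(r"reject( all)?|decline", re.I))
        if reject.count():
            reject.first.click()
        else:
            # "Hide" approach: the banner disappears, but no choice is recorded.
            page.add_style_tag(content="[id*='cookie'], [class*='consent'] { display: none !important; }")

        browser.close()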
Weirdly, I find a higher signal-to-noise ratio in this analogy than in benchmarks these days.
If you let your inner fanboy rest for a moment, you realize Gemini 3, Claude Opus 4.5, and GPT 5.2 are all amazing. If two of them disappeared tomorrow, my AI-assisted productivity wouldn't change.
The 3% difference on benchmark X doesn't mean anything anymore. It's probably more helpful to compare them on character traits instead of numbers.
My one word to describe Claude would be "pleasant". It's just so nice to communicate with. GPT/Codex would be "thorough": it finds and thinks of stuff the others don't. For Gemini 3, the jury is still out. It might be the smart kid on the block that's still a bit rough around the edges, but given that it's a preview, things might change soon.
Mine definitely would. This sounds so clichéd, but Claude (Opus 4.5, but also the others) just "gets how I think" better. I've tried Gemini 3 and GPT 5.2 and didn't like them at all -- not when I know I can have Claude. I mostly code Python + Django, so that could also be part of it.
Gemini 3 has this extremely annoying habit of bleeding its reasoning process into code comments, which are hard to read and not very human-like (they're not "reasoning", they're questioning for the sake of questioning, which I get as part of the process, but not as a comment in the code!). I've seen it do things like this many times:
# Because so and so and so and so we must do x(param1=True, param2=False)
# Actually! No, wait! It is better if we do x(param1=True, param2=True)
x(param1=True, param2=True, param3=False) # This one is even better!
Beyond that, it just does not produce what I consider good Python code. I daily-drove Gemini 2.5 before I realized how good Anthropic's models were (or perhaps before they punched back after 2.5?) and haven't been able to go back.
As for GPT 5.2, I just feel like it doesn't really follow my instructions or my way of thinking. It's dead set on following whatever best practices it has learned, and if I disagree with them, well, tough luck. Plus, and I have no better way of saying this, it's just rude and cold, and I hate it for that.
I recently discovered Claude, and it does much better than Codex or Gemini for Python code.
Gemini seems to lean toward making everything a script, disconnected from the larger vision. Sure, it uses our existing libraries, but the files it writes and the functions it makes can’t be integrated back in.
Codex is fast. Very fast. Which makes it great for a conversational UI and for answering questions about the codebase or proposing alternatives, but when it writes code it’s too clever. The code is valid but not Pythonic. Like inventing one-line functions just to optimize a situation that could have been parameterized in three places (see the toy contrast at the end of this comment).
Claude, on the other hand, makes code that is simple to understand and has enough architecture that you can lift it out and use it as is without too much rewriting.
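A made-up toy contrast of what I mean (not actual output from either model):

    # "Too clever": a one-liner invented just to cover three call sites.
    fmt = lambda v, u=None: f"{v:.2f} {u}" if u else f"{v:.2f}"

    # Versus just parameterizing the plain function that was already there,
    # which is closer to what Claude tends to hand back.
    def format_value(value: float, unit: str | None = None) -> str:
        text = f"{value:.2f}"
        return f"{text} {unit}" if unit else text

    print(fmt(3.14159, "ms"), format_value(3.14159, "ms"))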
We deliberately chose not to use JSON Resume because we wanted greater flexibility. For example, in RenderCV, you can use any section title you want and place any of the 9 available entry types under any section. In contrast, JSON Resume has predefined section titles, and each section is restricted to a predefined entry type. For instance, you must use the experience entry schema under the experience section.
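Roughly the difference, sketched as plain dicts (illustrative structures only, not the exact RenderCV or JSON Resume schemas):

    # RenderCV-style: section titles are arbitrary, and any of the entry types
    # can appear under any section.
    rendercv_like = {
        "sections": {
            "Selected Open Source Work": [                # custom title
                {"name": "RenderCV", "highlights": ["..."]},
            ],
            "Invited Talks": [                            # another custom title, different entry type
                {"title": "...", "date": "2024-05"},
            ],
        },
    }

    # JSON-Resume-style: fixed top-level sections, each tied to one entry schema.
    json_resume_like = {
        "work": [{"name": "...", "position": "..."}],
        "education": [{"institution": "...", "area": "..."}],
    }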
I hear you. This boils down to personal opinion. I would have preferred to use an existing standard rather than introduce yet another one. The custom sections aren't something I've ever seen or needed anyway.