NVIDIA (NVDA, Financial) has launched its latest speech recognition model, Parakeet-TDT-0.6B-v2, which has quickly risen to the top of Hugging Face's Open ASR Leaderboard. The 600-million-parameter model can transcribe one hour of audio in about one second, setting a new benchmark for accuracy and speed in automatic speech recognition.
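To put the speed claim in concrete terms, transcription throughput is often expressed as an inverse real-time factor: seconds of audio processed per second of compute. The sketch below is illustrative arithmetic only; the function name is hypothetical, and the figures are the article's rounded numbers, not a measured benchmark.

```python
def inverse_real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many seconds of audio are transcribed per second of compute.

    Hypothetical helper for illustration; not part of any NVIDIA library.
    """
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# One hour of audio (3600 s) transcribed in roughly one second,
# using the article's rounded figures.
speedup = inverse_real_time_factor(audio_seconds=3600, processing_seconds=1)
print(f"About {speedup:.0f}x faster than real time")
```

Throughput of this kind typically depends on the GPU and on how many audio files are batched together, so a single-file run on modest hardware should not be expected to match it.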
Parakeet-TDT-0.6B-v2 is open-source under the Creative Commons CC-BY-4.0 license, so developers and businesses can use it freely, including commercially, which lowers entry barriers and development costs. The model pairs a Fast Conformer encoder with a TDT decoder and is optimized for NVIDIA GPUs, including the A100, H100, T4, and V100. It can also be loaded on systems with as little as 2GB of RAM, enabling deployment at both small and large enterprises.
The model leads the leaderboard with an average Word Error Rate (WER) of 6.05%, and it reaches 1.69% on the LibriSpeech clean test set. It currently supports English only. NVIDIA emphasizes that it did not use personal data during development, in keeping with its responsible AI practices and internal quality standards.
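For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that definition; the function name and the simple lowercase-and-split normalization are assumptions, and leaderboard evaluations apply their own text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper only; normalization here is just lowercasing
    and whitespace splitting (an assumption, not the leaderboard's rules).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER: {wer:.2%}")  # prints "WER: 16.67%"
```

On this scale, a WER of 6.05% corresponds to roughly six word errors per hundred reference words.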