Mistral Releases First Open-Source Frontier TTS Model, 17ms First-Audio

Mistral has released its first open-source, frontier-grade text-to-speech model, presented at AI Engineer by AI Scientist Samuel Humeau. The model has approximately 4 billion parameters, achieves 17ms first-audio latency on a single GPU, and uses an autoregressive backbone with flow-matching diffusion heads to generate all 37 codec tokens of each 80ms audio frame simultaneously. Mistral released the inference weights and a set of open voices but withheld the encoder used for voice cloning, reserving that capability for B2B deployments. An open-weight model that competes directly with ElevenLabs and Cartesia on quality represents a structural change in the voice AI market.
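
To put the frame figures in perspective, the arithmetic can be sketched in a few lines of Python. Only the 80ms frame length and the 37 tokens per frame come from the announcement; the function names are illustrative, and the comparison with a one-token-per-step decoder is a hypothetical baseline, not a description of any specific competing system.

```python
import math

# Figures quoted in the article.
FRAME_MS = 80           # duration of one audio frame, in milliseconds
TOKENS_PER_FRAME = 37   # codec tokens generated together for each frame


def frames_per_second(frame_ms: float) -> float:
    """Number of audio frames needed per second of speech."""
    return 1000.0 / frame_ms


def tokens_per_second(frame_ms: float, tokens_per_frame: int) -> float:
    """Codec tokens that make up one second of speech."""
    return frames_per_second(frame_ms) * tokens_per_frame


def frames_needed(duration_s: float, frame_ms: float) -> int:
    """Frames required to cover a clip of the given duration (rounded up)."""
    return math.ceil(duration_s * 1000.0 / frame_ms)


if __name__ == "__main__":
    fps = frames_per_second(FRAME_MS)                    # 12.5
    tps = tokens_per_second(FRAME_MS, TOKENS_PER_FRAME)  # 462.5
    print(f"frames per second of audio: {fps}")
    print(f"codec tokens per second of audio: {tps}")
    # Hypothetical baseline (an assumption for comparison only): a decoder
    # that spends one sequential step per codec token would need `tps`
    # steps per second of audio, versus `fps` steps when a whole frame
    # is produced at once.
    print(f"sequential-step ratio: {tps / fps:.0f}x")
```

Under these numbers, one second of speech corresponds to 12.5 frames and 462.5 codec tokens, so producing a whole frame per step cuts the number of sequential generation steps by a factor of 37 relative to the hypothetical token-by-token baseline.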

Why It Matters

A 17ms first-audio latency at frontier quality, available as open weights, puts production-ready voice agents within reach of any team, opening up a capability that was previously gated behind proprietary APIs. The architecture also plugs directly into existing LLM agent stacks.