NVIDIA's Nemotron 3 Ultra: Open 550B MoE Built for Long-Running Agents

NVIDIA has shipped Nemotron 3 Ultra, a fully open 550B mixture-of-experts model that keeps just 55B parameters active at inference time — delivering 5× faster throughput and up to 30% lower cost than comparable open frontier models. Three independent intelligence batches flagged this as the week's headline open-model event. LangChain simultaneously announced Day-0 Deep Agent support as a Nemotron Coalition founding member, and Hugging Face shipped day-0 integration into Transformers on release day.

What the Source Actually Says

The efficiency gains trace directly to architecture. Nemotron 3 Ultra uses a hybrid Mamba-2 / Transformer design at roughly a 4:1 Mamba-to-Attention ratio — fewer full-attention layers means faster generation for the long-context agentic workloads NVIDIA is explicitly targeting. The model was pretrained on 20 trillion tokens in NVFP4 precision, uses a LatentMoE architecture, and went through two-stage MOPD post-training plus native multi-token prediction for speculative decoding. Context window is 1 million tokens. Both BF16 and NVFP4-quantized weights are available (~350 GB in BF16), with reasoning on/off toggle, tool use, and multilingual support built in.

NVIDIA's release posture is unusually comprehensive: base weights, post-trained weights, reward-model checkpoints, NVFP4 quantized versions, full training data, and training recipes are all published on Hugging Face under the OpenMDW-1.1 license (Linux Foundation). NVIDIA adopted OpenMDW-1.1 simultaneously across its Cosmos, Isaac GR00T, Ising, and Nemotron families — a clear signal that open-model licensing infrastructure is being standardized for the physical and agentic AI stacks in one move. A companion agentic safety dataset also shipped the same day: 1,272 synthetic red-teaming records spanning nine enterprise domains designed to test tool-using agent resistance to indirect prompt injection.

Strategic Take

The Nemotron Coalition's Day-0 formation — LangChain, Hugging Face, pre-staged for launch — suggests NVIDIA is building ecosystem lock-in through open models the same way cloud providers built it through regions. Teams evaluating open-frontier APIs for long-running agentic workloads should benchmark the 5× inference claim against their own token costs; the architectural bet on Mamba-2 efficiency is structural, not just a quantization trick.