NVIDIA Nemotron 3 Super 120B A12B

NVIDIA Nemotron 3 Super 120B A12B is NVIDIA's hybrid Mamba-Transformer mixture-of-experts model with 120B total and 12B active parameters, built for complex multi-agent applications and featuring latent MoE routing and multi-token prediction.

index.ts
import { streamText } from 'ai';

const result = streamText({
  model: 'nvidia/nemotron-3-super-120b-a12b',
  prompt: 'Why is the sky blue?',
});

// Consume the stream as tokens arrive.
for await (const textPart of result.textStream) {
  process.stdout.write(textPart);
}

Frequently Asked Questions

  • What is latent MoE and why does it matter?

    Latent MoE compresses token embeddings into a smaller latent space before routing. This reduces per-expert compute cost and lets NVIDIA Nemotron 3 Super 120B A12B consult 4x as many experts for the same inference budget. Distinct experts activate for code generation, SQL logic, and natural language without the overhead of running them all densely.
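    The routing idea above can be sketched in a few lines. This is an illustrative toy, not NVIDIA's implementation: the projection matrices, expert weights, and mixing rule are assumptions chosen to show why routing in a compressed latent space is cheaper than dense expert evaluation.

    ```typescript
    // Toy latent-MoE forward step: project the token embedding down to a
    // smaller latent space, score experts on that latent vector, and run
    // only the top-k experts (never the full dense set).

    type Vec = number[];

    function matVec(m: Vec[], v: Vec): Vec {
      return m.map((row) => row.reduce((acc, w, i) => acc + w * v[i], 0));
    }

    function topK(scores: Vec, k: number): number[] {
      return scores
        .map((s, i) => [s, i] as const)
        .sort((a, b) => b[0] - a[0])
        .slice(0, k)
        .map(([, i]) => i);
    }

    function latentMoE(
      x: Vec,           // token embedding (model dim)
      down: Vec[],      // model dim -> latent dim projection
      router: Vec[],    // latent dim -> per-expert scores
      experts: Vec[][], // per-expert latent dim -> latent dim weights
      k: number,        // number of active experts
    ): Vec {
      const latent = matVec(down, x); // compress BEFORE routing
      const active = topK(matVec(router, latent), k);
      // Average only the active experts' outputs; inactive experts cost nothing.
      const out: Vec = new Array(latent.length).fill(0);
      for (const e of active) {
        matVec(experts[e], latent).forEach((v, i) => (out[i] += v / k));
      }
      return out;
    }
    ```

    Because the router and experts operate on the smaller latent vector, the same compute budget can afford more experts per token, which is the trade latent MoE makes.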

  • What is multi-token prediction and how does it speed up inference?

    MTP trains the model to predict multiple future tokens in a single forward pass. At inference, the MTP heads provide draft tokens that can be verified in parallel, acting as built-in speculative decoding. This delivers wall-clock speedups for structured generation like code and tool calls, without requiring a separate draft model.
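    The draft-and-verify loop can be simulated end to end. In this hedged sketch, `draft` stands in for the cheap MTP heads (assumed to be usually, but not always, right) and `verifyStep` stands in for the single parallel verification pass of the full model; the `next` function is a stand-in target model, not a real API.

    ```typescript
    // Toy self-speculative decoding: MTP heads propose k tokens, one
    // verification pass accepts the longest correct prefix and emits the
    // model's own token at the first mismatch, so output always matches
    // plain greedy decoding -- just in fewer full passes.

    type Token = string;

    // Stand-in target model: deterministic next token for the demo.
    const next = (ctx: Token[]): Token => `t${ctx.length}`;

    // Stand-in MTP heads: propose k drafts cheaply, with occasional misses.
    function draft(ctx: Token[], k: number): Token[] {
      const drafts: Token[] = [];
      const work = [...ctx];
      for (let i = 0; i < k; i++) {
        const guess = work.length % 3 === 2 ? 'wrong' : next(work);
        drafts.push(guess);
        work.push(guess);
      }
      return drafts;
    }

    // One "parallel" verification pass over all drafts.
    function verifyStep(ctx: Token[], drafts: Token[]): Token[] {
      const accepted: Token[] = [];
      const work = [...ctx];
      for (const d of drafts) {
        const truth = next(work);
        if (d !== truth) {
          accepted.push(truth); // mismatch: keep the verified token, stop
          return accepted;
        }
        accepted.push(d);
        work.push(d);
      }
      return accepted;
    }

    function generate(prompt: Token[], steps: number, k: number): Token[] {
      const ctx = [...prompt];
      for (let i = 0; i < steps; i++) {
        ctx.push(...verifyStep(ctx, draft(ctx, k)));
      }
      return ctx.slice(prompt.length);
    }
    ```

    Each verification pass accepts more than one token whenever the drafts are right, which is where the wall-clock speedup comes from; a miss costs nothing extra because the verified token is emitted anyway.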

  • What is the "Super + Nano" deployment pattern?

    NVIDIA describes using Nano for straightforward individual steps in a pipeline and Super for complex decisions requiring deep reasoning. In software development, for example, Nano might handle routine merge requests while Super tackles tasks that require understanding a full codebase. This pattern distributes compute across task difficulty.
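    A minimal router for this pattern might look like the sketch below. The difficulty heuristics and the Nano model identifier are illustrative assumptions, not part of this page; only the Super identifier comes from the example above.

    ```typescript
    // Hypothetical "Super + Nano" router: cheap heuristics send routine
    // pipeline steps to a small model and escalate hard ones to Super.

    interface PipelineStep {
      description: string;
      filesTouched: number;
      needsCrossFileReasoning: boolean;
    }

    function pickModel(step: PipelineStep): string {
      // Assumed escalation rule: cross-file reasoning or a wide blast
      // radius means the step needs deep, codebase-level understanding.
      const hard = step.needsCrossFileReasoning || step.filesTouched > 5;
      return hard
        ? 'nvidia/nemotron-3-super-120b-a12b' // complex decisions
        : 'nvidia/nemotron-nano';             // assumed id for routine steps
    }
    ```

    The chosen id can then be passed as the `model` string in the streamText example above, so the same pipeline code serves both tiers.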

  • What is the context window of 256K tokens used for in multi-agent systems?

    Multi-agent systems generate high token volume (up to 15x that of standard chats) from tool outputs, reasoning steps, and history resent at each turn. A window of 256K tokens lets agents keep full session history, large codebases, and retrieved context in a single pass. This reduces goal drift from context truncation.
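    Even with 256K tokens, agent frameworks typically enforce the budget explicitly. This sketch assumes per-turn token counts are already known and drops the oldest turns first; the interface and trimming policy are illustrative, not a documented API.

    ```typescript
    // Illustrative context-budget check for an agent loop: keep as much
    // history as fits in the 256K window, newest turns first.

    const CONTEXT_WINDOW = 256_000;

    interface Turn {
      role: 'user' | 'assistant' | 'tool';
      tokens: number; // assumed precomputed by a tokenizer
    }

    function fitToWindow(turns: Turn[], reservedForOutput: number): Turn[] {
      const budget = CONTEXT_WINDOW - reservedForOutput;
      const kept: Turn[] = [];
      let used = 0;
      // Walk newest-to-oldest so recent context survives any truncation.
      for (let i = turns.length - 1; i >= 0; i--) {
        if (used + turns[i].tokens > budget) break;
        used += turns[i].tokens;
        kept.unshift(turns[i]);
      }
      return kept;
    }
    ```

    A large window makes the `break` branch rare, which is exactly the goal-drift protection the answer above describes: the agent seldom has to discard the turns that defined its objective.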

  • Where are hosted per-token prices?

    Rates are listed on this page. They reflect the providers that route through AI Gateway and change whenever those providers update their pricing.