Nemotron 3 Nano 30B A3B
Nemotron 3 Nano 30B A3B is a sparse hybrid Mamba-Transformer mixture-of-experts (MoE) model with 30B total parameters but only 3B active per token. It supports a 262.1K-token context window and delivers throughput closer to a 3B dense model than a 30B one.
```ts
import { streamText } from 'ai';

const result = streamText({
  model: 'nvidia/nemotron-3-nano-30b-a3b',
  prompt: 'Why is the sky blue?',
});
```

Frequently Asked Questions
Why does "30B total, 3B active" matter for inference cost?
You pay for compute proportional to the active parameters, not the total. Nemotron 3 Nano 30B A3B runs at speeds and costs closer to a 3B dense model but draws on 30B parameters of learned knowledge. The MoE routing mechanism selects the relevant subset per token.
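To make the routing idea concrete, here is a toy sketch of top-k expert selection, not the model's actual implementation: a router scores every expert for the current token, only the k highest-scoring experts run, and their outputs are blended by normalized router weights. The names (`Expert`, `moeLayer`) and shapes are illustrative assumptions.

```ts
// Toy top-k mixture-of-experts routing, for illustration only.
type Expert = (x: number[]) => number[];

function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function moeLayer(x: number[], experts: Expert[], routerLogits: number[], k: number): number[] {
  // Rank experts by router score and keep the top k for this token.
  const topK = routerLogits
    .map((logit, index) => ({ logit, index }))
    .sort((a, b) => b.logit - a.logit)
    .slice(0, k);
  const weights = softmax(topK.map((e) => e.logit));

  // Only the selected experts execute, so per-token compute scales with k,
  // not with the total number of experts (the "3B active vs 30B total" effect).
  const output = new Array(x.length).fill(0);
  topK.forEach((e, j) => {
    const y = experts[e.index](x);
    for (let d = 0; d < output.length; d++) output[d] += weights[j] * y[d];
  });
  return output;
}

// Example: eight tiny experts, route each token's input to the top two.
const experts: Expert[] = Array.from({ length: 8 }, (_, i): Expert => (x) => x.map((v) => v * (i + 1)));
console.log(moeLayer([1, 2, 3], experts, [0.1, 2.3, -1.0, 0.5, 1.8, -0.2, 0.0, 0.9], 2));
```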
How does the Mamba architecture enable the 262.1K-token context window?
Mamba layers process sequences with linear-time complexity rather than the quadratic scaling of standard attention. That makes it practical to hold 262.1K tokens in context without the memory explosion that would make pure-attention models infeasible at that length.
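As a rough illustration of the scaling difference (a scalar toy, not the actual selective-scan kernel), a state-space layer carries one recurrent state forward and does constant work per token, while standard attention builds a score for every pair of tokens:

```ts
// Mamba-style recurrence: one state update per token -> O(n) in sequence length.
function ssmScan(x: number[], a: number, b: number, c: number): number[] {
  let h = 0; // recurrent hidden state carried across the sequence
  return x.map((xt) => {
    h = a * h + b * xt; // constant work per token
    return c * h;
  });
}

// Standard self-attention: a score for every token pair -> O(n^2).
function attentionScores(x: number[]): number[][] {
  return x.map((q) => x.map((k) => q * k));
}
```

At 262,144 tokens, the pairwise score matrix alone would hold roughly 68.7 billion entries, which is why linear-time layers matter at this context length.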
How does Nemotron 3 Nano 30B A3B differ from Nemotron Nano 9B v2?
They use different architectures. Nemotron 3 Nano 30B A3B is a sparse MoE with 30B total/3B active parameters and a context window of 262.1K tokens. Nemotron Nano 9B v2 is a dense 9B model with a 128K-token context window. Choose Nemotron 3 Nano 30B A3B when you need long-context throughput across multi-agent systems, and Nemotron Nano 9B v2 when you want a compact reasoning model.
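If you route both models through AI Gateway, switching between them is just a change of model string. A minimal sketch, assuming the Nemotron Nano 9B v2 slug shown below (check the catalog for the exact identifier) and a hypothetical selection flag:

```ts
import { streamText } from 'ai';

// Hypothetical flag: true when the workload needs long-context, multi-agent throughput.
const needsLongContext: boolean = true;

const model = needsLongContext
  ? 'nvidia/nemotron-3-nano-30b-a3b'
  : 'nvidia/nemotron-nano-9b-v2'; // assumed slug for Nemotron Nano 9B v2

const result = streamText({ model, prompt: 'Why is the sky blue?' });
```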
Where are hosted input and output prices listed?
Current pricing is shown on this page. AI Gateway routes across providers, and rates may vary by provider.