NVIDIA Nemotron Nano 12B V2 VL is an open 12B-parameter multimodal reasoning model from NVIDIA, released on December 1, 2024. It handles document intelligence and video understanding tasks, letting agents extract, interpret, and act on information across text, images, tables, and videos with a single model. At launch, NVIDIA highlighted the model's OCRBench v2 results, reflecting document-level optical character recognition (OCR) and structured-extraction capability.
The architecture is a hybrid Mamba-Transformer, following the same design philosophy as the broader Nemotron family but applied to vision-language tasks. For video inputs, the model implements Efficient Video Sampling (EVS). EVS identifies and prunes temporally static patches, reducing token redundancy and enabling up to 2.5x higher throughput on long video clips without accuracy loss.
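The pruning idea behind EVS can be illustrated with a toy sketch: split each frame into patches, and for every frame after the first, drop patches whose content barely changed from the previous frame. This is a simplified illustration of the concept, not NVIDIA's implementation; the patch size and threshold here are arbitrary assumptions.

```python
import numpy as np

def evs_prune(frames, patch=4, tau=0.02):
    """Toy EVS-style pruning (illustrative only, not NVIDIA's algorithm).

    frames: (T, H, W) grayscale video with values in [0, 1].
    Keeps every patch of frame 0; for later frames, drops patches whose
    mean absolute change from the previous frame is below tau.
    Returns a list of (t, row, col) indices of kept patch tokens.
    """
    T, H, W = frames.shape
    kept = []
    for t in range(T):
        for r in range(0, H, patch):
            for c in range(0, W, patch):
                if t == 0:
                    kept.append((t, r, c))  # first frame is always kept whole
                    continue
                diff = np.abs(frames[t, r:r + patch, c:c + patch]
                              - frames[t - 1, r:r + patch, c:c + patch]).mean()
                if diff >= tau:  # only temporally changing patches survive
                    kept.append((t, r, c))
    return kept

# Synthetic clip: static background with one moving bright square.
T, H, W = 8, 16, 16
frames = np.zeros((T, H, W))
for t in range(T):
    col = (2 * t) % W
    frames[t, 4:8, col:col + 4] = 1.0

kept = evs_prune(frames)
total = T * (H // 4) * (W // 4)  # 128 patch tokens before pruning
print(f"kept {len(kept)} of {total} patch tokens")
```

On this mostly static clip, only patches touched by the moving square survive after frame 0, so the token count drops sharply, which is the source of the throughput gain on long videos.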
NVIDIA Nemotron Nano 12B V2 VL runs on the vLLM and TensorRT-LLM (TRT-LLM) inference engines. Embedding and retrieval models in the same family appear on leaderboards such as ViDoRe, MTEB, and MMTEB for visual, multimodal, and multilingual text retrieval. The NVIDIA AI Blueprint for video search and summarization (VSS) is built around this model, making it a practical foundation for production video intelligence pipelines. Announcement and agent-focused context: https://deepinfra.com/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL.
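When served with vLLM's OpenAI-compatible server, a document-intelligence query mixing an image and a text instruction can be expressed as a standard chat-completions payload. The sketch below only builds the request body; the served model id, the example URL, and the prompt are assumptions for illustration, and the actual endpoint details depend on your deployment.

```python
import json

# Assumed served model name; match whatever name your vLLM server registers.
MODEL_ID = "nvidia/NVIDia-Nemotron-Nano-12B-v2-VL".replace("NVIDia", "NVIDIA")

def build_doc_query(image_url: str, question: str) -> dict:
    """Build an OpenAI-style chat-completions payload pairing an image
    with a text question, as accepted by vLLM's multimodal chat API."""
    return {
        "model": MODEL_ID,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
        "max_tokens": 512,
    }

# Hypothetical document-extraction request (URL and prompt are placeholders).
payload = build_doc_query(
    "https://example.com/invoice.png",
    "Extract the invoice total and due date as JSON.",
)
print(json.dumps(payload, indent=2))
```

The same payload shape works for video-frame inputs by attaching multiple `image_url` parts, which is how EVS-pruned frame tokens ultimately reach the model in a serving pipeline.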