NVIDIA Nemotron 3 Nano Omni now available on Crusoe Managed Inference
NVIDIA Nemotron 3 Nano Omni and the full Nemotron 3 model family are now available on Crusoe Managed Inference. Here's a breakdown of what each model is built for and how to get started in Crusoe Intelligence Foundry.
NVIDIA Nemotron open models empower developers to build specialized AI agents with leading efficiency and accuracy. NVIDIA Nemotron™ 3 Nano Omni is now available on Crusoe Managed Inference.
This new addition expands the family of NVIDIA Nemotron 3 open models that you can consume via fully managed endpoints from within the Crusoe Intelligence Foundry. Because Crusoe is an NVIDIA Cloud Partner offering day-zero support, you get access to new models as soon as they're released. We handle the optimization and the infrastructure complexities so you can build and ship faster.
Here's a breakdown of what’s available, what each model is built for, and the benefits of using these models on Crusoe Managed Inference.
Nemotron 3 Nano Omni
Open, multimodal reasoning across video, audio, and documents
NVIDIA Nemotron 3 Nano Omni is an open multimodal foundation model that unifies reasoning across video, audio, images, documents, and text, simplifying agentic AI development with leading efficiency and accuracy.
Built on a 30B-A3B-parameter Mixture of Experts (MoE) model with ~3B active parameters per forward pass and a 256K-token context window, Nemotron 3 Nano Omni is designed for the reality that enterprise data doesn't live in text alone: it spans PDFs with embedded charts, screen recordings, scanned contracts, and voice memos.
Best for production use cases including:
- Enterprise document intelligence
- GUI-based agents
- Video and audio reasoning
- Multimodal RAG
Nemotron 3 Nano Omni combines the NVIDIA Parakeet speech encoder for audio transcription, C-RADIO for visual and document reasoning, and a dedicated GUI-trained visual system for computer-use agents, enabling it to understand and act across interfaces, not just text. Efficient Video Sampling (EVS) and 3D convolution layers further optimize video reasoning by reducing redundant computation and enabling efficient temporal understanding across long video inputs.
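If you want to try the model programmatically, here's a minimal sketch of a multimodal request, assuming Crusoe's managed endpoints expose an OpenAI-compatible chat completions API. The base URL, model identifier, and image URL below are placeholders, not confirmed values; check your Crusoe Intelligence Foundry console for the real ones.

```python
# A minimal multimodal request sketch, assuming an OpenAI-compatible
# chat completions endpoint. Base URL and model ID are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-crusoe-endpoint>/v1",  # placeholder
    api_key="<your-api-key>",                      # placeholder
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni",  # hypothetical model ID
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize the chart embedded in this scanned invoice."},
                {"type": "image_url", "image_url": {"url": "https://example.com/invoice.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```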
Using this model, an AI system can achieve 9x higher throughput than other open omni models at the same level of interactivity, resulting in lower cost and better scalability without sacrificing responsiveness.
Crusoe's inference engine with MemoryAlloy™ technology is purpose-built for workloads like these. At 256K tokens of shared multimodal context, managing cached visual, audio, and document state across agent turns is a hard infrastructure challenge. MemoryAlloy technology handles this efficiently without recomputation, keeping latency low and throughput high. Try Nemotron 3 Nano Omni in the Crusoe Intelligence Foundry.
NVIDIA Nemotron 3 Nano
High-accuracy reasoning at minimal inference cost
NVIDIA Nemotron 3 Nano is a highly compute-efficient, high-performance 30B-parameter LLM. Utilizing a sparse Mixture-of-Experts (MoE) architecture, it activates ~3.2B parameters per forward pass, providing robust reasoning with lower inference cost.
Best for production use cases including:
- Targeted agentic tasks
- High-volume automation
- Cost-sensitive pipelines
For multi-agent systems, Nemotron 3 Nano is an excellent choice for executing targeted, individual steps within an agentic workflow, such as simple merge requests, single-step tool calls, or discrete classification tasks. If you're running dozens or hundreds of parallel agents and need maximum tokens per second at the lowest cost, Nemotron 3 Nano is the right starting point.
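As a concrete sketch, here's what one of those discrete, single-step tool calls might look like, again assuming an OpenAI-compatible endpoint. The classify_ticket tool, base URL, and model identifier are illustrative assumptions, not a documented API.

```python
# A single-step tool call -- the kind of targeted agentic task
# Nemotron 3 Nano is suited for. Tool schema follows the OpenAI
# function-calling format; names below are hypothetical.
from openai import OpenAI

client = OpenAI(base_url="https://<your-crusoe-endpoint>/v1", api_key="<your-api-key>")

tools = [{
    "type": "function",
    "function": {
        "name": "classify_ticket",  # hypothetical tool
        "description": "Assign a support ticket to a queue.",
        "parameters": {
            "type": "object",
            "properties": {"queue": {"type": "string", "enum": ["billing", "tech", "sales"]}},
            "required": ["queue"],
        },
    },
}]

resp = client.chat.completions.create(
    model="nvidia/nemotron-3-nano",  # hypothetical model ID
    messages=[{"role": "user", "content": "My invoice was charged twice this month."}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```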
Crusoe’s inference engine dramatically reduces KV cache overhead so you can run Nemotron 3 Nano with higher concurrent request throughput without sacrificing accuracy. Try Nemotron 3 Nano in the Crusoe Intelligence Foundry.
NVIDIA Nemotron 3 Super
Multi-agent reasoning at enterprise scale
NVIDIA Nemotron 3 Super is an advanced model with 120B total parameters and 12B active parameters, optimized for demanding multi-agent applications like software development and cybersecurity triage. Its design prioritizes compute efficiency, performance, and accuracy. The model features a hybrid Mamba-Transformer Latent MoE architecture that provides a practical 1M-token context window. The integration of Mamba layers is key to maintaining a manageable memory footprint, allowing agents to reason effectively over large volumes of data, such as entire codebases, extensive conversation histories, or numerous retrieved documents.
Best for production use cases including:
- Complex multi-step agents
- Software development
- IT automation
- RAG over long documents
Multi-token prediction (MTP) layers in Nemotron 3 Super deliver over 50% higher token-generation throughput compared to leading open models, while Latent MoE enables calling four experts for the inference cost of only one. It was trained across 10+ reinforcement learning environments, with strong benchmark performance on AIME 2025, TerminalBench, and SWE-Bench Verified.
In a software development context, simple merge requests can be handled by Nemotron 3 Nano, while complex tasks requiring deeper understanding of a codebase are ideal for Nemotron 3 Super. Expert-level coding tasks can be escalated to proprietary frontier models. This tiered architecture is exactly the pattern that drives cost-efficient, high-accuracy agentic pipelines.
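Here's a minimal sketch of that tiered routing pattern, with a hypothetical complexity score standing in for whatever signal your orchestrator actually uses. The thresholds and model identifiers are illustrative assumptions, not a prescribed policy.

```python
# Tiered routing sketch: cheap, targeted steps go to Nano, codebase-level
# reasoning goes to Super, expert-level work escalates to a frontier model.
# The scoring heuristic and model IDs are illustrative assumptions.
def route_model(task_complexity: float) -> str:
    """Pick a model tier from a 0-1 complexity score (hypothetical heuristic)."""
    if task_complexity < 0.3:
        return "nvidia/nemotron-3-nano"   # simple merge requests, classification
    if task_complexity < 0.8:
        return "nvidia/nemotron-3-super"  # multi-file refactors, deep code context
    return "frontier-model"               # expert-level escalation (placeholder)

print(route_model(0.2))  # -> nvidia/nemotron-3-nano
```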
Crusoe’s inference engine with MemoryAlloy technology is particularly impactful for Nemotron 3 Super. Long-context workloads with 100K-1M token histories see the greatest benefit from intelligent KV cache reuse, reducing redundant computation and lowering latency per request. Try Nemotron 3 Super in the Crusoe Intelligence Foundry.
NVIDIA Nemotron 3 VoiceChat
Real-time, full-duplex voice AI
Nemotron 3 VoiceChat is a 12B-parameter, end-to-end speech-to-speech model. Its key innovation is a single, unified architecture that performs streaming speech understanding and speech generation for real-time, full-duplex communication. This design eliminates the latency and potential failure points associated with the traditional cascade of Automatic Speech Recognition (ASR), a Large Language Model (LLM), and Text-to-Speech (TTS) components.
Nemotron 3 VoiceChat is designed for extremely low latency, targeting sub-300ms end-to-end response times by processing 80ms audio chunks faster than real time. This is achieved through a single, unified architecture instead of multiple separate models. The integrated approach means fewer handoffs and simpler deployment, which leads to more natural and dynamic conversations, including better handling of pauses, backchanneling, and smooth turn-taking.
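To make the 80ms chunking concrete, here's a hedged client-side sketch assuming 16 kHz, 16-bit mono PCM input. The WebSocket URL and message framing are hypothetical placeholders, not the actual VoiceChat protocol; consult the endpoint documentation for the real interface.

```python
# Hypothetical streaming client: endpoint URL and framing are placeholders.
import asyncio
import websockets  # pip install websockets

SAMPLE_RATE = 16_000   # Hz (assumed input format)
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHUNK_MS = 80          # chunk size cited above
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 2,560 bytes

async def stream_audio(pcm_stream):
    # pcm_stream is any file-like object yielding raw PCM bytes
    async with websockets.connect("wss://<voicechat-endpoint>") as ws:  # placeholder
        while chunk := pcm_stream.read(CHUNK_BYTES):
            await ws.send(chunk)  # push one 80 ms frame
            # In a full-duplex session, generated audio would arrive
            # concurrently via ws.recv() on a separate task.

# asyncio.run(stream_audio(open("speech.pcm", "rb")))
```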
Best for use cases including:
- Voice-native customer service
- Healthcare agents
- Financial services interactive voice response (IVR)
- Gaming NPCs
- Telecommunications
On the Artificial Analysis Speech-to-Speech leaderboard, Nemotron 3 VoiceChat is the only open-weights model to land in the top three on both conversational dynamics and speech reasoning simultaneously, making it the Pareto leader across both dimensions.
For voice AI workloads, Crusoe’s inference engine reduces the latency tail that makes or breaks user experience, keeping response times consistently low even under concurrent load. Try Nemotron 3 VoiceChat in the Crusoe Intelligence Foundry.
All four models are available now for serverless inference. Here's how to get started.
Get started with Crusoe Intelligence Foundry
Go from experimentation to production without spinning up infrastructure, wrestling with model weights, or filing a ticket.
Try models instantly before you build. Crusoe Intelligence Foundry includes an interactive playground where you can prompt any Nemotron 3 model directly. Test response quality, tune system prompts, and validate that the model fits your use case before writing a single line of code. Whether you're evaluating Nemotron 3 Nano for a high-volume classification task or exploring Nemotron 3 Super's reasoning on a complex multi-step workflow, you can see real outputs in real time.
Generate your API key in seconds. Once you're ready to integrate, Crusoe Intelligence Foundry lets you generate an API key and endpoint directly from the platform; no separate provisioning step, no infrastructure configuration required. Drop it into your application and you're live.
Monitor performance and usage. Get real-time visibility into critical inference metrics including time-to-first-token (TTFT), end-to-end latency, and throughput. Monitor input/output/cached token usage, and view your inference spend all in one place.
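If you want to cross-check the dashboard numbers from the client side, here's a sketch that measures TTFT against a streaming endpoint, assuming the same OpenAI-compatible API as the earlier examples; the base URL and model ID remain placeholders.

```python
# Client-side TTFT measurement sketch against a streaming endpoint.
import time
from openai import OpenAI

client = OpenAI(base_url="https://<your-crusoe-endpoint>/v1", api_key="<your-api-key>")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="nvidia/nemotron-3-nano",  # hypothetical model ID
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    # The first chunk carrying content marks time-to-first-token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```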
Ready to run any Nemotron 3 model in production? Explore Crusoe Managed Inference and get started today — or contact sales if you're evaluating at scale.


