
nemotron-3-120b-a12b

Text Generation | NVIDIA | Hosted

NVIDIA Nemotron 3 Super is a hybrid MoE model with leading accuracy for multi-agent applications and specialized agentic AI systems.

Model Info
Context Window: 256,000 tokens
Terms and License: link
Function Calling: Yes
Reasoning: Yes
Unit Pricing: $0.50 per M input tokens, $1.50 per M output tokens
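
At these rates, for example, a request that consumes 100,000 input tokens and produces 10,000 output tokens would cost about 0.1 × $0.50 + 0.01 × $1.50 = $0.065.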

Playground

Try out this model with the Workers AI LLM Playground. It requires no setup or authentication and is an instant way to preview and test a model directly in the browser.

Launch the LLM Playground

Usage

TypeScript
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request, env): Promise<Response> {
    // Chat-style conversation: a system prompt followed by the user message
    const messages = [
      { role: "system", content: "You are a friendly assistant" },
      {
        role: "user",
        content: "What is the origin of the phrase Hello, World",
      },
    ];

    // Run the model and stream tokens back as they are generated
    const stream = await env.AI.run("@cf/nvidia/nemotron-3-120b-a12b", {
      messages,
      stream: true,
    });

    return new Response(stream, {
      headers: { "content-type": "text/event-stream" },
    });
  },
} satisfies ExportedHandler<Env>;
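
Returning the stream directly with a text/event-stream content type lets the caller receive tokens incrementally as the model generates them, rather than waiting for the full completion.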

Parameters

Input

prompt
string, required, minLength: 1. The input text prompt for the model to generate a response.
model
string. ID of the model to use (e.g. '@cf/zai-org/glm-4.7-flash', etc.).
frequency_penalty
number | null. Penalizes new tokens based on their existing frequency in the text so far.
logit_bias
object | null. Modify the likelihood of specified tokens appearing in the completion. Maps token IDs to bias values from -100 to 100.
logprobs
boolean | null. Whether to return log probabilities of the output tokens.
top_logprobs
integer | null. How many top log probabilities to return at each token position (0-20). Requires logprobs=true.
max_tokens
integer | null. Deprecated in favor of max_completion_tokens. The maximum number of tokens to generate.
max_completion_tokens
integer | null. An upper bound for the number of tokens that can be generated for a completion.
metadata
object | null. Set of 16 key-value pairs that can be attached to the object.
modalities
array | null. Output types requested from the model (e.g. ['text'] or ['text', 'audio']).
n
integer | null. How many chat completion choices to generate for each input message.
parallel_tool_calls
boolean, default: true. Whether to enable parallel function calling during tool use.
presence_penalty
number | null. Penalizes new tokens based on whether they appear in the text so far.
reasoning_effort
string | null. Constrains effort on reasoning for reasoning models (o1, o3-mini, etc.).
seed
integer | null. If specified, the system will make a best effort to sample deterministically.
service_tier
string | null. Specifies the processing type used for serving the request.
store
boolean | null. Whether to store the output for model distillation / evals.
stream
boolean | null. If true, partial message deltas will be sent as server-sent events.
temperature
number | null. Sampling temperature between 0 and 2.
top_p
number | null. Nucleus sampling: considers the results of the tokens with top_p probability mass.
user
string. A unique identifier representing your end-user, for abuse monitoring.
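
For illustration, here is a minimal sketch of passing a few of these parameters alongside the messages in a Worker; the parameter names come from the list above, and the specific values are only examples.

TypeScript
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request, env): Promise<Response> {
    // Values are illustrative; parameter names match the Input list above
    const result = await env.AI.run("@cf/nvidia/nemotron-3-120b-a12b", {
      messages: [
        { role: "user", content: "Explain nucleus sampling in one sentence." },
      ],
      temperature: 0.7, // sampling temperature between 0 and 2
      top_p: 0.9, // nucleus sampling
      max_completion_tokens: 256, // upper bound on generated tokens
      seed: 42, // best-effort deterministic sampling
    });
    return Response.json(result);
  },
} satisfies ExportedHandler<Env>;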

Output

Synchronous — Send a request and receive a complete response
id
string. A unique identifier for the chat completion.
object
string
created
integer. Unix timestamp (seconds) of when the completion was created.
model
string. The model used for the chat completion.
system_fingerprint
string | null
service_tier
string | null
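
A minimal sketch, assuming an Ai binding as in the Usage example above, of reading these documented metadata fields from a non-streaming result; the cast simply names the fields listed in this section.

TypeScript
// Sketch: log the documented metadata fields of a non-streaming completion.
async function describeCompletion(ai: Ai): Promise<void> {
  const result = await ai.run("@cf/nvidia/nemotron-3-120b-a12b", {
    messages: [{ role: "user", content: "Say hello" }],
  });

  // The cast only names the fields documented above
  const completion = result as unknown as {
    id: string;
    object: string;
    created: number;
    model: string;
    system_fingerprint: string | null;
    service_tier: string | null;
  };

  console.log(completion.id); // unique identifier for the chat completion
  console.log(completion.model); // model used for the chat completion
  console.log(new Date(completion.created * 1000).toISOString()); // created is Unix seconds
}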
Streaming — Send a request with `stream: true` and receive server-sent events
type: string
contentType: text/event-stream
format: binary
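
And a sketch of consuming the text/event-stream response on the client; it only splits out the data: payload of each server-sent event and makes no assumption about the shape of each chunk (the URL is a placeholder).

TypeScript
// Sketch: read the server-sent event stream returned by the streaming Worker above.
async function readEventStream(url: string): Promise<void> {
  const response = await fetch(url);
  const reader = response.body!.pipeThrough(new TextDecoderStream()).getReader();

  let buffered = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffered += value;

    // SSE events are separated by blank lines; payload lines start with "data: "
    const events = buffered.split("\n\n");
    buffered = events.pop() ?? "";
    for (const event of events) {
      for (const line of event.split("\n")) {
        if (line.startsWith("data: ")) {
          console.log(line.slice("data: ".length));
        }
      }
    }
  }
}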

API Schemas (Raw)

Synchronous Input
Synchronous Output
Streaming Input
Streaming Output