
nemotron-3-120b-a12b

Text Generation | NVIDIA | Hosted

NVIDIA Nemotron 3 Super is a hybrid MoE model with leading accuracy for multi-agent applications and specialized agentic AI systems.

Model Info
Context Window: 256,000 tokens
Terms and License: link
Function Calling: Yes
Reasoning: Yes
Unit Pricing: $0.50 per M input tokens, $1.50 per M output tokens
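
At these rates, for example, a request that consumes 100,000 input tokens and produces 10,000 output tokens would cost about 0.1 × $0.50 + 0.01 × $1.50 = $0.065.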

Playground

Try out this model with the Workers AI LLM Playground. It requires no setup or authentication and is an instant way to preview and test a model directly in the browser.

Launch the LLM Playground

Usage

TypeScript
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request, env): Promise<Response> {
    // Chat-style conversation: a system prompt followed by the user message
    const messages = [
      { role: "system", content: "You are a friendly assistant" },
      {
        role: "user",
        content: "What is the origin of the phrase Hello, World",
      },
    ];

    // Run the model and stream tokens back as they are generated
    const stream = await env.AI.run("@cf/nvidia/nemotron-3-120b-a12b", {
      messages,
      stream: true,
    });

    return new Response(stream, {
      headers: { "content-type": "text/event-stream" },
    });
  },
} satisfies ExportedHandler<Env>;
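
Returning the stream directly with a text/event-stream content type lets the caller receive tokens incrementally as the model generates them, rather than waiting for the full completion.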

Parameters

Input

prompt
string, required, minLength: 1. The input text prompt for the model to generate a response.
model
string. ID of the model to use (e.g. '@cf/zai-org/glm-4.7-flash', etc.).
frequency_penalty
number | null. Penalizes new tokens based on their existing frequency in the text so far.
logit_bias
object | null. Modify the likelihood of specified tokens appearing in the completion. Maps token IDs to bias values from -100 to 100.
logprobs
boolean | null. Whether to return log probabilities of the output tokens.
top_logprobs
integer | null. How many top log probabilities to return at each token position (0-20). Requires logprobs=true.
max_tokens
integer | null. Deprecated in favor of max_completion_tokens. The maximum number of tokens to generate.
max_completion_tokens
integer | null. An upper bound for the number of tokens that can be generated for a completion.
metadata
object | null. Set of 16 key-value pairs that can be attached to the object.
modalities
array | null. Output types requested from the model (e.g. ['text'] or ['text', 'audio']).
n
integer | null. How many chat completion choices to generate for each input message.
parallel_tool_calls
boolean, default: true. Whether to enable parallel function calling during tool use.
presence_penalty
number | null. Penalizes new tokens based on whether they appear in the text so far.
reasoning_effort
string | null. Constrains effort on reasoning for reasoning models (o1, o3-mini, etc.).
seed
integer | null. If specified, the system will make a best effort to sample deterministically.
service_tier
string | null. Specifies the processing type used for serving the request.
store
boolean | null. Whether to store the output for model distillation / evals.
stream
boolean | null. If true, partial message deltas will be sent as server-sent events.
temperature
number | null. Sampling temperature between 0 and 2.
top_p
number | null. Nucleus sampling: considers the results of the tokens with top_p probability mass.
user
string. A unique identifier representing your end-user, for abuse monitoring.
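
For illustration, here is a minimal sketch of passing a few of these parameters alongside the messages in a Worker; the parameter names come from the list above, and the specific values are only examples.

TypeScript
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request, env): Promise<Response> {
    // Values are illustrative; parameter names match the Input list above
    const result = await env.AI.run("@cf/nvidia/nemotron-3-120b-a12b", {
      messages: [
        { role: "user", content: "Explain nucleus sampling in one sentence." },
      ],
      temperature: 0.7, // sampling temperature between 0 and 2
      top_p: 0.9, // nucleus sampling
      max_completion_tokens: 256, // upper bound on generated tokens
      seed: 42, // best-effort deterministic sampling
    });
    return Response.json(result);
  },
} satisfies ExportedHandler<Env>;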

Output

Synchronous — Send a request and receive a complete response
id
string. A unique identifier for the chat completion.
object
string
created
integer. Unix timestamp (seconds) of when the completion was created.
model
string. The model used for the chat completion.
system_fingerprint
string | null
service_tier
string | null
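
A minimal sketch, assuming an Ai binding as in the Usage example above, of reading these documented metadata fields from a non-streaming result; the cast simply names the fields listed in this section.

TypeScript
// Sketch: log the documented metadata fields of a non-streaming completion.
async function describeCompletion(ai: Ai): Promise<void> {
  const result = await ai.run("@cf/nvidia/nemotron-3-120b-a12b", {
    messages: [{ role: "user", content: "Say hello" }],
  });

  // The cast only names the fields documented above
  const completion = result as unknown as {
    id: string;
    object: string;
    created: number;
    model: string;
    system_fingerprint: string | null;
    service_tier: string | null;
  };

  console.log(completion.id); // unique identifier for the chat completion
  console.log(completion.model); // model used for the chat completion
  console.log(new Date(completion.created * 1000).toISOString()); // created is Unix seconds
}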
Streaming — Send a request with `stream: true` and receive server-sent events
type: string
contentType: text/event-stream
format: binary
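
And a sketch of consuming the text/event-stream response on the client; it only splits out the data: payload of each server-sent event and makes no assumption about the shape of each chunk (the URL is a placeholder).

TypeScript
// Sketch: read the server-sent event stream returned by the streaming Worker above.
async function readEventStream(url: string): Promise<void> {
  const response = await fetch(url);
  const reader = response.body!.pipeThrough(new TextDecoderStream()).getReader();

  let buffered = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffered += value;

    // SSE events are separated by blank lines; payload lines start with "data: "
    const events = buffered.split("\n\n");
    buffered = events.pop() ?? "";
    for (const event of events) {
      for (const line of event.split("\n")) {
        if (line.startsWith("data: ")) {
          console.log(line.slice("data: ".length));
        }
      }
    }
  }
}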

API Schemas (Raw)

Synchronous Input
Synchronous Output
Streaming Input
Streaming Output