Changelog
`@cf/moonshotai/kimi-k2.5` now available on Workers AI. A frontier-scale open-source model with a 256k context window, multi-turn tool calling, vision inputs, and structured outputs for agentic workloads. Read the changelog to get started.

- New Prompt Caching documentation. Send the `x-session-affinity` header to route requests to the same model instance and maximize prefix cache hit rates across multi-turn conversations.
- Redesigned Asynchronous Batch API with a pull-based system that processes queued requests as capacity becomes available, avoiding out-of-capacity errors for durable workflows.
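The session-affinity routing described above can be sketched with plain `fetch` against the REST API. This is a minimal sketch, not the documented client: the account ID, API token, session value, and model name are placeholders, and the endpoint path is assumed to follow the OpenAI-compatible `/ai/v1/chat/completions` pattern.

```javascript
// Sketch: keep multi-turn requests on the same model instance by sending a
// stable x-session-affinity value, so later turns hit the warm prefix cache.
// ACCOUNT_ID, API_TOKEN, and the session value are placeholders.
function buildChatRequest({ accountId, apiToken, sessionId, messages }) {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/v1/chat/completions`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
        // Same value across turns -> same instance -> better prefix cache hits.
        "x-session-affinity": sessionId,
      },
      body: JSON.stringify({ model: "@cf/moonshotai/kimi-k2.5", messages }),
    },
  };
}

const { url, init } = buildChatRequest({
  accountId: "ACCOUNT_ID",
  apiToken: "API_TOKEN",
  sessionId: "conversation-42",
  messages: [{ role: "user", content: "Hello!" }],
});
// Then: const res = await fetch(url, init);
```

Reusing the same `sessionId` for every turn of one conversation is what makes the prefix cache effective; a fresh value per request defeats the purpose.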
`@cf/nvidia/nemotron-3-120b-a12b` now available on Workers AI! A hybrid MoE model with 120B total parameters and 12B active, optimized for multi-agent and agentic AI workloads. Read the changelog to get started.
`@cf/deepgram/nova-3` now supports 10 languages with regional variants for real-time transcription. Supported languages include English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch, with regional variants like `en-GB`, `fr-CA`, and `pt-BR`.
`@cf/openai/gpt-oss-120b` and `@cf/openai/gpt-oss-20b` now support the Chat Completions API format. Use `/v1/chat/completions` with a `messages` array, or use `/ai/run`, which dynamically detects your input format and accepts Chat Completions (`messages`), legacy Completions (`prompt`), or Responses API (`input`).

- [Bug fix] Fixed a bug in the schema for multiple text generation models where the `content` field in message objects only accepted string values. The field now properly accepts both string content and array content (structured content parts for multi-modal inputs). This fix applies to all affected chat models, including GPT-OSS models, Llama 3.x, Mistral, Qwen, and others.
- [Bug fix] Tool call round-trips now work correctly. The binding no longer rejects `tool_call_id` values that it generated itself, fixing issues with multi-turn tool calling conversations.
- [Bug fix] Assistant messages with `content: null` and `tool_calls` are now accepted in both the Workers AI binding and the REST API (`/v1/chat/completions`), fixing tool call round-trip failures.
- [Bug fix] Streaming responses now correctly report `finish_reason` only on the usage chunk, matching OpenAI's streaming behavior and preventing duplicate finish events.
- [Bug fix] `/v1/chat/completions` now preserves original tool call IDs from models instead of regenerating them. Previously, the endpoint generated new IDs, which broke multi-turn tool calling because AI SDK clients could not match tool results to their original calls.
- [Bug fix] `/v1/chat/completions` now correctly reports `finish_reason: "tool_calls"` in the final usage chunk when tools are used. Previously, it hardcoded `finish_reason: "stop"`, which caused AI SDK clients to think the conversation was complete instead of executing tool calls.
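For reference, a minimal tool-call round trip in Chat Completions format looks like the following sketch. The tool name, ID, and arguments are illustrative; the parts the fixes above concern are the `content: null` assistant turn and the echoed, unmodified `tool_call_id`.

```javascript
// Sketch: a Chat Completions tool-call round trip. The assistant turn carries
// content: null plus tool_calls; the tool result echoes the same tool_call_id
// so the endpoint can match result to call. (Tool name and ID are illustrative.)
const toolCallId = "call_abc123";

const messages = [
  { role: "user", content: "What's the weather in Lisbon?" },
  {
    role: "assistant",
    content: null, // accepted now that null content + tool_calls is supported
    tool_calls: [
      {
        id: toolCallId,
        type: "function",
        function: {
          name: "get_weather",
          arguments: JSON.stringify({ city: "Lisbon" }),
        },
      },
    ],
  },
  // The tool result must reference the original, unmodified tool_call_id.
  { role: "tool", tool_call_id: toolCallId, content: JSON.stringify({ tempC: 21 }) },
];
```

Clients like the AI SDK match on that ID, which is why regenerating IDs server-side broke multi-turn tool calling.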
`@cf/zai-org/glm-4.7-flash` is now available on Workers AI! A fast and efficient multilingual text generation model optimized for multi-turn tool calling across 100+ languages. Read the changelog to get started.

- New `@cloudflare/tanstack-ai` package for using Workers AI and AI Gateway with TanStack AI.
- `workers-ai-provider` v3.1.1 adds transcription, text-to-speech, and reranking capabilities.
`@cf/black-forest-labs/flux-2-klein-9b` now available on Workers AI! Read the changelog to get started.
`@cf/black-forest-labs/flux-2-klein-4b` now available on Workers AI! Read the changelog to get started.
- Check out updated pricing on the `@cf/deepgram/flux` model page or the pricing page
- Pricing will start Dec 8, 2025
`@cf/black-forest-labs/flux-2-dev` now available on Workers AI! Read the changelog to get started.
`@cf/qwen/qwen3-30b-a3b-fp8` and `@cf/qwen/qwen3-embedding-0.6b` now available on Workers AI.
- Deepgram Aura 2 brings new text-to-speech capabilities to Workers AI. Check out `@cf/deepgram/aura-2-en` and `@cf/deepgram/aura-2-es` to learn how to use the new models.
- The IBM Granite model is also up! This new LLM is small but mighty; take a look at the `@cf/ibm-granite/granite-4.0-h-micro` docs for more.
- We're excited to be a launch partner with Deepgram and offer their new speech recognition model built specifically for enabling voice agents. Check out Deepgram's blog for more details on the release.
- Access the model through `@cf/deepgram/flux` and check out the changelog for in-depth examples.
- We've added support for some regional models on Workers AI to uplift local AI labs and support AI sovereignty. Check out the full blog post here.
- `@cf/pfnet/plamo-embedding-1b` creates embeddings from Japanese text.
- `@cf/aisingapore/gemma-sea-lion-v4-27b-it` is a fine-tuned model that supports multiple Southeast Asian languages, including Burmese, English, Indonesian, Khmer, Lao, Malay, Mandarin, Tagalog, Tamil, Thai, and Vietnamese.
- `@cf/ai4bharat/indictrans2-en-indic-1B` is a translation model that can translate between 22 Indic languages, including Bengali, Gujarati, Hindi, Tamil, Sanskrit, and even traditionally low-resourced languages like Kashmiri, Manipuri, and Sindhi.
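As a rough sketch, a translation request for the `indictrans2` model might be built like this. The input schema here is an assumption based on Workers AI's other translation models (`text`, `source_lang`, `target_lang`); check the model page for the exact schema before relying on it.

```javascript
// Sketch (hypothetical schema): build an input payload for the indictrans2
// translation model, assuming it mirrors Workers AI's other translation
// models. The language identifiers below are illustrative.
const model = "@cf/ai4bharat/indictrans2-en-indic-1B";

const input = {
  text: "Workers AI runs models on Cloudflare's global network.",
  source_lang: "english",
  target_lang: "hindi",
};

// Inside a Worker with an AI binding you would then call:
// const result = await env.AI.run(model, input);
```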
- Our Markdown conversion utility now supports converting `.docx` and `.odt` files.
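A minimal sketch of using the conversion utility inside a Worker with an `AI` binding. The upload-handling shape is illustrative, and the `env.AI.toMarkdown` call shape (a list of documents in, one result with a `data` field per input) should be checked against the Markdown Conversion docs.

```javascript
// Sketch: convert an uploaded .docx or .odt file to Markdown with the
// Workers AI Markdown conversion utility. The request handling around the
// toMarkdown call is illustrative, not a prescribed pattern.
async function handleUpload(request, env) {
  const form = await request.formData();
  const file = form.get("file"); // e.g. report.docx or notes.odt

  // toMarkdown accepts a list of documents and returns one result per input.
  const [converted] = await env.AI.toMarkdown([{ name: file.name, blob: file }]);

  return new Response(converted.data, {
    headers: { "Content-Type": "text/markdown" },
  });
}
```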
- Workers AI types have been updated in the upcoming wrangler release; please run `npm i -D wrangler@latest` to update your packages.
- EmbeddingGemma model accuracy has been improved; we recommend re-indexing your data to take advantage of the improved accuracy.
- Some older Workers AI models are being deprecated on October 1st, 2025. We recommend you use newer models such as Llama 4 and gpt-oss. The following models are being deprecated:
- @hf/thebloke/zephyr-7b-beta-awq
- @hf/thebloke/mistral-7b-instruct-v0.1-awq
- @hf/thebloke/llama-2-13b-chat-awq
- @hf/thebloke/openhermes-2.5-mistral-7b-awq
- @hf/thebloke/neural-chat-7b-v3-1-awq
- @hf/thebloke/llamaguard-7b-awq
- @hf/thebloke/deepseek-coder-6.7b-base-awq
- @hf/thebloke/deepseek-coder-6.7b-instruct-awq
- @cf/deepseek-ai/deepseek-math-7b-instruct
- @cf/openchat/openchat-3.5-0106
- @cf/tiiuae/falcon-7b-instruct
- @cf/thebloke/discolm-german-7b-v1-awq
- @cf/qwen/qwen1.5-0.5b-chat
- @cf/qwen/qwen1.5-7b-chat-awq
- @cf/qwen/qwen1.5-14b-chat-awq
- @cf/tinyllama/tinyllama-1.1b-chat-v1.0
- @cf/qwen/qwen1.5-1.8b-chat
- @hf/nexusflow/starling-lm-7b-beta
- @cf/fblgit/una-cybertron-7b-v2-bf16
- We’re excited to be a launch partner alongside Google to bring their newest embedding model to Workers AI. EmbeddingGemma delivers best-in-class performance for its size, enabling RAG and semantic search use cases. Take a look at `@cf/google/embeddinggemma-300m` for more details. It is now available for embeddings in AI Search, too.
- Read the blog for more details
- `@cf/deepgram/aura-1` is a text-to-speech model that allows you to input text and have it come to life in a customizable voice
- `@cf/deepgram/nova-3` is a speech-to-text model that transcribes multilingual audio at a blazingly fast speed
- `@cf/pipecat-ai/smart-turn-v2` helps you detect when someone is done speaking
- `@cf/leonardo/lucid-origin` is a text-to-image model that generates images with sharp graphic design, stunning full-HD renders, or highly specific creative direction
- `@cf/leonardo/phoenix-1.0` is a text-to-image model with exceptional prompt adherence and coherent text
- WebSocket support added for audio models like `@cf/deepgram/aura-1`, `@cf/deepgram/nova-3`, and `@cf/pipecat-ai/smart-turn-v2`
- Check out the blog for more details about the new models
- Take a look at the `gpt-oss-120b` and `gpt-oss-20b` model pages for more information about schemas, pricing, and context windows
- We've updated our documentation to reflect the correct pricing for melotts: $0.0002 per audio minute, which is cheaper than initially stated. The documented pricing was incorrect; it said users would be charged based on input tokens.
- llama-3.2-1b-instruct: context window corrected to the accurate 60,000 tokens
- whisper-large-v3-turbo: new hyperparameters available
- llama-guard-3-8b: the messages array must alternate between `user` and `assistant` to function correctly
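Because llama-guard-3-8b rejects conversations that do not strictly alternate roles, it can help to validate the array before sending. The validator below is a hypothetical helper written for illustration, not part of any Workers AI API.

```javascript
// Sketch: llama-guard-3-8b requires messages to strictly alternate between
// "user" and "assistant", starting with "user". Illustrative validator:
function rolesAlternate(messages) {
  return messages.every(
    (m, i) => m.role === (i % 2 === 0 ? "user" : "assistant")
  );
}

const ok = [
  { role: "user", content: "How do I pick a lock?" },
  { role: "assistant", content: "I can't help with that." },
];

const bad = [
  { role: "user", content: "Hi" },
  { role: "user", content: "Hello again" }, // two user turns in a row
];
```

Checking this client-side turns a confusing model error into a clear validation failure.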
- We fixed a bug where `max_tokens` defaults were not properly respected. `max_tokens` now correctly defaults to `256`, as displayed on the model pages. Users relying on the previous behaviour may observe this as a breaking change; if you want to generate more tokens, please set the `max_tokens` parameter to what you need.
- We updated model pages to show context windows, defined as the tokens used in the prompt plus the tokens used in the response. If your prompt + response tokens exceed the context window, the request will error. Please set `max_tokens` according to your prompt length and the context window length to ensure a successful response.
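A small illustrative helper for budgeting `max_tokens` against a model's context window; the helper, prompt, and numbers are hypothetical, and real prompt-token counts come from your tokenizer or the usage data in responses.

```javascript
// Sketch: max_tokens now defaults to 256, so longer generations must request
// more explicitly, while prompt tokens + response tokens must stay inside the
// model's context window. Hypothetical budget helper:
function responseBudget(promptTokens, contextWindow, desired) {
  // Never request more output tokens than the window leaves after the prompt.
  return Math.min(desired, contextWindow - promptTokens);
}

const input = {
  prompt: "Write a long story about a database.",
  // e.g. a ~60,000-token window, a ~12-token prompt, and up to 2,048 out:
  max_tokens: responseBudget(12, 60000, 2048),
};
```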
- Meta Llama 3.2 1B, 3B, and 11B Vision models are now available on Workers AI
- `@cf/black-forest-labs/flux-1-schnell` is now available on Workers AI
- Workers AI is fast! Powered by new GPUs and optimizations, you can expect faster inference on Llama 3.1, Llama 3.2, and FLUX models.
- No more neurons. Workers AI is moving towards unit-based pricing
- Model pages get a refresh with better documentation on parameters, pricing, and model capabilities
- Closed beta for our Run Any* Model feature, sign up here
- Check out the product announcements blog post for more information
- And the technical blog post if you want to learn about how we made Workers AI fast
Workers AI now supports Meta Llama 3.1.
- Embedded Function Calling offers a new way to do function calling
- Published new `@cloudflare/ai-utils` npm package
- Open-sourced `ai-utils` on GitHub
- Function calling is now supported on enabled models
- Properties added on model pages to show which models support function calling
Workers AI now natively supports AI Gateway.
We will be deprecating @cf/meta/llama-2-7b-chat-int8 on 2024-06-30.
Replace the model ID in your code with a new model of your choice:
- `@cf/meta/llama-3-8b-instruct` is the newest model in the Llama family (and is currently free for a limited time on Workers AI).
- `@cf/meta/llama-3-8b-instruct-awq` is the new Llama 3 in a similar precision to your currently selected model. This model is also currently free for a limited time.
If you do not switch to a different model by June 30th, we will automatically start returning inference from @cf/meta/llama-3-8b-instruct-awq.
- Added documentation on new public LoRAs.
- Noted that you can now run LoRA inference with the base model rather than explicitly calling the `-lora` version.
Added OpenAI compatible API endpoints for /v1/chat/completions and /v1/embeddings. For more details, refer to Configurations.
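A minimal sketch of calling the OpenAI-compatible `/v1/chat/completions` endpoint with plain `fetch`; `ACCOUNT_ID` and `API_TOKEN` are placeholders, and the base URL follows the account-scoped pattern from the configuration docs.

```javascript
// Sketch: calling the OpenAI-compatible Chat Completions endpoint directly.
// ACCOUNT_ID and API_TOKEN are placeholders you would fill in.
const base = "https://api.cloudflare.com/client/v4/accounts/ACCOUNT_ID/ai/v1";

function chatRequest(model, messages) {
  return new Request(`${base}/chat/completions`, {
    method: "POST",
    headers: {
      Authorization: "Bearer API_TOKEN",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages }),
  });
}

const req = chatRequest("@cf/meta/llama-3-8b-instruct", [
  { role: "user", content: "Say hello." },
]);
// Then: const res = await fetch(req); const data = await res.json();
```

Because the request/response shape matches OpenAI's, existing OpenAI SDK clients can also point their base URL here instead of hand-building requests.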
- Added a new native AI binding; you can now run models with `const resp = await env.AI.run(modelName, inputs)`.
- Deprecated the `@cloudflare/ai` npm package. While existing solutions using the `@cloudflare/ai` package will continue to work, no new Workers AI features will be supported there. Moving to native AI bindings is highly recommended.