Cloudflare Releases Agents SDK v0.5.0 with Rewritten @cloudflare/ai-chat and New Rust-Powered Infire Engine for Optimized Edge Inference Performance


Cloudflare has released the Agents SDK v0.5.0 to address the limitations of stateless serverless functions in AI development. In standard serverless architectures, every LLM call requires rebuilding the session context from scratch, which increases latency and token consumption. Version 0.5.0 provides a vertically integrated execution layer where compute, state, and inference coexist at the network edge.

The SDK allows developers to build agents that maintain state over long durations, moving beyond simple request-response cycles. This is achieved through two primary technologies: Durable Objects, which provide persistent state and identity, and Infire, a custom-built Rust inference engine designed to optimize edge resources. For developers, this architecture removes the need to manage external database connections or WebSocket servers for state synchronization.

State Management via Durable Objects

The Agents SDK relies on Durable Objects (DO) to provide persistent identity and memory for every agent instance. In traditional serverless models, functions have no memory of previous events unless they query an external database like RDS or DynamoDB, which often adds 50ms to 200ms of latency.

A Durable Object is a stateful micro-server running on Cloudflare’s network with its own private storage. When an agent is instantiated using the Agents SDK, it is assigned a stable ID. All subsequent requests for that user are routed to the same physical instance, allowing the agent to keep its state in memory. Each agent includes an embedded SQLite database with a 1GB storage limit per instance, enabling zero-latency reads and writes for conversation history and task logs.
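A minimal sketch of what such an agent might look like is shown below. It assumes the SDK's `Agent` base class from the `agents` package, its `onRequest` handler, and the `this.sql` tagged-template helper over the embedded SQLite store; the class name and schema are illustrative rather than taken from the release.

```ts
import { Agent } from "agents"; // assumed package and class name from the Agents SDK

interface Env {} // Worker bindings would go here

// Each instance is addressed by a stable ID, so its state survives across
// requests without any external database round trip.
export class ChatAgent extends Agent<Env> {
  async onRequest(request: Request): Promise<Response> {
    const { message } = (await request.json()) as { message: string };

    // Zero-latency writes to the agent's embedded SQLite store (up to 1 GB per instance).
    this.sql`CREATE TABLE IF NOT EXISTS history (role TEXT, content TEXT, ts INTEGER)`;
    this.sql`INSERT INTO history (role, content, ts)
             VALUES ('user', ${message}, ${Date.now()})`;

    // Conversation history is read back from the same instance, with no network hop.
    const rows = this.sql`SELECT role, content FROM history ORDER BY ts`;
    return Response.json({ turns: rows.length });
  }
}
```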

Durable Objects are single-threaded, which simplifies concurrency management. This design ensures that only one event is processed at a time for a specific agent instance, eliminating race conditions. If an agent receives multiple inputs simultaneously, they are queued and processed atomically, ensuring the state remains consistent during complex operations.
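In practice this means a read-modify-write sequence needs no locks or transactions. The fragment below is a hypothetical method on the agent sketched above, reusing the assumed `this.sql` helper.

```ts
// Hypothetical method on the agent above; names are illustrative.
// The runtime queues simultaneous inputs, so this read-modify-write sequence
// runs to completion before the next event is delivered; no locks are required.
async recordTurn(content: string): Promise<number> {
  const [row] = this.sql`SELECT COUNT(*) AS n FROM history` as { n: number }[]; // read
  this.sql`INSERT INTO history (role, content, ts)
           VALUES ('user', ${content}, ${Date.now()})`;                          // write
  return (row?.n ?? 0) + 1; // turn count stays consistent under concurrent input
}
```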

Infire: Optimizing Inference with Rust

For the inference layer, Cloudflare developed Infire, an LLM engine written in Rust that replaces Python-based stacks like vLLM. Python engines often face performance bottlenecks due to the Global Interpreter Lock (GIL) and garbage collection pauses. Infire is designed to maximize GPU utilization on H100 hardware by reducing CPU overhead.

The engine utilizes Granular CUDA Graphs and Just-In-Time (JIT) compilation. Instead of launching GPU kernels sequentially, Infire compiles a dedicated CUDA graph for every possible batch size on the fly. This allows the driver to execute work as a single monolithic structure, cutting CPU overhead by 82%. Benchmarks show that Infire is 7% faster than vLLM 0.10.0 on unloaded machines, utilizing only 25% CPU compared to vLLM’s >140%.

| Metric | vLLM 0.10.0 (Python) | Infire (Rust) | Improvement |
|---|---|---|---|
| Throughput speed | Baseline | 7% faster | +7% |
| CPU overhead | >140% CPU usage | 25% CPU usage | -82% |
| Startup latency | High (cold start) | <4 seconds (Llama 3 8B) | Significant |

Infire also uses Paged KV Caching, which breaks memory into non-contiguous blocks to prevent fragmentation. This enables ‘continuous batching,’ where the engine processes new prompts while simultaneously finishing previous generations without a performance drop. This architecture allows Cloudflare to maintain a 99.99% warm request rate for inference.
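From an agent or Worker, this inference layer is reached through the Workers AI binding; Infire itself is transparent to the caller. Below is a minimal sketch assuming a standard `AI` binding and a hosted Llama 3 8B model identifier; the model name is an assumption for illustration.

```ts
interface Env {
  AI: Ai; // Workers AI binding; Infire serves the model behind this interface
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = (await request.json()) as { prompt: string };

    // Model identifier assumed here; any Workers AI text-generation model would work.
    const result = await env.AI.run("@cf/meta/llama-3-8b-instruct", {
      messages: [{ role: "user", content: prompt }],
    });

    return Response.json(result);
  },
};
```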

Code Mode and Token Efficiency

Standard AI agents typically use ‘tool calling,’ where the LLM outputs a JSON object to trigger a function. This process requires a back-and-forth between the LLM and the execution environment for every tool used. Cloudflare’s ‘Code Mode’ changes this by asking the LLM to write a TypeScript program that orchestrates multiple tools at once.

This code executes in a secure V8 isolate sandbox. For complex tasks, such as searching 10 different files, Code Mode provides an 87.5% reduction in token usage. Because intermediate results stay within the sandbox and are not sent back to the LLM for every step, the process is both faster and more cost-effective.

Code Mode also improves security through ‘secure bindings.’ The sandbox has no internet access; it can only interact with Model Context Protocol (MCP) servers through specific bindings in the environment object. These bindings hide sensitive API keys from the LLM, preventing the model from accidentally leaking credentials in its generated code.
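Put together, the program the model emits in Code Mode might look like the sketch below: one TypeScript function that fans out across ten files through an environment binding, aggregates matches inside the isolate, and returns only a small summary. The entry-point shape and the `FILES` binding are hypothetical assumptions; the point is that no API key appears in the generated code and intermediate results never travel back to the LLM.

```ts
// Illustrative example of code an LLM might generate in Code Mode.
interface CodeModeEnv {
  FILES: { search(path: string, query: string): Promise<string[]> }; // hypothetical binding
}

export default async function run(env: CodeModeEnv) {
  const paths = [
    "docs/a.md", "docs/b.md", "docs/c.md", "docs/d.md", "docs/e.md",
    "src/a.ts", "src/b.ts", "src/c.ts", "src/d.ts", "src/e.ts",
  ];

  // One program orchestrates all ten searches instead of ten separate tool calls.
  const hits = await Promise.all(paths.map((p) => env.FILES.search(p, "retry")));

  // Only this small object crosses back to the LLM; everything else stays in the sandbox.
  return { matchingLines: hits.flat().length };
}
```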

February 2026: The v0.5.0 Release

The Agents SDK reached version 0.5.0. This release introduced several utilities for production-ready agents:

  • this.retry(): A new method for retrying asynchronous operations with exponential backoff and jitter; a usage sketch follows the feature table below.
  • Protocol Suppression: Developers can now suppress JSON text frames on a per-connection basis using the shouldSendProtocolMessages hook. This is useful for IoT or MQTT clients that cannot process JSON data.
  • Stable AI Chat: The @cloudflare/ai-chat package reached version 0.1.0, adding message persistence to SQLite and a “Row Size Guard” that performs automatic compaction when messages approach the 2MB SQLite limit.
| Feature | Description |
|---|---|
| this.retry() | Automatic retries for external API calls. |
| Data Parts | Attaching typed JSON blobs to chat messages. |
| Tool Approval | Persistent approval state that survives hibernation. |
| Synchronous Getters | getQueue() and getSchedule() no longer require Promises. |
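As referenced in the list above, a call to the new retry helper might look like the following sketch. The options object and its field name are assumptions made for illustration; only the method name and the backoff-with-jitter behavior come from the release notes.

```ts
// Hypothetical usage inside an agent method; the options shape is illustrative.
const weather = await this.retry(
  async () => {
    const res = await fetch("https://api.example.com/weather?city=Lisbon");
    if (!res.ok) throw new Error(`upstream returned ${res.status}`);
    return res.json();
  },
  { maxAttempts: 5 } // each failure is retried with exponential backoff plus jitter
);
```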

Key Takeaways

  • Stateful Persistence at the Edge: Unlike traditional stateless serverless functions, the Agents SDK uses Durable Objects to provide agents with a permanent identity and memory. This allows each agent to maintain its own state in an embedded SQLite database with 1GB of storage, enabling zero-latency data access without external database calls.
  • High-Efficiency Rust Inference: Cloudflare’s Infire inference engine, written in Rust, optimizes GPU utilization by using Granular CUDA Graphs to reduce CPU overhead by 82%. Benchmarks show it is 7% faster than Python-based vLLM 0.10.0 and uses Paged KV Caching to maintain a 99.99% warm request rate, significantly reducing cold start latencies.
  • Token Optimization via Code Mode: ‘Code Mode’ allows agents to write and execute TypeScript programs in a secure V8 isolate rather than making multiple individual tool calls. This deterministic approach reduces token consumption by 87.5% for complex tasks and keeps intermediate data within the sandbox to improve both speed and security.
  • Universal Tool Integration: The platform fully supports the Model Context Protocol (MCP), a standard that acts as a universal translator for AI tools. Cloudflare has deployed 13 official MCP servers that allow agents to securely manage infrastructure components like DNS, R2 storage, and Workers KV through natural language commands.
  • Production-Ready Utilities (v0.5.0): The February 2026 release introduced critical reliability features, including a this.retry() utility for asynchronous operations with exponential backoff and jitter. It also added protocol suppression, which allows agents to communicate with binary-only IoT devices and lightweight embedded systems that cannot process standard JSON text frames.
