CloudClaw: A Recoverable Execution Substrate for Multi-Tenant LLM Agents

Jinyao Liu (刘晋尧) | 2026-04-13 | LLM, Agent, Scheduling

As large language model (LLM) agents evolve from prompt-based systems into autonomous multi-step executors, a new class of system challenges emerges: how to reliably execute large numbers of long-running, multi-tenant agent tasks in cloud environments.

Existing approaches based on container orchestration are insufficient to meet these requirements.

The Problem

Real deployments face a hard infrastructure constraint that is often overlooked: a single host machine can run only a limited number of containers at the same time.

This limit comes from CPU, memory, file descriptors, process scheduling overhead, container runtime limits, and overall system stability. As the number of users grows, a naive “one task, one fresh container” approach quickly hits a scaling bottleneck. Even before GPU or model inference becomes the dominant constraint, the host may already be unable to create or sustain more isolated execution environments.

This creates a fundamental serving problem: how can a system support more concurrent users when the number of execution containers per host is inherently limited?

Modern LLM agents such as OpenCode, OpenClaw, and Claude Code are capable of planning, tool usage, and environment interaction. These capabilities transform LLM usage from simple request-response patterns into long-running execution workflows.

However, current systems face several fundamental limitations.

Cold-Start Overhead

Container-based execution typically launches a new runtime for each task, introducing significant cold-start latency and limiting throughput under concurrent workloads.

Lack of Recoverability

Failures during execution often result in complete task loss. There is no standard mechanism to recover intermediate state or resume execution.

Weak Multi-Tenant Isolation

In multi-user environments, tasks may interfere with each other through shared resources such as filesystems, leading to correctness and security issues.

Tight Coupling Between Logic and Execution

Agent logic is often tightly coupled with execution environments, making it difficult to support multiple runtimes or evolve system architecture.

CloudClaw: Core Idea

CloudClaw is a recoverable execution substrate designed for multi-tenant LLM agents.

Its core principle is to decouple agent logic from execution management.

Instead of increasing container count, CloudClaw focuses on increasing effective utilization of a fixed execution pool.

This allows CloudClaw to provide:

Persistent task scheduling
Fault-tolerant execution
Strong workspace isolation
Unified runtime abstraction
Structured result delivery

System Architecture

CloudClaw is organized into three logical planes:

Control Plane

The control plane is responsible for:

Task scheduling
Lease-based task ownership
Heartbeat monitoring
Failure recovery
Result persistence

Runtime Plane

The runtime plane encapsulates heterogeneous agent runtimes through a unified interface.

Each runtime is managed via a pre-warmed execution pool, enabling fast task startup without repeated container initialization.

State Plane

The state plane maintains persistent system state, including:

Task metadata
Execution logs
Workspace snapshots
Final results

Key Design Mechanisms

Recoverable Execution

Tasks in CloudClaw follow a state machine:

QUEUED → RUNNING → SUCCEEDED / FAILED / CANCELED
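The lifecycle above can be written down as an explicit transition table. The following is a minimal sketch, not CloudClaw's actual code: the names `TaskState`, `TRANSITIONS`, and `advance` are hypothetical, and two edges are assumptions that do not appear in the diagram (RUNNING -> QUEUED, to model a task being handed back for rescheduling, and QUEUED -> CANCELED).

```python
from enum import Enum


class TaskState(Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


# Allowed transitions. RUNNING -> QUEUED and QUEUED -> CANCELED are
# assumptions made for this sketch; the other edges come from the diagram.
TRANSITIONS = {
    TaskState.QUEUED: {TaskState.RUNNING, TaskState.CANCELED},
    TaskState.RUNNING: {
        TaskState.SUCCEEDED,
        TaskState.FAILED,
        TaskState.CANCELED,
        TaskState.QUEUED,
    },
    # Terminal states have no outgoing edges.
    TaskState.SUCCEEDED: set(),
    TaskState.FAILED: set(),
    TaskState.CANCELED: set(),
}


def advance(current: TaskState, target: TaskState) -> TaskState:
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting illegal transitions in one place keeps terminal states terminal, so a late or duplicate worker report cannot move a finished task back to RUNNING.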

Execution is governed by lease-based ownership and heartbeat monitoring.

If a worker fails, it stops sending heartbeats, its lease expires, and the task is automatically rescheduled. A worker failure alone therefore does not cause a task to be permanently lost.
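One way to picture lease-based ownership with heartbeats is the in-memory sketch below. It is an illustration under stated assumptions rather than CloudClaw's implementation: `Lease`, `LeaseTable`, and the method names are hypothetical, the lease TTL is a free parameter, and time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Lease:
    worker_id: str
    expires_at: float


class LeaseTable:
    """In-memory lease bookkeeping; a real system would persist this state."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self.leases: Dict[str, Lease] = {}

    def acquire(self, task_id: str, worker_id: str, now: float) -> bool:
        """Grant ownership unless another lease on the task is still live."""
        lease = self.leases.get(task_id)
        if lease is not None and lease.expires_at > now:
            return False
        self.leases[task_id] = Lease(worker_id, now + self.ttl)
        return True

    def heartbeat(self, task_id: str, worker_id: str, now: float) -> bool:
        """Extend a live lease; refuse if it expired or belongs to someone else."""
        lease = self.leases.get(task_id)
        if lease is None or lease.worker_id != worker_id or lease.expires_at <= now:
            return False
        lease.expires_at = now + self.ttl
        return True

    def reap_expired(self, now: float) -> List[str]:
        """Drop expired leases and return the task IDs that need rescheduling."""
        expired = [t for t, lease in self.leases.items() if lease.expires_at <= now]
        for task_id in expired:
            del self.leases[task_id]
        return expired
```

In this sketch a scheduler would call `reap_expired` periodically and put the returned task IDs back in the queue. A worker whose heartbeat is refused should stop working on the task, since another worker may already own it.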

Pre-Warmed Execution Pools

Instead of launching new containers per task, CloudClaw maintains pools of pre-initialized execution environments.

This significantly reduces per-task startup latency, and with it queueing delay, and improves throughput while staying within host limits.
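The pooling idea can be sketched with a bounded queue of ready environments. This is a simplified illustration, not CloudClaw's pool: `WarmPool`, `factory`, and `reset` are hypothetical names, and the "environment" is whatever object the factory returns rather than a real container.

```python
import queue
from contextlib import contextmanager
from typing import Callable, Generic, Iterator, Optional, TypeVar

T = TypeVar("T")


class WarmPool(Generic[T]):
    """Fixed-size pool of environments created once and reused across tasks."""

    def __init__(
        self,
        size: int,
        factory: Callable[[], T],
        reset: Callable[[T], None] = lambda env: None,
    ) -> None:
        self._reset = reset
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())  # pay the startup cost once, up front

    @contextmanager
    def lease(self, timeout: Optional[float] = None) -> Iterator[T]:
        """Borrow an environment; blocks (or raises queue.Empty) when all are busy."""
        env = self._idle.get(timeout=timeout)
        try:
            yield env
        finally:
            self._reset(env)  # clean up before the next task reuses it
            self._idle.put(env)

    def idle_count(self) -> int:
        return self._idle.qsize()
```

The pool size stays fixed at whatever the host can sustain, so additional tasks wait for a free environment instead of creating new containers.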

Workspace Isolation and Persistence

Each task is executed in an isolated workspace.

CloudClaw enforces:

User-scoped file isolation
Protection against path escape (e.g., symlinks)
File size limits
Persistent state across tasks
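Path-escape protection and size limits of this kind can be sketched with the standard library. The function names `resolve_in_workspace` and `check_size` are hypothetical, and the sketch assumes each user's workspace is a directory on a POSIX filesystem.

```python
from pathlib import Path


def resolve_in_workspace(workspace_root: Path, user_path: str) -> Path:
    """Resolve user_path inside workspace_root, rejecting anything that escapes it.

    Path.resolve() follows symlinks and collapses "..", so the check is made
    on the real location rather than on the string the caller supplied.
    """
    root = workspace_root.resolve()
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"path escapes workspace: {user_path}")
    return candidate


def check_size(data: bytes, max_bytes: int) -> None:
    """Reject a write whose payload exceeds the per-file limit."""
    if len(data) > max_bytes:
        raise ValueError(f"file too large: {len(data)} > {max_bytes} bytes")
```

A check like this is only as strong as the gap between checking and using the path: if another process can swap in a symlink in between, the guarantee weakens, so a production system would likely combine it with OS-level isolation.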

Unified Runtime Abstraction

CloudClaw uses an adapter-based design to support multiple runtimes.

Core components include:

RuntimeAdapter for executing tasks
WorkspaceManager for managing execution environments
PoolAdapter for managing execution pools
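The three component names come from the design above; their method signatures are not specified here, so everything below the class names is an assumption. The sketch shows how an adapter-based design might let scheduling code stay unaware of the concrete runtime. `EchoRuntime`, `SingleRuntimePool`, `DirWorkspaceManager`, and `execute` are hypothetical stand-ins.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class RuntimeAdapter(ABC):
    @abstractmethod
    def run(self, prompt: str, workspace: Path) -> str:
        """Execute one task inside the given workspace and return its output."""


class WorkspaceManager(ABC):
    @abstractmethod
    def workspace_for(self, user_id: str) -> Path:
        """Return the directory a user's tasks are allowed to touch."""


class PoolAdapter(ABC):
    @abstractmethod
    def acquire(self) -> RuntimeAdapter: ...

    @abstractmethod
    def release(self, runtime: RuntimeAdapter) -> None: ...


class EchoRuntime(RuntimeAdapter):
    """Stand-in runtime; a real adapter would drive an actual agent."""

    def run(self, prompt: str, workspace: Path) -> str:
        return f"echo:{prompt}"


class SingleRuntimePool(PoolAdapter):
    def __init__(self, runtime: RuntimeAdapter) -> None:
        self._runtime = runtime

    def acquire(self) -> RuntimeAdapter:
        return self._runtime

    def release(self, runtime: RuntimeAdapter) -> None:
        pass  # nothing to clean up in this toy pool


class DirWorkspaceManager(WorkspaceManager):
    def __init__(self, root: Path) -> None:
        self._root = root

    def workspace_for(self, user_id: str) -> Path:
        # A real implementation would validate user_id before using it as a path.
        path = self._root / user_id
        path.mkdir(parents=True, exist_ok=True)
        return path


def execute(pool: PoolAdapter, workspaces: WorkspaceManager,
            user_id: str, prompt: str) -> str:
    """Runtime-agnostic execution path: depends only on the three interfaces."""
    runtime = pool.acquire()
    try:
        return runtime.run(prompt, workspaces.workspace_for(user_id))
    finally:
        pool.release(runtime)
```

Because `execute` depends only on the abstract interfaces, supporting another agent runtime would mean writing a new `RuntimeAdapter` subclass rather than changing the scheduling path.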

Structured Result Delivery

Execution results are transformed into structured outputs and stored persistently.

This design decouples execution from result consumption.
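As an illustration of that decoupling, a result can be serialized once by the worker and read later by any consumer. The schema and the names `TaskResult` and `ResultStore` are assumptions made for this sketch, and SQLite stands in for whatever persistent store the system actually uses.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class TaskResult:
    task_id: str
    status: str           # e.g. "SUCCEEDED" or "FAILED"
    output: str           # final text produced by the agent
    artifacts: List[str]  # workspace-relative paths of files it produced


class ResultStore:
    """Persists results as JSON rows keyed by task ID."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results ("
            "task_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
        )

    def put(self, result: TaskResult) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO results (task_id, payload) VALUES (?, ?)",
            (result.task_id, json.dumps(asdict(result))),
        )
        self.conn.commit()

    def get(self, task_id: str) -> Optional[TaskResult]:
        row = self.conn.execute(
            "SELECT payload FROM results WHERE task_id = ?", (task_id,)
        ).fetchone()
        return TaskResult(**json.loads(row[0])) if row else None
```

The worker only calls `put`; a client can call `get` at any later time, including after the worker that produced the result has exited.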

Use Cases

Multi-Tenant Agent Platforms

CloudClaw enables reliable execution of large numbers of agent tasks across multiple users.

Large-Scale Agent Experimentation

Researchers can run concurrent experiments with reproducible workflows.

AI Automation Systems

CloudClaw supports long-running automation tasks such as code generation.

Multi-Runtime Deployment

Different agent runtimes can be integrated into a single system.

Conclusion

CloudClaw addresses a fundamental gap in current LLM agent infrastructure.

By introducing recoverable execution, persistent state management, and unified runtime abstraction, it enables reliable, scalable, and multi-tenant agent execution in cloud environments, even under strict container capacity constraints.