CloudClaw: A Recoverable Execution Substrate for Multi-Tenant LLM Agents
As large language model (LLM) agents evolve from prompt-based systems into autonomous multi-step executors, a new class of system challenges emerges: how to reliably execute large numbers of long-running, multi-tenant agent tasks in cloud environments.
Existing approaches based on container orchestration are insufficient to meet these requirements.
The Problem
In real deployments, one hard infrastructure constraint is often overlooked: a single host machine can run only a limited number of containers at the same time.
This limit comes from CPU, memory, file descriptors, process scheduling overhead, container runtime limits, and overall system stability. As the number of users grows, a naive “one task, one fresh container” approach quickly hits a scaling bottleneck. Even before GPU or model inference becomes the dominant constraint, the host may already be unable to create or sustain more isolated execution environments.
This creates a fundamental serving problem: how can a system support more concurrent users when the number of execution containers per host is inherently limited?
Modern LLM agents such as OpenCode, OpenClaw, and Claude Code are capable of planning, tool usage, and environment interaction. These capabilities transform LLM usage from simple request-response patterns into long-running execution workflows.
However, current systems face several fundamental limitations.
Cold-Start Overhead
Container-based execution typically launches a new runtime for each task, introducing significant cold-start latency and limiting throughput under concurrent workloads.
Lack of Recoverability
Failures during execution often result in complete task loss. There is no standard mechanism to recover intermediate state or resume execution.
Weak Multi-Tenant Isolation
In multi-user environments, tasks may interfere with each other through shared resources such as filesystems, leading to correctness and security issues.
Tight Coupling Between Logic and Execution
Agent logic is often tightly coupled with execution environments, making it difficult to support multiple runtimes or evolve system architecture.
CloudClaw: Core Idea
CloudClaw is a recoverable execution substrate designed for multi-tenant LLM agents.
Its core principle is to decouple agent logic from execution management.
Instead of increasing container count, CloudClaw focuses on increasing effective utilization of a fixed execution pool.
This allows CloudClaw to provide:
- Persistent task scheduling
- Fault-tolerant execution
- Strong workspace isolation
- Unified runtime abstraction
- Structured result delivery
System Architecture

CloudClaw is organized into three logical planes: a control plane, a runtime plane, and a state plane.
Control Plane
The control plane is responsible for:
- Task scheduling
- Lease-based task ownership
- Heartbeat monitoring
- Failure recovery
- Result persistence
Runtime Plane
The runtime plane encapsulates heterogeneous agent runtimes through a unified interface.
Each runtime is managed via a pre-warmed execution pool, enabling fast task startup without repeated container initialization.
State Plane
The state plane maintains persistent system state, including:
- Task metadata
- Execution logs
- Workspace snapshots
- Final results
Key Design Mechanisms
Recoverable Execution
Tasks in CloudClaw follow a state machine:
QUEUED → RUNNING → SUCCEEDED / FAILED / CANCELED
Execution is governed by lease-based ownership and heartbeat monitoring.
If a worker fails, it stops sending heartbeats, its lease expires, and the task is automatically rescheduled. A worker failure therefore does not cause the task to be permanently lost.
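The lease and heartbeat cycle can be illustrated with a minimal sketch. This is not CloudClaw's implementation: the `Task` fields, the function names, the 30-second lease, and the RUNNING to QUEUED transition used for rescheduling are all assumptions made for illustration. Time is passed in explicitly so the behavior is easy to follow.

```python
import enum
from dataclasses import dataclass
from typing import List, Optional

LEASE_SECONDS = 30.0  # assumed lease duration, not a CloudClaw default


class TaskState(enum.Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


@dataclass
class Task:
    task_id: str
    state: TaskState = TaskState.QUEUED
    owner: Optional[str] = None      # worker currently holding the lease
    lease_expires_at: float = 0.0


def acquire(task: Task, worker_id: str, now: float) -> bool:
    """Give a queued task to a worker and start its lease."""
    if task.state is not TaskState.QUEUED:
        return False
    task.state = TaskState.RUNNING
    task.owner = worker_id
    task.lease_expires_at = now + LEASE_SECONDS
    return True


def heartbeat(task: Task, worker_id: str, now: float) -> bool:
    """Extend the lease; only the current owner may do so, and only before expiry."""
    if task.state is not TaskState.RUNNING or task.owner != worker_id:
        return False
    if now >= task.lease_expires_at:
        return False
    task.lease_expires_at = now + LEASE_SECONDS
    return True


def reap_expired(tasks: List[Task], now: float) -> List[str]:
    """Return running tasks whose lease has expired to the queue."""
    requeued = []
    for task in tasks:
        if task.state is TaskState.RUNNING and now >= task.lease_expires_at:
            task.state = TaskState.QUEUED
            task.owner = None
            requeued.append(task.task_id)
    return requeued
```

In this sketch a crashed worker simply stops calling `heartbeat`; a periodic call to `reap_expired` then makes the task available to another worker. A real system would perform these updates atomically in the state store so that two workers cannot acquire the same task.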
Pre-Warmed Execution Pools
Instead of launching new containers per task, CloudClaw maintains pools of pre-initialized execution environments.
Reusing warm environments removes container startup from each task's critical path, which reduces queueing delay and improves throughput while staying within host limits.
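A fixed-size pool of this kind might look like the following sketch. The `Slot` and `PrewarmedPool` names are hypothetical, and creating a `Slot` stands in for starting a real container; the point is only that the number of environments is fixed at startup and tasks borrow one rather than creating one.

```python
import queue
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    """Stand-in for one pre-initialized execution environment."""
    slot_id: int
    tasks_served: int = 0


class PrewarmedPool:
    def __init__(self, size: int) -> None:
        # All environments are created once, up front. The pool never grows,
        # so the host's container limit is respected by construction.
        self._idle: "queue.Queue[Slot]" = queue.Queue(maxsize=size)
        for i in range(size):
            self._idle.put(Slot(slot_id=i))

    def acquire(self, timeout: Optional[float] = None) -> Optional[Slot]:
        """Borrow an idle environment; return None if none frees up in time."""
        try:
            return self._idle.get(timeout=timeout)
        except queue.Empty:
            return None

    def release(self, slot: Slot) -> None:
        """Return an environment to the pool for reuse by the next task."""
        slot.tasks_served += 1
        self._idle.put(slot)

    def idle_count(self) -> int:
        return self._idle.qsize()
```

When every slot is busy, `acquire` waits instead of starting another container, which is how a fixed pool trades a short wait for staying within host capacity. A production pool would also reset or clean an environment before handing it to the next task.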
Workspace Isolation and Persistence
Each task is executed in an isolated workspace.
CloudClaw enforces:
- User-scoped file isolation
- Protection against path escape (e.g., symlinks)
- File size limits
- Persistent state across tasks
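The path-escape and file-size checks listed above can be sketched as follows. The function names, the `WorkspaceError` type, and the 1 MiB limit are assumptions for illustration, not CloudClaw's actual API. The sketch resolves symlinks and `..` segments before comparing the result against the user's workspace root, and requires Python 3.9 or later for `Path.is_relative_to`.

```python
from pathlib import Path

MAX_FILE_BYTES = 1024 * 1024  # assumed per-file limit (1 MiB)


class WorkspaceError(Exception):
    """Raised when a request would violate workspace isolation rules."""


def resolve_in_workspace(root: Path, relative: str) -> Path:
    """Map a user-supplied path to a real path inside the workspace root.

    resolve() follows symlinks and collapses '..', so a link that points
    outside the root, or an absolute path, is detected by the final check.
    """
    real_root = root.resolve()
    candidate = (real_root / relative).resolve()
    if not candidate.is_relative_to(real_root):
        raise WorkspaceError(f"path escapes workspace: {relative!r}")
    return candidate


def write_file(root: Path, relative: str, data: bytes) -> Path:
    """Write data inside the workspace, enforcing the size limit."""
    if len(data) > MAX_FILE_BYTES:
        raise WorkspaceError(f"file exceeds {MAX_FILE_BYTES} bytes")
    target = resolve_in_workspace(root, relative)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target
```

Because the check and the write are separate steps, this sketch is open to a race if another process swaps in a symlink between them. A hardened implementation would typically rely on OS-level mechanisms as well, such as per-user mounts.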
Unified Runtime Abstraction
CloudClaw uses an adapter-based design to support multiple runtimes.
Core components include:
- RuntimeAdapter for executing tasks
- WorkspaceManager for managing execution environments
- PoolAdapter for managing execution pools
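One way to picture how these three components fit together is the sketch below. The component names come from the list above, but every method signature, the `TaskRequest` and `TaskResult` types, and the toy `EchoRuntime`, `FixedPool`, and `PrefixWorkspaces` classes are hypothetical and exist only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class TaskRequest:
    task_id: str
    user_id: str
    prompt: str


@dataclass
class TaskResult:
    task_id: str
    ok: bool
    output: str


class RuntimeAdapter(Protocol):
    def run(self, request: TaskRequest, workspace: str) -> TaskResult: ...


class WorkspaceManager(Protocol):
    def prepare(self, user_id: str) -> str: ...


class PoolAdapter(Protocol):
    def acquire(self) -> RuntimeAdapter: ...
    def release(self, runtime: RuntimeAdapter) -> None: ...


def execute(request: TaskRequest, pool: PoolAdapter,
            workspaces: WorkspaceManager) -> TaskResult:
    """Scheduler-side code depends only on the three interfaces."""
    runtime = pool.acquire()
    try:
        return runtime.run(request, workspaces.prepare(request.user_id))
    finally:
        pool.release(runtime)  # always hand the runtime back, even on error


# Toy implementations (assumed, for illustration only).
class EchoRuntime:
    def run(self, request: TaskRequest, workspace: str) -> TaskResult:
        return TaskResult(request.task_id, True, f"[{workspace}] {request.prompt}")


class FixedPool:
    def __init__(self, runtime: RuntimeAdapter) -> None:
        self._runtime = runtime
        self.in_use = 0

    def acquire(self) -> RuntimeAdapter:
        self.in_use += 1
        return self._runtime

    def release(self, runtime: RuntimeAdapter) -> None:
        self.in_use -= 1


class PrefixWorkspaces:
    def __init__(self, base: str) -> None:
        self._base = base

    def prepare(self, user_id: str) -> str:
        return f"{self._base}/{user_id}"
```

Under these assumptions, supporting a new agent runtime means writing another class with a `run` method; `execute` and the rest of the scheduling code stay unchanged.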
Structured Result Delivery
Execution results are transformed into structured outputs and stored persistently.
This design decouples execution from result consumption.
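As a rough illustration of that decoupling, the sketch below converts raw runtime output into a structured record and persists it, so a consumer can fetch it later without talking to the worker. The record fields, the `ARTIFACT:` line convention, and the use of SQLite are all assumptions; the source does not specify the result schema or storage backend.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class StructuredResult:
    task_id: str
    status: str                      # "SUCCEEDED" or "FAILED"
    summary: str
    artifacts: List[str] = field(default_factory=list)


def to_structured(task_id: str, exit_code: int, raw_output: str) -> StructuredResult:
    """Turn raw output into a structured record (assumed line format)."""
    lines = raw_output.strip().splitlines()
    artifacts = [line.split(":", 1)[1].strip()
                 for line in lines if line.startswith("ARTIFACT:")]
    return StructuredResult(
        task_id=task_id,
        status="SUCCEEDED" if exit_code == 0 else "FAILED",
        summary=lines[-1] if lines else "",
        artifacts=artifacts,
    )


class ResultStore:
    """Persists results as JSON rows keyed by task id."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS results "
            "(task_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )

    def put(self, result: StructuredResult) -> None:
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR REPLACE INTO results VALUES (?, ?)",
                (result.task_id, json.dumps(asdict(result))),
            )

    def get(self, task_id: str) -> Optional[StructuredResult]:
        row = self._db.execute(
            "SELECT body FROM results WHERE task_id = ?", (task_id,)
        ).fetchone()
        return StructuredResult(**json.loads(row[0])) if row else None
```

The worker calls `put` once when the task finishes; clients call `get` whenever they are ready, which may be long after the execution environment has been returned to its pool.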
Use Cases
Multi-Tenant Agent Platforms
CloudClaw enables reliable execution of large numbers of agent tasks across multiple users.
Large-Scale Agent Experimentation
Researchers can run concurrent experiments with reproducible workflows.
AI Automation Systems
CloudClaw supports long-running automation tasks such as code generation.
Multi-Runtime Deployment
Different agent runtimes can be integrated into a single system.
Conclusion
CloudClaw addresses a fundamental gap in current LLM agent infrastructure.
By introducing recoverable execution, persistent state management, and unified runtime abstraction, it enables reliable, scalable, and multi-tenant agent execution in cloud environments, even under strict container capacity constraints.