Recovery - BlobHub

A session_agent_harness section is designed so that a clean restart picks up exactly where it left off. The authoritative state is split between server-side envelopes (handoff) and local YAML files (local recovery), and the worker writes the local file before any externally observable side-effect.
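The write-before-side-effect ordering described above can be illustrated with a minimal sketch. The helper name persist_then_act is hypothetical, and the temp-file-plus-rename technique is an assumption about one reasonable way to make the local write durable; the source does not say how the worker writes its files.

```python
import os
import tempfile
from typing import Callable


def persist_then_act(path: str, text: str, side_effect: Callable[[], None]) -> None:
    """Durably write `text` to `path`, then run the external side effect.

    Hypothetical helper: the atomic-rename approach is an assumption, not
    the harness's documented behavior.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # force the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic swap within the same directory
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Only now is it safe to do anything another party could observe.
    side_effect()
```

If the process dies before side_effect runs, the local file already reflects the intended state, so a restart can redo the external step rather than guess whether it happened.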

Worker restart

A graceful shutdown (SIGINT, SIGTERM, q in the TUI) leaves:

instance.yaml removed.
Each active thread’s local thread.yaml snapshotted at the last completed turn boundary.

The worker object is left in place (its instance.status only ever holds attached; last_seen_at goes stale) and is refreshed on the next attach. An ungraceful exit (SIGKILL, OS crash) skips both steps, leaving a stale instance.yaml behind. The next start handles both cases. For each section:

The instance lock check accepts a dead-PID instance.yaml and replaces it.
Credentials and identity are re-verified against /v1/users/me.
The worker session object is read; if metadata.user.user_id matches, the section re-attaches (overwriting instance with a fresh block). If it differs, the section refuses with SESSION_OWNED_BY_DIFFERENT_USER (see Worker object).
The session-event poll resumes from events.last_processed_at.
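The dead-PID check in the first step might look like the sketch below. It assumes a POSIX system and that the PID recorded in instance.yaml has already been parsed into an integer; the function names pid_is_alive and lock_is_stale are hypothetical, and the real lock check may consider more than the PID.

```python
import os
from typing import Optional


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permissions
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_is_stale(recorded_pid: Optional[int]) -> bool:
    """A lock is replaceable when no PID was recorded or the PID is dead."""
    return recorded_pid is None or not pid_is_alive(recorded_pid)
```

PIDs can be reused by the operating system, so a check like this can report a stale lock as live; it never reports a live worker as dead.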

Per-thread recovery

A thread is recoverable only when both views agree it was running: the persisted local thread.yaml shows agent.state == "active" and the server envelope shows instance.state == "active". (The local thread.yaml keeps the agent state under agent.state; the wire envelope keeps it under instance.state — see Job Session Object.) For each such thread, the worker runs:

Re-validate workspace.work_folder (same matrix as the initial activation).
Acquire a concurrency slot.
Resume the agent session via the codeagents SDK using the persisted agent_session_id. If the resume fails, the thread transitions to failed with AGENT_CRASHED (a failed resume is reported the same way as any other agent crash; there is no separate resume error code).
Cancel any leftover pending_prompt by posting pending_prompt_resolved with reason: worker_restart (see Interactive prompts).
Replay thread items posted while the worker was down: fetch list_session_thread_items(created_since = items.last_consumed.created_at), drop the worker's own (self) items, and feed the remaining user items as the prompt for the first post-recovery turn.
Post a thread_recovered activity-log item on the worker thread.
Resume normal active behavior.
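Two pieces of the procedure above lend themselves to a short sketch: the "both views agree" test and the replay filter. The dictionary shapes follow the agent.state and instance.state paths named in this section; the item fields author_id, created_at and text, and both function names, are assumptions made for illustration only.

```python
from typing import Any, Dict, List


def is_recoverable(local_thread: Dict[str, Any], envelope: Dict[str, Any]) -> bool:
    """True only when local thread.yaml and the server envelope both say active."""
    local_state = local_thread.get("agent", {}).get("state")
    server_state = envelope.get("instance", {}).get("state")
    return local_state == "active" and server_state == "active"


def build_replay_prompt(items: List[Dict[str, Any]], worker_author_id: str) -> str:
    """Drop the worker's own items and join the rest, oldest first.

    `items` is assumed to be the result already fetched with created_since;
    the field names used here are hypothetical.
    """
    user_items = [i for i in items if i["author_id"] != worker_author_id]
    user_items.sort(key=lambda i: i["created_at"])  # ISO 8601 strings sort by time
    return "\n\n".join(i["text"] for i in user_items)
```

Sorting on the raw created_at string only works when every timestamp uses the same ISO 8601 format and offset; parse to datetime objects if that cannot be guaranteed.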

Recovery runs before activation of any new pending threads on the same section, so threads already running keep their concurrency slots; new pending work waits if the configured concurrency.max_agents ceiling is full.

If max_agents was lowered across restart

If you reduced concurrency.max_agents and there are more recoverable active threads than slots, the overflow threads are logged as thread_recover_deferred and remain inert until a future restart with enough slots. They are not transitioned to failed.
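The slot accounting described in the last two sections can be sketched as a pure function. The name plan_recovery is hypothetical, and picking the overflow by list order is an assumption; the source does not say which threads are deferred when slots run out.

```python
from typing import Dict, List


def plan_recovery(
    recoverable: List[str], pending: List[str], max_agents: int
) -> Dict[str, List[str]]:
    """Give slots to recoverable threads first, then to pending ones."""
    resumed = recoverable[:max_agents]
    deferred = recoverable[max_agents:]  # logged as thread_recover_deferred, not failed
    free_slots = max_agents - len(resumed)
    return {
        "resumed": resumed,
        "deferred": deferred,
        "activated": pending[:free_slots],
        "waiting": pending[free_slots:],
    }
```

For example, three recoverable threads with max_agents set to 2 yield two resumed threads and one deferred thread, and every pending thread waits.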

Agent crash (worker still running)

An agent process that exits unexpectedly transitions its thread to failed with AGENT_CRASHED. The worker writes instance.state = "failed" on the envelope and records the error code and message in the local thread.yaml; the error is not placed on the envelope. The failure surfaces server-side through a thread_failed activity item (carrying the code and message) on the worker thread. The worker does not auto-retry. To resume, the user updates the envelope to instance.state = "pending" (see Handoff).

Network blips

Transient errors from api.blobhub.io (429, 5xx, connection timeouts) are classified as API_RATE_LIMITED / API_TRANSIENT_ERROR / API_NETWORK_ERROR, logged, and retried with exponential backoff up to polling.backoff_max_ms (default 30 s). They do not change any state machine. In the TUI they surface as a warning ribbon; in headless mode they appear in the JSON log.
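A capped exponential backoff of the kind described above can be sketched as follows. The 30 000 ms ceiling mirrors the stated default for polling.backoff_max_ms; the 500 ms starting delay, the doubling factor, the function names, and the exact status-to-code mapping are assumptions for illustration.

```python
import itertools
from typing import Iterator, Optional


def classify(status: Optional[int]) -> Optional[str]:
    """Map an HTTP status (None = no response, e.g. a timeout) to an error code."""
    if status is None:
        return "API_NETWORK_ERROR"
    if status == 429:
        return "API_RATE_LIMITED"
    if 500 <= status <= 599:
        return "API_TRANSIENT_ERROR"
    return None  # not a transient error; no retry


def backoff_delays_ms(base_ms: int = 500, max_ms: int = 30_000) -> Iterator[int]:
    """Yield retry delays that double each time and never exceed max_ms."""
    delay = min(base_ms, max_ms)
    while True:
        yield delay
        delay = min(delay * 2, max_ms)


# First eight delays with the assumed defaults:
# [500, 1000, 2000, 4000, 8000, 16000, 30000, 30000]
print(list(itertools.islice(backoff_delays_ms(), 8)))
```

Production retry loops usually add random jitter to each delay so that many workers do not retry in lockstep; it is left out here to keep the schedule easy to read.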

What does not survive a restart

The in-memory inbound queue beyond items.last_consumed (replayed from the server).
In-flight agent turn state beyond the resumable agent_session_id (recovered by the SDK).
Agent stdout that hadn’t yet been posted — only present in the local thread.log (not on BlobHub).
An unresolved interactive prompt — cancelled on recovery.

Detachment while running

If someone deletes the worker session object (e.g. via delete_session_object from the API or playground), the worker observes the session_object_deleted event and stops that section with SESSION_DETACHED_EXTERNALLY. Other sections continue running. All ThreadAgents for that section shut down cleanly (the codeagents SDK is asked to cancel each session). Their threads remain in whatever state they were in on the server, and the worker reattaches if you re-create the worker marker.