A session_agent_harness section is designed so that a clean restart picks up exactly where it
left off. The authoritative state is split between server-side envelopes (handoff) and local
YAML files (local recovery), and the worker writes the local file before any externally
observable side-effect.
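The write-before-side-effect ordering can be sketched as follows. This is a minimal illustration, not the harness's actual code: persist_then_act is a hypothetical helper, and the temp-file-plus-rename technique is an assumption about how a durable local write might be done.

```python
import os
import tempfile
from typing import Callable


def persist_then_act(path: str, snapshot: str, side_effect: Callable[[], None]) -> None:
    """Durably write `snapshot` to `path`, and only then run `side_effect`.

    Hypothetical helper: if the write fails, the side-effect never runs, so the
    local file is never behind what the outside world has observed.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory, so os.replace is a same-filesystem rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(snapshot)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomically swap in the new snapshot
    except BaseException:
        # Clean up the partial temp file and propagate; side_effect is skipped.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    side_effect()
```

If the process dies between the rename and the side-effect, the local file is ahead of the server rather than behind it, which is the safer direction for recovery.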
Worker restart
A graceful shutdown (SIGINT, SIGTERM, q in the TUI) leaves:
- instance.yaml removed.
- Each active thread’s local thread.yaml snapshotted at the last completed turn boundary.

The worker object is left in place (its instance.status only ever holds attached; last_seen_at
goes stale) and is refreshed on the next attach.
An ungraceful exit (SIGKILL, OS crash) skips those steps. The next start handles both cases.
On the next start, for each section:
- The instance lock check accepts a dead-PID instance.yaml and replaces it.
- Credentials and identity are re-verified against /v1/users/me.
- The worker session object is read; if metadata.user.user_id matches, the section re-attaches (overwriting instance with our fresh block). If it differs, the section refuses with SESSION_OWNED_BY_DIFFERENT_USER (see Worker object).
- The session-event poll resumes from events.last_processed_at.
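The lock and ownership checks above might look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, the liveness probe is POSIX-only (signal 0), and the worker object is modelled as a plain dict with the metadata.user.user_id path named in the text.

```python
import os
from typing import Callable, Optional


def pid_is_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks that a process exists without signalling it."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the recorded PID is dead
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def may_replace_lock(lock_pid: Optional[int],
                     is_alive: Callable[[int], bool] = pid_is_alive) -> bool:
    """A lock file with no PID, or with a dead PID, may be replaced."""
    return lock_pid is None or not is_alive(lock_pid)


def reattach_decision(worker_object: dict, my_user_id: str) -> str:
    """Return "reattach" on a user_id match, else the refusal code from the text."""
    owner = worker_object["metadata"]["user"]["user_id"]
    if owner == my_user_id:
        return "reattach"
    return "SESSION_OWNED_BY_DIFFERENT_USER"
```

Passing is_alive as a parameter keeps the takeover rule testable without depending on real process IDs.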
Per-thread recovery
A thread is recoverable only when both views agree it was running: the persisted local thread.yaml
shows agent.state == "active" and the server envelope shows instance.state == "active". (The local
thread.yaml keeps the agent state under agent.state; the wire envelope keeps it under
instance.state. See Job Session Object.) For each such
thread, the worker runs:
- Re-validate workspace.work_folder (same matrix as the initial activation).
- Acquire a concurrency slot.
- Resume the agent session via the codeagents SDK using the persisted agent_session_id. If the resume fails, the thread transitions to failed with AGENT_CRASHED (a failed resume is reported the same way as any other agent crash; there is no separate resume error code).
- Cancel any leftover pending_prompt by posting pending_prompt_resolved with reason: worker_restart (see Interactive prompts).
- Replay thread items posted while the worker was down: fetch list_session_thread_items(created_since = items.last_consumed.created_at), drop self items, feed remaining user items as the first post-recovery turn’s prompt.
- Post a thread_recovered activity-log item on the worker thread.
- Resume normal active behavior.
Recovery runs before the worker activates any pending threads on the same section, so threads
already running keep their concurrency slots; new pending work waits if the configured
concurrency.max_agents ceiling is full.
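Two pieces of the per-thread recovery logic, the recoverability test and the replay filter, can be sketched as pure functions. The agent.state and instance.state paths come from the text; the item fields author, text, and created_at and both function names are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional


def is_recoverable(local_thread: Dict, envelope: Dict) -> bool:
    """Both the local thread.yaml and the server envelope must report "active"."""
    local_state = local_thread.get("agent", {}).get("state")
    wire_state = envelope.get("instance", {}).get("state")
    return local_state == "active" and wire_state == "active"


def build_recovery_prompt(items: List[Dict], self_author: str) -> Optional[str]:
    """Drop the worker's own items and join the rest, oldest first, into one prompt.

    Assumes each item carries hypothetical "author", "text", and "created_at"
    fields, and that created_at values are ISO-8601 strings in one timezone so
    they sort correctly as plain strings.
    """
    user_items = [item for item in items if item["author"] != self_author]
    if not user_items:
        return None  # nothing arrived while the worker was down
    user_items.sort(key=lambda item: item["created_at"])
    return "\n\n".join(item["text"] for item in user_items)
```

Returning None when no user items remain lets the caller skip the replay turn and go straight to normal active behavior.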
If max_agents was lowered across restart
If you reduced concurrency.max_agents and there are more recoverable active threads than slots,
the overflow threads are logged as thread_recover_deferred and remain inert until a future restart
with enough slots. They are not transitioned to failed.
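The deferral rule amounts to splitting the recoverable threads at the slot limit. The sketch below assumes threads are given slots in list order; the text does not say which threads win when there are too few slots, so that ordering and the name plan_recovery are illustrative only.

```python
from typing import List, Tuple


def plan_recovery(recoverable: List[str], max_agents: int) -> Tuple[List[str], List[str]]:
    """Split recoverable thread ids into (recovered now, deferred).

    Deferred threads would be logged as thread_recover_deferred and left
    untouched; they are not failed.
    """
    slots = max(max_agents, 0)  # guard against a negative configured value
    return recoverable[:slots], recoverable[slots:]
```

For example, three recoverable threads with max_agents lowered to 2 yields two recovered threads and one deferred thread.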
Agent crash (worker still running)
An agent process that exits unexpectedly transitions its thread to failed with AGENT_CRASHED. The
worker writes instance.state = "failed" on the envelope and records the error code and message in the
local thread.yaml; the error is not placed on the envelope. The failure surfaces server-side through
a thread_failed activity item (carrying the code and message) on the worker thread. The worker does
not auto-retry. To resume, the user updates the envelope to instance.state = "pending" (see
Handoff).
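Where each piece of crash information ends up can be illustrated with plain dicts. This is a hedged sketch: the error key in the local record, the shape of the activity item, the assumption that the local agent.state also moves to failed, and the function name are all hypothetical. Only the state value, the AGENT_CRASHED code, and the thread_failed item type come from the text.

```python
from typing import Dict, Tuple


def record_agent_crash(envelope: Dict, local_thread: Dict,
                       message: str) -> Tuple[Dict, Dict, Dict]:
    """Return (new envelope, new local thread.yaml data, activity item)."""
    # The envelope only changes state; no error details are written to it.
    new_envelope = {
        **envelope,
        "instance": {**envelope.get("instance", {}), "state": "failed"},
    }
    # The local record keeps the code and message (hypothetical "error" key).
    new_local = {
        **local_thread,
        "agent": {**local_thread.get("agent", {}), "state": "failed"},
        "error": {"code": "AGENT_CRASHED", "message": message},
    }
    # The server-side signal is an activity item posted on the worker thread.
    activity = {"type": "thread_failed", "code": "AGENT_CRASHED", "message": message}
    return new_envelope, new_local, activity
```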
Network blips
Transient errors from api.blobhub.io (429, 5xx, connection timeouts) are classified as
API_RATE_LIMITED / API_TRANSIENT_ERROR / API_NETWORK_ERROR, logged, and retried with
exponential backoff up to polling.backoff_max_ms (default 30 s). They do not change any state
machine. In the TUI they surface as a warning ribbon; in headless mode they appear in the JSON log.
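The classification and capped backoff can be sketched as below. The one-to-one mapping (429 to API_RATE_LIMITED, 5xx to API_TRANSIENT_ERROR, timeouts to API_NETWORK_ERROR) is inferred from the order in which the text lists them, and the 500 ms base delay and doubling factor are assumptions; only the 30 s default ceiling is stated above.

```python
from typing import Optional


def classify(status: Optional[int], timed_out: bool = False) -> Optional[str]:
    """Map a failed request to an error class, or None if it is not transient."""
    if timed_out:
        return "API_NETWORK_ERROR"
    if status == 429:
        return "API_RATE_LIMITED"
    if status is not None and 500 <= status <= 599:
        return "API_TRANSIENT_ERROR"
    return None


def backoff_ms(attempt: int, base_ms: int = 500, backoff_max_ms: int = 30_000) -> int:
    """Delay before retry number `attempt` (0-based): doubles, then caps."""
    return min(base_ms * (2 ** attempt), backoff_max_ms)
```

With these assumed defaults the delays run 500, 1000, 2000, and so on, reaching the 30 000 ms cap at the seventh retry. A production client would typically add random jitter, which this sketch omits.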
What does not survive a restart
- The in-memory inbound queue beyond items.last_consumed (replayed from the server).
- In-flight agent turn state beyond the resumable agent_session_id (recovered by the SDK).
- Agent stdout that hadn’t yet been posted: it is only present in the local thread.log (not on BlobHub).
- An unresolved interactive prompt (cancelled on recovery).
Detachment while running
If someone deletes the worker session object (e.g. via delete_session_object from the API or
playground), the worker observes the session_object_deleted event and stops that section with
SESSION_DETACHED_EXTERNALLY. Other sections continue running. All ThreadAgents for that section
shut down cleanly (the codeagents SDK is asked to cancel each session). Their threads remain in
whatever state they were in on the server, and the worker reattaches if you re-create the worker
marker.

