s11: Error Recovery — Errors aren't the end, they're the start of a retry

"Errors aren't the end, they're the start of a retry" — escalate tokens, compact context, switch models.

Harness layer: Resilience — classify and recover when the main loop hits errors.


The Problem

The Agent is running along and then errors out:

Error: 529 overloaded

The Agent crashes. It doesn't retry, doesn't switch models, doesn't reduce context — it just crashes.

In production, API errors are the norm. The three most common failure modes: truncated output (the model runs out of tokens mid-sentence), context overflow (still too long even after compaction), and transient failures (429 rate limiting / 529 overload). An Agent that doesn't handle errors is like a car that stalls at the slightest touch.


Solution

[Figure: Error Recovery Overview]

The loop and prompt assembly from s10 are fully preserved. The only change: the LLM call is wrapped in try/except, with a different recovery path for each error type. After recovery, a continue statement jumps back to the top of the loop to call the LLM again.

The three most common recovery patterns are listed below. For transient failures, the teaching version only handles 429/529; real systems also cover connection errors, timeouts, cloud vendor credential caches, etc. CC (Claude Code) actually has 13+ reason codes; see the Deep Dive for the rest.

| Pattern | Trigger | Recovery Action |
| --- | --- | --- |
| Output truncated | max_tokens | Escalate 8K→64K / continuation prompt |
| Context overflow | prompt_too_long | Reactive compact → retry |
| Transient failure | 429 / 529 | Exponential backoff + jitter, fallback model on consecutive 529 |
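
As a rough sketch of the classification in the table above, the following maps what the loop observed to a recovery pattern. The Failure type, its field names, and the returned action strings are hypothetical names chosen for illustration; they are not taken from the teaching code or from CC.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Failure:
    """Hypothetical container for what the loop observed after an LLM call."""
    stop_reason: Optional[str] = None   # e.g. "max_tokens" on a truncated response
    error_type: Optional[str] = None    # e.g. "prompt_too_long" reported by the API
    status_code: Optional[int] = None   # e.g. 429 or 529


def classify(failure: Failure) -> str:
    """Return the recovery pattern from the table, or "unhandled"."""
    if failure.stop_reason == "max_tokens":
        return "escalate_or_continue"
    if failure.error_type == "prompt_too_long":
        return "reactive_compact"
    if failure.status_code in (429, 529):
        return "backoff_retry"
    return "unhandled"


print(classify(Failure(status_code=529)))  # backoff_retry
```

Anything that falls through to "unhandled" corresponds to the cases the teaching version logs and exits on.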

How It Works

Path 1: Output Truncated

The model runs out of tokens mid-sentence — max_tokens is exhausted. The default 8000 tokens isn't enough for a complete response.

On the first occurrence, escalate max_tokens from 8K to 64K (8x the space) and retry the same request — the truncated output is NOT appended to messages, keeping the original request intact. If 64K is still not enough, save the truncated output and inject a continuation prompt telling the model to pick up where it left off, up to 3 times:

if response.stop_reason == "max_tokens":
    # First escalation: don't append truncated output, retry same request
    if not state.has_escalated:
        max_tokens = ESCALATED_MAX_TOKENS
        state.has_escalated = True
        continue  # messages unchanged, same request with more tokens
    # 64K still truncated: save output + continuation prompt
    messages.append({"role": "assistant", "content": response.content})
    if state.recovery_count < MAX_RECOVERY_RETRIES:
        messages.append({"role": "user", "content":
            "Output token limit hit. Resume directly — "
            "no apology, no recap. Pick up mid-thought."})
        state.recovery_count += 1
        continue
    return  # still truncated after 3 continuations
# Normal: append after max_tokens check
messages.append({"role": "assistant", "content": response.content})

Escalation gets one chance; continuation gets up to 3. After that, exit — further continuations won't produce meaningful output.
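
The "one escalation, then up to 3 continuations" rule can be sketched as a small state machine. The RecoveryState fields below follow the names used in the snippets in this section, but the dataclass layout, the placeholder model ID, and the next_truncation_step helper are assumptions made for illustration.

```python
from dataclasses import dataclass

MAX_RECOVERY_RETRIES = 3  # continuation attempts, as described above


@dataclass
class RecoveryState:
    """Assumed shape of the per-run recovery state."""
    current_model: str = "primary-model"  # placeholder model ID
    has_escalated: bool = False
    recovery_count: int = 0
    has_attempted_reactive_compact: bool = False
    consecutive_529: int = 0


def next_truncation_step(state: RecoveryState) -> str:
    """Decide what to do after a max_tokens stop, updating state in place."""
    if not state.has_escalated:
        state.has_escalated = True
        return "escalate"
    if state.recovery_count < MAX_RECOVERY_RETRIES:
        state.recovery_count += 1
        return "continue"
    return "give_up"


state = RecoveryState()
steps = [next_truncation_step(state) for _ in range(5)]
print(steps)  # ['escalate', 'continue', 'continue', 'continue', 'give_up']
```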

Path 2: Context Overflow

The LLM says "your context is too long" (prompt_too_long). All four compaction layers from s08 have already run, and it's still over the limit.

Trigger reactive compact, which is more aggressive than auto compact. The teaching version keeps only the last 5 messages to simulate compaction; real CC generates a compact summary via LLM. Either way, the request is then retried with the compacted message list. If it's still over the limit after one compaction, the only option is to exit, because compacting again won't make it any smaller:

except PromptTooLongError:
    if not state.has_attempted_reactive_compact:
        messages[:] = reactive_compact(messages)
        state.has_attempted_reactive_compact = True
        continue
    return  # Already compacted and still over limit — must exit
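
A minimal sketch of what the teaching-style reactive_compact might look like, assuming it does nothing more than keep the most recent messages. The keep_last parameter name is an assumption; the value 5 comes from the description above.

```python
def reactive_compact(messages, keep_last=5):
    """Stand-in for compaction: keep only the most recent messages.

    Returns a new list; the caller assigns it back with messages[:] = ...
    """
    return list(messages[-keep_last:])


history = [{"role": "user", "content": f"message {i}"} for i in range(12)]
history[:] = reactive_compact(history)
print(len(history), history[0]["content"])  # 5 message 7
```

The slice assignment mutates the existing list, so any caller holding a reference to it sees the compacted history. Naive truncation like this can also separate a tool call from its result, which is one reason a real implementation would summarize rather than cut.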

Path 3: Transient Failures

Network blips, 429 rate limiting, 529 overload — these aren't bugs, they're normal in distributed systems.

Both 429 and 529 use exponential backoff with jitter: the base wait is 0.5 seconds after the first failure, 1 second after the second, 2 seconds after the third, and so on, for up to 10 attempts. Random jitter prevents concurrent requests from all retrying at the same instant. Three consecutive 529 overload errors → switch to the fallback model (if the FALLBACK_MODEL_ID environment variable is configured):

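import random
import time

def retry_delay(attempt, retry_after=None):
    # A server-provided Retry-After value (in seconds) takes priority
    if retry_after:
        return retry_after
    base = min(500 * (2 ** attempt), 32000) / 1000  # seconds, capped at 32s
    return base + random.uniform(0, base * 0.25)    # add up to 25% jitter

def with_retry(fn, state, max_retries=10):
    for attempt in range(max_retries):
        try:
            result = fn()
            state.consecutive_529 = 0  # a success ends the 529 streak
            return result
        except (RateLimitError, OverloadedError) as e:
            if isinstance(e, OverloadedError):
                state.consecutive_529 += 1
                if state.consecutive_529 >= 3 and FALLBACK_MODEL:
                    # fn reads state.current_model, so the next attempt uses the fallback
                    state.current_model = FALLBACK_MODEL
            else:
                state.consecutive_529 = 0  # a 429 also ends the 529 streak
            # pass retry_after here if the error exposes a Retry-After value
            time.sleep(retry_delay(attempt))
    raise MaxRetriesExceeded()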

Backoff formula: delay = min(500 × 2^attempt, 32000) ms, plus random jitter of 0-25% of that base, where attempt starts at 0. If the server returns a Retry-After header, that value takes priority.
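
To make the formula concrete, this short sketch prints the base delay for each of the 10 attempts and an upper bound on the total time spent sleeping if every attempt fails. The helper name base_delay_seconds is an assumption; the constants come from the formula above.

```python
def base_delay_seconds(attempt):
    """Base delay before jitter, for a 0-indexed attempt."""
    return min(500 * (2 ** attempt), 32000) / 1000


schedule = [base_delay_seconds(a) for a in range(10)]
print(schedule)
# [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 32.0, 32.0, 32.0]

# Worst case: every attempt fails and every jitter draw hits the 25% maximum.
worst_case = sum(d * 1.25 for d in schedule)
print(worst_case)  # 199.375
```

So with these settings a request that never succeeds gives up after a little over three minutes of waiting, not counting the time spent in the calls themselves.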

Putting It All Together

def agent_loop(messages, context):
    system = get_system_prompt(context)
    state = RecoveryState()
    max_tokens = 8000

    while True:
        try:
            response = with_retry(
                lambda: client.messages.create(
                    model=state.current_model, system=system,
                    messages=messages, tools=TOOLS,
                    max_tokens=max_tokens),
                state)
        except Exception as e:
            if is_prompt_too_long_error(e):
                if not state.has_attempted_reactive_compact:
                    messages[:] = reactive_compact(messages)
                    state.has_attempted_reactive_compact = True
                    continue
                return
            log_error(e)
            return

        # max_tokens check BEFORE appending to messages
        if response.stop_reason == "max_tokens":
            if not state.has_escalated:
                max_tokens = 64000
                state.has_escalated = True
                continue  # retry same request, messages unchanged
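            # save truncated output + continuation prompt (up to 3 times)
            messages.append({"role": "assistant", "content": response.content})
            if state.recovery_count >= MAX_RECOVERY_RETRIES:
                return  # still truncated after 3 continuations
            messages.append({"role": "user", "content": CONTINUATION_PROMPT})
            state.recovery_count += 1
            continue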
        # Normal completion
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            return
        # ... tool execution ...

The outer try/except catches API exceptions (prompt_too_long, etc.), with_retry handles transient errors (429/529), and stop_reason checks handle truncation. Three recovery mechanisms, each handling its own error type.
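
One way to exercise the with_retry layer without a live API is to feed it a fake call that fails with 529 until the model changes. This is a sketch under stated assumptions: the exception classes are local stand-ins rather than SDK types, the model IDs are placeholders, and the injectable sleep parameter is added here only so the demo runs instantly.

```python
import random
import time
from types import SimpleNamespace


class RateLimitError(Exception): ...      # stand-in for a 429
class OverloadedError(Exception): ...     # stand-in for a 529
class MaxRetriesExceeded(Exception): ...


FALLBACK_MODEL = "fallback-model"  # placeholder model ID


def with_retry(fn, state, max_retries=10, sleep=time.sleep):
    for attempt in range(max_retries):
        try:
            return fn()
        except (RateLimitError, OverloadedError) as e:
            if isinstance(e, OverloadedError):
                state.consecutive_529 += 1
                if state.consecutive_529 >= 3 and FALLBACK_MODEL:
                    state.current_model = FALLBACK_MODEL
            base = min(500 * (2 ** attempt), 32000) / 1000
            sleep(base + random.uniform(0, base * 0.25))
    raise MaxRetriesExceeded()


state = SimpleNamespace(current_model="primary-model", consecutive_529=0)
calls = []


def fake_llm_call():
    """Fails with 529 on the primary model, succeeds on any other."""
    calls.append(state.current_model)
    if state.current_model == "primary-model":
        raise OverloadedError("529 overloaded")
    return "ok"


delays = []  # record delays instead of actually sleeping
result = with_retry(fake_llm_call, state, sleep=delays.append)
print(result, calls)
# ok ['primary-model', 'primary-model', 'primary-model', 'fallback-model']
```

Because the fake call reads state.current_model each time it runs, the switch made inside with_retry takes effect on the very next attempt, which mirrors how the lambda in agent_loop picks up the model.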


Changes from s10

| Component | Before (s10) | After (s11) |
| --- | --- | --- |
| Error handling | None (crashes on any error) | Three recovery patterns + exponential backoff |
| New constants | | ESCALATED_MAX_TOKENS=64000, MAX_RETRIES=10, BASE_DELAY_MS=500, FALLBACK_MODEL |
| New functions | | with_retry, retry_delay, reactive_compact, is_prompt_too_long_error, RecoveryState |
| Tools | bash, read_file, write_file (3) | bash, read_file, write_file (3), unchanged |
| Loop | Bare LLM call | Wrapped in try/except + continue retry |

Try It

cd learn-claude-code
python s11_error_recovery/code.py

Try these prompts:

  1. Ask the Agent to generate a very long piece of code, and observe whether it automatically continues after truncation (look for the [max_tokens] escalating log)
  2. Read many files consecutively to bloat the context, and observe reactive compact
  3. If you encounter 429/529, observe the exponential backoff log output

What's Next

The Agent can now automatically recover from errors. But the tasks it handles are still one-shot — you give it a task, it finishes, it's done.

What if the Agent could manage a task list — with dependencies, persisted to disk, resumable across sessions? A TODO list is not a task system.

s12 Task System → Tasks form a dependency graph with state and persistence. This is the foundation for multi-Agent collaboration.

Deep Dive into CC Source

The following is based on CC source code: query.ts (1729 lines), services/api/withRetry.ts (822 lines), query/tokenBudget.ts (93 lines), and utils/tokenBudget.ts (73 lines).

1. A Dozen-Plus Reason/Transition Codes (Not Just 3)

The teaching version covers 3 of the most common recovery patterns. CC actually has a dozen-plus reason/transition codes, evaluated after every LLM call:

| Reason/Transition | Teaching Version | CC Behavior |
| --- | --- | --- |
| completed | Normal completion | Return result |
| next_turn | Normal tool call | Continue to next tool execution round |
| max_output_tokens_escalate | Path 1 | 8K→64K escalation |
| max_output_tokens_recovery | Path 1 continuation | Continuation prompt (up to 3 times) |
| reactive_compact_retry | Path 2 | Reactive compact → retry |
| prompt_too_long | Path 2 | Same as above |
| collapse_drain_retry | Not covered | Context collapse: commit staged content first |
| model_error | Not covered | Retry |
| image_error | Not covered | ImageSizeError / ImageResizeError handled specifically |
| aborted_streaming | Not covered | Streaming abort recovery |
| aborted_tools | Not covered | Tool abort |
| stop_hook_blocking | Not covered | Inject blocking error → model self-corrects |
| stop_hook_prevented | Not covered | Hooks prevent execution |
| hook_stopped | Not covered | Hook stopped execution |
| token_budget_continuation | Not covered | Continue when token usage < 90% |
| blocking_limit | Not covered | Blocking limit reached |
| max_turns | Not covered | Maximum turns reached |

The teaching version only expands on the first six rows (the most common cases); each of the rest has its own dedicated handling logic.

2. Precise Exponential Backoff Formula

CC's backoff delay (withRetry.ts:530-548):

delay = min(500 × 2^(attempt-1), 32000) + random(0~25%)

| Attempt | Base Delay | + Jitter |
| --- | --- | --- |
| 1 | 500ms | 0-125ms |
| 2 | 1000ms | 0-250ms |
| 4 | 4000ms | 0-1000ms |
| 7+ | 32000ms (cap) | 0-8000ms |

If the server returns a Retry-After header, that value takes priority.
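
The exponent here is attempt-1 while the teaching code uses attempt. The two agree once you account for indexing: the formula above counts attempts from 1, the teaching loop counts from 0. A small check, with both helper names being assumptions for illustration:

```python
def cc_base_delay_ms(attempt):
    """Base delay for a 1-indexed attempt, per the formula quoted above."""
    return min(500 * 2 ** (attempt - 1), 32000)


def teaching_base_delay_ms(attempt):
    """Base delay for a 0-indexed attempt, as in the teaching retry_delay."""
    return min(500 * 2 ** attempt, 32000)


for attempt in (1, 2, 4, 7):
    base = cc_base_delay_ms(attempt)
    # maximum jitter is 25% of the base
    print(attempt, base, base // 4)
# 1 500 125
# 2 1000 250
# 4 4000 1000
# 7 32000 8000

same = all(cc_base_delay_ms(n) == teaching_base_delay_ms(n - 1) for n in range(1, 11))
print(same)  # True
```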

3. Original CONTINUATION Prompt

CC's continuation prompt (query.ts:1225-1227):

Output token limit hit. Resume directly — no apology, no recap of what
you were doing. Pick up mid-thought if that is where the cut happened.
Break remaining work into smaller pieces.

Token budget nudge prompt (tokenBudget.ts:72):

Stopped at {pct}% of token target. Keep working — do not summarize.

4. Streaming Error Handling

In CC's streaming path, recoverable errors (413, max_tokens, media errors) are withheld from display during streaming (query.ts:788-822) — SDK consumers don't see them, only the recovery logic does. After streaming ends, the system determines whether recovery is needed.

5. 529 → Fallback Model Switch

After 3 consecutive 529 overload errors (MAX_529_RETRIES = 3), CC automatically switches to the fallback model (e.g., Opus → Sonnet). On switch, all pending messages and tool results are cleared, and the user sees "Switched to {model} due to high demand".

6. Diminishing Returns Detection

Token budget "continuations" aren't unlimited. When there are 3 consecutive continuations with a token increment < 500, the system determines "continuing won't produce meaningful output" and stops continuation (tokenBudget.ts:60-62).
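
A minimal sketch of the stopping rule just described, assuming the caller tracks how many tokens each continuation added. The function and constant names are hypothetical, and the actual logic in tokenBudget.ts may track this differently; only the thresholds (3 continuations, 500 tokens) come from the text above.

```python
MIN_USEFUL_INCREMENT = 500  # tokens; smaller gains count as a stall
STALL_LIMIT = 3             # consecutive stalled continuations before stopping


def should_stop_continuing(increments):
    """True if the last STALL_LIMIT continuations each added too few tokens."""
    if len(increments) < STALL_LIMIT:
        return False
    return all(n < MIN_USEFUL_INCREMENT for n in increments[-STALL_LIMIT:])


print(should_stop_continuing([4000, 300, 200]))       # False
print(should_stop_continuing([4000, 300, 200, 100]))  # True
```

Requiring the small increments to be consecutive means one productive continuation resets the count, so a model that stalls briefly and then resumes real work is not cut off.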