Domain 1 — seven core concepts from agentic loops to session forking, distilled for exam readiness.
The agentic loop is a while loop where Claude decides when to stop — not your code. You send a message, Claude responds, and you check stop_reason to decide what happens next.
If stop_reason == "tool_use" → execute the tool, append the result, loop again. If stop_reason == "end_turn" → Claude is done, exit.
```python
messages = [{"role": "user", "content": user_query}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-5-20250514",
        max_tokens=4096,  # required by the Messages API
        tools=my_tools,
        messages=messages
    )
    # Append Claude's response to conversation history
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason == "end_turn":
        break
    # Execute tools, append results as a user message
    tool_results = execute_tools(response.content)
    messages.append({"role": "user", "content": tool_results})
```
Tool results are appended as a user message. This is how Claude "sees" what happened — it's all conversation history.
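The `execute_tools` step can be sketched as a small helper that runs each `tool_use` block and packages the outputs as `tool_result` blocks. A minimal sketch, assuming dict-shaped content blocks and an explicitly passed tool registry (the real SDK returns typed objects, not dicts):

```python
def execute_tools(content_blocks, tool_registry):
    """Run each tool_use block and package outputs as tool_result blocks."""
    results = []
    for block in content_blocks:
        if block["type"] != "tool_use":
            continue  # skip plain text blocks
        handler = tool_registry[block["name"]]
        output = handler(**block["input"])
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],  # must match the tool_use block's id
            "content": str(output),
        })
    return results
```

The `tool_use_id` linkage is what lets Claude match each result back to the call it made.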
- Parsing Claude's text for "I'm done" to decide loop termination. Use stop_reason, not natural-language signals.
- Hardcoding iteration caps as the primary stop mechanism. A safety cap is fine, but it shouldn't be the design.
- Checking for assistant text content as a completion indicator. Claude can emit text and tool calls in the same response.
Always use stop_reason as the single source of truth for loop control. Model-driven termination is the correct pattern.
Claude is the reasoning engine. Your code is the executor. The loop continues until Claude says it's done via stop_reason.
Instead of one agent doing everything, a coordinator delegates to subagents in a hub-and-spoke pattern. The coordinator is the project manager — it never does the actual work itself.
Analyze the request → Decompose into subtasks → Delegate to subagents → Aggregate results → Evaluate completeness → loop if gaps remain.
Subagents have isolated context. They do NOT inherit the coordinator's conversation, other subagents' findings, or the user's original query. You must pass everything explicitly.
Anti-pattern: every query runs through ALL subagents in sequence regardless of complexity. "What are your hours?" hits search → analysis → synthesis.
Better: the coordinator analyzes the query and picks only the subagents needed. Simple lookups skip analysis; complex queries use the full pipeline.
After synthesis, the coordinator asks: "Is this complete?" If not, it re-delegates with targeted queries and re-invokes synthesis. This iterative refinement is what separates a good system from a basic one.
Over-decomposition trap: Slicing a topic too narrowly means each subagent covers a tiny slice well but misses the connections between them. Decompose at the right granularity.
The coordinator is the brain. Subagents are the hands. All communication flows through the coordinator — giving you observability, error handling, and control.
The hands-on mechanics of spawning subagents, passing context, and configuring agent definitions.
Subagents are spawned via the Task tool. The coordinator's allowedTools must include "Task" — otherwise it cannot spawn subagents. Each Task call creates a fresh agent with its own context window.
```python
{
    "description": "Searches the web for current information",
    "system_prompt": "You are a focused research agent...",
    "allowed_tools": ["web_search", "web_fetch"]
}
```
Each subagent gets constrained tools. The description field is how the coordinator decides which subagent to invoke — it's decision logic, not documentation.
Bad, no context: `Task("search_agent", "Find more about the topic")` — the subagent has no idea what "the topic" is.
Good, explicit context: `Task("search_agent", "Find recent research on lithium-ion battery recycling. Prior analysis found: [hydro, pyro]. Focus on newer methods.")`
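Because subagents inherit nothing, the prompt itself has to carry everything. A hypothetical helper for assembling context-complete subagent prompts (function and parameter names are illustrative):

```python
def build_subagent_prompt(task, prior_findings=None, focus=None):
    """Assemble a self-contained prompt for an isolated subagent.

    The subagent sees nothing but this string, so the task, any prior
    findings, and the focus area all go in explicitly."""
    parts = [f"Task: {task}"]
    if prior_findings:
        parts.append("Prior findings: " + "; ".join(prior_findings))
    if focus:
        parts.append(f"Focus: {focus}")
    return "\n".join(parts)
```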
```json
{
  "findings": [{
    "content": "Recycling rates improved 30%",
    "source_url": "https://example.com/study",
    "document": "2024 Battery Report",
    "page": 14
  }],
  "gaps_identified": ["Cost comparison data missing"]
}
```
Separate content from metadata so downstream agents can write summaries and provide attribution.
Emit multiple Task calls in a single coordinator response to run subagents concurrently. Massively faster than sequential delegation when subtasks are independent.
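Claude Code handles parallel Task calls natively. When orchestrating subagents yourself over the API, independent subtasks can be dispatched concurrently. A minimal sketch using a thread pool, where `run_subagent` is an assumed callable that invokes one subagent and returns its result:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagents_parallel(subtasks, run_subagent):
    """Dispatch independent subtasks concurrently.

    Results come back in input order, so aggregation stays simple."""
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        return list(pool.map(run_subagent, subtasks))
```

Only use this when subtasks are genuinely independent; sequential delegation is correct when one subtask's output feeds the next.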
Procedural (weaker): "Step 1: Call search agent. Step 2: Call analysis. Step 3: Call synthesis."
Outcome-oriented (better): "Produce a research report with 5+ credible sources covering X, Y, Z. Delegate to specialists as needed. Verify coverage before delivering."
Context doesn't flow automatically — you pump it explicitly. Use structured formats to preserve attribution. Spawn in parallel when independent. Design coordinator prompts around outcomes, not procedures.
Where theory meets production. Real workflows have rules that cannot be broken and sometimes require escalation to humans.
Prompts are suggestions; hooks are guarantees. If you'd get fired for the 1% failure, use a hook. If it's a preference, a prompt is fine.
```python
def pre_tool_hook(tool_name, tool_input):
    if tool_name == "process_refund":
        if not workflow_state["customer_verified"]:
            return BLOCK("Cannot process refund: customer not verified")
        if not workflow_state["order_retrieved"]:
            return BLOCK("Cannot process refund: order not retrieved")
    return ALLOW
```
Claude doesn't even know this gating exists. It tries to call the tool, gets blocked with an explanation, and naturally adjusts within the agentic loop.
Customer says: "Wrong item, promo code didn't work, and update my address." — that's three issues. A good agent decomposes them, investigates all three in parallel using shared context, then synthesizes one unified response.
```json
{
  "customer_id": "CUST-4521",
  "issue_summary": "Defective product, requests full refund",
  "root_cause": "Shipped from damaged batch #B-229",
  "recommended_action": "Full refund $189.99 + 15% credit",
  "reason_for_escalation": "Exceeds $100 agent authority"
}
```
Escalation isn't failure — it's a designed part of the system. The agent's job is to minimize the work the human has to do by handing off a complete picture.
Programmatic gates for rules that can't fail. Parallel investigation for multi-concern requests. Structured summaries for human handoffs.
Hooks sit between Claude and the outside world, intercepting in two directions.
Different backends return dates as Unix epochs, ISO 8601, or human strings. Claude usually handles this, but sometimes makes comparison errors. A PostToolUse hook fixes this deterministically.
```python
def post_tool_normalize(tool_name, tool_result):
    normalized = tool_result.copy()
    # Normalize all dates to ISO 8601
    for key in ["created_at", "ship_date", "due_date"]:
        if key in normalized:
            normalized[key] = to_iso8601(normalized[key])
    # Normalize status codes to readable strings
    if "status" in normalized and isinstance(normalized["status"], int):
        normalized["status"] = status_map.get(normalized["status"])
    return normalized
```
Claude should reason about your domain, not about data format inconsistencies. Normalize field names, null handling, error formats, and date representations before Claude ever sees them.
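The `to_iso8601` helper referenced above is left undefined in the snippet. One possible sketch, with an illustrative (not exhaustive) list of input formats; unknown formats pass through unchanged rather than raising:

```python
from datetime import datetime, timezone

def to_iso8601(value):
    """Normalize Unix epochs, ISO date strings, and common human
    formats to an ISO 8601 date string (YYYY-MM-DD)."""
    if isinstance(value, (int, float)):  # Unix epoch seconds
        return datetime.fromtimestamp(value, tz=timezone.utc).strftime("%Y-%m-%d")
    for fmt in ("%Y-%m-%d", "%B %d, %Y", "%m/%d/%Y"):  # illustrative format list
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value  # unknown format: pass through unchanged
```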
```python
def pre_tool_policy_check(tool_name, tool_input):
    if tool_name == "process_refund":
        amount = float(tool_input.get("amount", 0))
        if amount > 500.00:
            return {
                "action": "BLOCK",
                "message": "Exceeds $500 limit. Escalate to human."
            }
    return {"action": "ALLOW"}
```
The block message is instructive — it tells Claude what to do instead. Claude receives this as the tool result and naturally adjusts.
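Wiring a pre-tool hook into the executor can be sketched as follows. The hook is passed in explicitly, and a blocked call is returned as an error-style tool result rather than raised, so Claude sees the message and adjusts (function names and the result shape are illustrative):

```python
def run_tool_with_hooks(tool_name, tool_input, tool_registry, pre_hook):
    """Gate a tool call through a pre-hook before executing it.

    A BLOCK decision becomes an error tool result containing the
    instructive message; an ALLOW decision runs the tool normally."""
    decision = pre_hook(tool_name, tool_input)
    if decision["action"] == "BLOCK":
        # Claude receives this as the tool result and naturally adjusts
        return {"is_error": True, "content": decision["message"]}
    output = tool_registry[tool_name](**tool_input)
    return {"is_error": False, "content": output}
```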
| Requirement | Hook | Prompt |
|---|---|---|
| Refund limits | ✓ | ✗ |
| Data format consistency | ✓ | ✗ |
| Tone of voice | ✗ | ✓ |
| PII redaction before logging | ✓ | ✗ |
| Suggested upsells | ✗ | ✓ |
| Blocking admin tools | ✓ | ✗ |
| Preferred resolution order | ✗ | ✓ |
Post-hooks clean incoming data so Claude reasons clearly. Pre-hooks enforce rules so Claude can't break policy. Together they separate what Claude thinks about from what it's allowed to do.
How to break big problems into small ones — and knowing which pattern fits which situation.
Prompt chaining: steps are known in advance. Each feeds into the next. Best for predictable workflows like code review, report generation, multi-language translation.
Pass 1 (per-file): Each file gets focused individual analysis — full attention, no dilution. Pass 2 (cross-file): Integration pass to catch systemic patterns spanning multiple files. You'd miss things doing either pass alone.
Dynamic decomposition: steps emerge as you go. What you discover in step 1 determines what step 2 even is. Best for investigative tasks like debugging, legacy test coverage, open-ended research.
| Question | Yes → | No → |
|---|---|---|
| Do I know all steps before starting? | Prompt chaining | Dynamic |
| Does step N change what step N+1 is? | Dynamic | Prompt chaining |
| Task | Pattern |
|---|---|
| Summarize a doc in 3 languages | Prompt chaining |
| Debug why the app is slow | Dynamic decomposition |
| Review a PR against style guidelines | Prompt chaining |
| Migrate REST to GraphQL | Dynamic decomposition |
| Monthly report from 5 data sources | Prompt chaining |
| Figure out why customers churn | Dynamic decomposition |
In practice, many workflows use a fixed outer pipeline where individual steps use dynamic decomposition internally. The coordinator runs the pipeline; subagents adapt within each step.
Prompt chaining when predictable. Dynamic decomposition when investigative. Split big reviews into per-unit passes + cross-unit integration. Give agents permission to adapt via goal-oriented prompts.
How agents maintain continuity across time and explore branching paths.
```bash
# Start a named session
claude --session "auth-refactor" "Analyze the auth module for security issues"

# Resume later
claude --resume "auth-refactor" "Now suggest fixes for the top 3 issues"
```
If you edited files between sessions, the agent's context contains old tool results. It doesn't know things changed. Either tell it exactly what changed (small edits) or start fresh with an injected summary (extensive changes).
"I updated auth.py lines 45-60 and fixed the token bug in middleware.py. Re-analyze these two files." — good for small, targeted changes.
Start a new session seeded with distilled findings, status, and a list of changes. More reliable when files have changed extensively.
Fork use cases: comparing two testing strategies; trying two refactoring approaches before committing; exploring optimistic vs. conservative solutions; A/B testing prompt strategies for a subagent.
| Situation | Strategy |
|---|---|
| Files mostly unchanged | Resume the session |
| Files heavily modified | Fresh session + injected summary |
| Want to explore 2+ approaches | Fork from shared baseline |
Resume when context is fresh. Fork when exploring alternatives. Start fresh with injected summaries when the world has changed. The judgment call is always about context freshness.
Domain 2 — covering tool descriptions, error handling, tool distribution, tool_choice, MCP servers, and built-in tools.
Claude doesn't see your code. It reads tool descriptions to decide which tool to use. When descriptions are vague or overlapping, Claude guesses — and guesses wrong.
Vague, overlapping (bad):
- `analyze_content` — "Analyzes content and returns results"
- `analyze_document` — "Analyzes a document and provides analysis"

Specific (good):
- `extract_web_results` — "Extracts structured data from search result pages. Input: raw HTML. Output: JSON array of {title, url, snippet}. Use instead of summarize_content for search engine output."
Five ingredients separate great descriptions from bad ones:
1. Purpose — one clear sentence on what the tool does.
2. Input — what does Claude need to feed this tool?
3. Output — what comes back?
4. Edge cases — what happens with weird or empty input?
5. Boundaries — when to use this tool instead of another.
```
verify_claim_against_source
PURPOSE:  "Checks whether a factual claim is supported by a given source document."
INPUT:    "JSON with 'claim' (string, one sentence) and 'source_text' (string, full text)."
OUTPUT:   "JSON with 'supported' (bool), 'confidence' (0.0-1.0), and 'relevant_excerpt' (string)."
EDGE:     "Returns supported=false, confidence=0.0 if source is empty or claim not addressed."
BOUNDARY: "Use this for fact-checking a single claim. Use summarize_content for overviews. Use extract_data_points for structured data."
```
Bad — `analyze_document`: does everything vaguely; Claude has to guess which behavior you want.
Good — split by job: `extract_data_points` → structured fields; `summarize_content` → overviews; `verify_claim` → fact-checking.
Hiring three specialists (plumber, electrician, painter) instead of one generalist. Each knows their lane — results are better across the board.
Perfect tool descriptions can still fail if your system prompt contains keywords that accidentally create tool associations.
```python
# System prompt contains: "Always analyze content before responding."
# Tool name happens to be: analyze_content
# Result: Claude gravitates to this tool regardless of context,
# because the keyword match is strongest.
```
Find words like "analyze," "search," "check," "extract," "look up" in your system prompt. These are the magnets.
If any of those verbs match a tool name or description, you have a potential collision. Reword the system prompt.
Collision: system prompt says "Always analyze content carefully..." and a tool is named analyze_content.
No collision: system prompt says "Be thorough when processing requests..." alongside a tool named extract_web_results.
When a tool fails, does it give Claude enough information to recover intelligently? Generic errors lead to generic (bad) recovery.
- Transient: server timeout, temporary unavailability. Retryable.
- Input: malformed input, wrong data type. Not retryable without fixing the input.
- Business: policy violation, rule blocked. Not retryable.
- Permission: no access to this resource. Not retryable without escalation.
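A small factory for category-tagged error results, following the taxonomy above (the retryability mapping restates the categories as described; function and field names mirror the examples below but are otherwise illustrative):

```python
# Retryability per category, as the taxonomy above describes it
RETRYABLE = {
    "transient": True,    # timeouts, temporary unavailability
    "input": False,       # must fix the input first
    "business": False,    # rule blocked the action
    "permission": False,  # needs escalation, not retries
}

def make_error(category, message, **metadata):
    """Build a structured error result tagged with its category
    and retryability, plus any extra context for recovery."""
    return {
        "isError": True,
        "content": message,
        "metadata": {
            "errorCategory": category,
            "isRetryable": RETRYABLE[category],
            **metadata,
        },
    }
```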
Generic error (unhelpful):

```json
{
  "isError": true,
  "content": "Operation failed"
}
```

Structured error (actionable):

```json
{
  "isError": true,
  "content": "Order below $50 minimum",
  "metadata": {
    "errorCategory": "business",
    "isRetryable": false,
    "customerMessage": "Discounts require orders of $50+",
    "currentTotal": 32.99
  }
}
```
Empty result reported as an error:

```json
{
  "isError": true,
  "content": "No customers found"
}
```

Claude may retry or apologize for a "failure".

Empty result reported as data:

```json
{
  "isError": false,
  "content": "0 customers matched",
  "results": []
}
```

Claude treats this as useful information.
Think of error responses like a doctor's diagnosis. "Something's wrong" isn't helpful. "Sprained ankle, no surgery needed, rest for two weeks" — now you know exactly what to do.
Who handles the error — the subagent or the coordinator? It depends on the error type.
- Subagent handles: transient failures. Retry once or twice locally; only escalate if still failing.
- Coordinator handles: errors the subagent can't resolve, workflow-affecting failures, and skip/reroute/abort decisions.
When local recovery fails, the subagent sends the coordinator everything it knows:
```json
{
  "isError": true,
  "content": "Customer API timed out after 2 retries",
  "metadata": {
    "errorCategory": "transient",
    "isRetryable": true,
    "attemptsMade": 2,
    "partialResults": null,
    "whatWasAttempted": "GET /customers/{id}"
  }
}
```
If 2 of 3 lookups succeed, send back what you got. Never discard good work because of one failure.
```
{
  "partialResults": {
    "customerProfile": { "name": "Jane", "tier": "gold" },   ✓
    "orderHistory": [{ "id": "ORD-123" }],                   ✓
    "supportTickets": null                                   ✗ failed
  },
  "failedStep": "supportTickets"
}
```
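The keep-partial-results pattern can be sketched as a gather step that records failures instead of discarding completed work (a minimal sketch; names are illustrative):

```python
def gather_lookups(lookups):
    """Run independent lookups; keep every success and record which
    steps failed rather than aborting the whole batch."""
    partial, failed = {}, []
    for name, fn in lookups.items():
        try:
            partial[name] = fn()
        except Exception:
            partial[name] = None  # mark the gap explicitly
            failed.append(name)
    return {"partialResults": partial, "failedSteps": failed}
```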
More tools means worse decisions. Each agent should only have tools relevant to its role.
Overloaded (bad):

```python
synthesis_agent = Agent(
    tools=[
        "merge_findings",
        "generate_summary",
        "web_search",   # not its job!
        "fetch_url",    # not its job!
        # ... 14 more ...
    ]
)
```

Scoped (good):

```python
synthesis_agent = Agent(
    tools=[
        "merge_findings",
        "generate_summary",
        "verify_fact",  # scoped cross-role
    ]
)
```
| Agent | Tools |
|---|---|
| research | web_search, fetch_url, extract_data |
| analysis | run_query, compute_stats, generate_chart |
| synthesis | merge_findings, generate_summary |
| coordinator | delegate_to_agent (no domain tools) |
When a specialized agent needs limited cross-role access, replace the generic tool with a constrained version:
Generic: `tools=["fetch_url"]` — the agent can fetch ANY URL. Dangerous.
Constrained: `tools=["load_document"]` — only accepts validated document URLs.
A kitchen: the pastry chef has whisks and piping bags. The sushi chef has knives and a bamboo mat. You could give both every tool — but the pastry chef doesn't need a sushi knife, and handing them one invites accidents.
Three modes that control whether and which tool Claude calls.
- `auto`: Claude decides whether to call a tool and which one. Default mode; best for open-ended tasks.
- `any`: Claude MUST call a tool, but picks which. Prevents conversational responses when action is needed.
- `tool` (forced): Claude MUST call THIS specific tool. Guarantees a step runs; used for enforcing pipeline order.
```python
# Turn 1: Force metadata extraction first
response = client.messages.create(
    tools=[extract_metadata, enrich, validate],
    tool_choice={"type": "tool", "name": "extract_metadata"}
)

# Turn 2: Now let Claude choose freely
response = client.messages.create(
    tools=[enrich, validate],
    tool_choice={"type": "auto"}
)
```
| Question | Answer |
|---|---|
| Need Claude to always act? | tool_choice: "any" |
| Specific tool must run first? | Forced, then switch to "auto" |
| Claude choosing well on its own? | tool_choice: "auto" (default) |
| Claude choosing poorly? | Fix tool descriptions first — force as last resort |
How external tools actually connect to Claude through MCP servers — configuration, scoping, and best practices.
An MCP server is a translator between Claude and external services. Tools are discovered at connection time; no manual registration is needed.
`.mcp.json` — project-level. Committed to version control and shared with the whole team. Use for Jira, GitHub, shared DBs.
`~/.claude.json` — user-level. Personal to you. Use for experiments and tools you're testing before proposing to the team.
```
// .mcp.json — safe to commit (no actual secrets)
{
  "mcpServers": {
    "github-server": {
      "command": "npx",
      "args": ["-y", "@mcp/server-github"],
      "env": {
        "GITHUB_TOKEN": "${GITHUB_TOKEN}"   // expanded from .env
      }
    }
  }
}
```
| Need | Approach |
|---|---|
| Jira / GitHub / Slack | Use existing community MCP server ✓ |
| Standard integration | Community server — don't reinvent it ✓ |
| Proprietary internal tool | Build a custom MCP server ✓ |
Claude prefers its built-in tools (like Grep) when MCP tool descriptions are vague. The fix: tell Claude explicitly when the MCP tool is better.
```
Vague — Claude ignores it, uses Grep instead:
"Searches code"

Detailed — Claude understands the advantage:
"Semantic search across the full codebase using embeddings. Finds conceptually
related code even without exact keyword matches. Use over Grep when searching
by concept."
```
Inefficient — 10 exploratory tool calls just to understand what data is available:

```
list_issues() → 50 results
get_issue(1) → read
get_issue(2) → read
...
```

Efficient — browse the catalog first, then go straight to the relevant item resource:

```
issue_summaries → scan
get_issue(47) → the one that matters
```
Six built-in tools, each with a specific sweet spot. Using the wrong one wastes time and context.
| Tool | Purpose |
|---|---|
| Grep | Search inside files — find function names, error messages, import statements |
| Glob | Search for files — match by name or extension pattern (e.g., **/*.test.tsx) |
| Read | Load a file's full contents into context |
| Write | Create or fully overwrite a file |
| Edit | Surgically modify a specific part of a file (requires unique text match) |
| Bash | Run any shell command (tests, installs, scripts) |
"Where is calculateTotal defined?" → `grep -r "calculateTotal" src/`
"Find all test files" → `glob("**/*.test.tsx")`
Edit works by matching unique text. If the target text appears multiple times, Edit fails. Fall back to Read+Write:
```python
# Edit works — "MAX_RETRIES = 3" is unique
Edit(file="config.py", old="MAX_RETRIES = 3", new="MAX_RETRIES = 5")

# Edit fails — "return None" appears 6 times
# Fallback: Read → modify programmatically → Write
content = Read("handlers.py")
updated = content.replace("return None # fallback", "return default", 1)
Write("handlers.py", updated)
```
Start narrow, expand as needed. Never read every file upfront — most will be irrelevant and waste your context window.
| I need to… | Use |
|---|---|
| Find files containing "TODO" | Grep |
| Find all .yaml config files | Glob |
| Understand what a file does | Read |
| Create a new file | Write |
| Change one line in a big file | Edit |
| Edit fails (non-unique match) | Read + Write |
| Run tests or install packages | Bash |
| Map function usage | Grep → Read → Grep |
Tasks 4.1 & 4.2 — Explicit criteria, false positive management, and few-shot prompting for consistent, reliable output.
Vague instructions like "be conservative" or "only report high-confidence findings" give Claude no actionable framework. It still has to decide what conservative means for every finding — and it will decide inconsistently. Explicit categorical criteria always outperform confidence-based filtering.
Vague: "Review this code and flag issues. Be conservative — only report high-confidence findings."
Explicit: "Flag comments only when claimed behavior contradicts actual code behavior. Do NOT flag: minor style differences, local naming conventions, or TODO comments."
False positives are poisonous — not just locally, but globally. If your code review agent has a 60% false positive rate on style suggestions, developers start ignoring everything, including accurate bug reports. High false-positive categories undermine confidence in accurate categories.
The fix for a high false-positive category isn't tweaking confidence thresholds. It's either making the criteria razor-sharp with concrete code examples, or temporarily disabling that category entirely while you improve the prompts. Restoring trust comes first.
Don't just say "flag bugs." Provide concrete examples of what constitutes a bug in your context.
Each severity level should have its own concrete examples. Abstract definitions like "high = significant impact" produce inconsistent classifications.
Telling Claude what not to flag is where most false positive reduction comes from:
If a category (e.g., "style suggestions") consistently produces noise, the pragmatic move is to disable it entirely while you refine the prompts.
Why? Because every false positive trains developers to click "dismiss" reflexively. You need to break that habit before re-introducing the category with better criteria. Trust is rebuilt incrementally.
If an exam question offers "lower the confidence threshold" vs. "add explicit categorical criteria" — always pick explicit criteria. Confidence-based filtering is the distractor pattern in this domain.
You can write the most detailed instructions imaginable — "provide location, issue, severity, and suggested fix" — and across 50 runs, you'll still get wildly different interpretations of severity, varying detail levels, and inconsistent formatting.
Instructions describe the shape of the output. Few-shot examples demonstrate the judgment.
Instructions are the rules. Few-shot examples are the case law. Rules tell you what's legal. Case law shows you how ambiguous situations were actually decided.
The "skip" example is where the real value lives. It teaches Claude what not to flag — which directly reduces false positives. Without it, Claude will over-report acceptable patterns as issues.
When extracting data from messy documents — informal measurements, varied structures, missing fields — showing Claude how to handle different document formats prevents it from fabricating values when the format doesn't match expectations.
Good few-shot examples teach Claude to generalize judgment to novel patterns, not just match the specific cases you showed. That's why showing reasoning alongside each example matters — Claude learns the decision-making process, not just the outcome.
This domain may offer distractors that collapse prompt engineering into a single API parameter (e.g., confidence_threshold: 0.9 in a constructor). Real precision comes from prompt design, not configuration flags. Same principle — different domain.
Domain 5 — six task statements covering context degradation, escalation logic, error propagation, codebase exploration, human review, and information provenance.
Context is a finite, degrading resource. Every long interaction is a fight against three forces that erode the information your agent needs to do its job.
When context grows long, the system summarizes earlier turns. Concrete transactional facts — dollar amounts, dates, order numbers — get smoothed into vague descriptions. The exact data the agent needs to resolve an issue disappears.
"Customer disputes a pricing discrepancy on a recent order."
"Charged $47.99 on March 3rd for order #8812 — website showed $39.99."
Models process information at the beginning and end of long inputs most reliably. Information buried in the middle of a large context window is more likely to be overlooked. If the customer's key complaint was stated in turn 14 of a 30-turn conversation, it's in the danger zone.
Beginning of context = high reliability. End of context = high reliability. Middle of context = degraded attention. Place critical data at the top or bottom, never buried in the middle.
A lookup_order call returns 40+ fields — shipping carrier tracking events, warehouse codes, internal SKU data, tax breakdowns. The agent only needed 5 fields. The rest now sit in context, consuming tokens and diluting signal.
Across a multi-turn session with several lookups, this compounds: 200 irrelevant fields vs. 25 relevant ones.
Context isn't just "the conversation so far." It's an engineered input that you actively curate — deciding what enters, what format it takes, and where it's positioned.
Four techniques to counter the three threats. Each targets a specific failure mode.
Extract critical transactional data into a separate structured block included in every prompt, outside of summarized history. The conversation can be condensed — the case facts block always stays intact.
```
=== CASE FACTS ===
Customer: Jane Martinez, ID #4421
Issue 1: Order #8812 — charged $47.99, expected $39.99, placed March 3
Issue 2: Order #8840 — return requested, status: delivered March 7
Escalation: None requested
=== END CASE FACTS ===
```
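Position-aware prompt assembly can be sketched as a function that pins the case facts block to the top and the live request to the end, leaving the compressible history summary in the middle (a minimal sketch; field layout is illustrative):

```python
def build_prompt(case_facts, summarized_history, current_message):
    """Assemble a prompt with protected facts at the top and the live
    request at the end — the two positions models attend to best."""
    facts = "\n".join(f"{k}: {v}" for k, v in case_facts.items())
    return (
        f"=== CASE FACTS ===\n{facts}\n=== END CASE FACTS ===\n\n"
        f"Conversation summary:\n{summarized_history}\n\n"
        f"Current message: {current_message}"
    )
```

The history string can be summarized as aggressively as needed; the facts block is rebuilt verbatim on every turn.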
For multi-issue sessions, this becomes a structured context layer — a living document that grows as new issues surface.
Structured issue data (order IDs, amounts, statuses) lives in a separate layer from conversation history. History can be summarized aggressively because the facts are protected externally.
Post-process tool results before they enter context. Keep only the fields relevant to the agent's current role. A 40-field order object becomes 5 fields of signal.
40 fields: warehouse routing codes, internal flags, tax line items, carrier tracking events, SKU metadata…
5 fields: order ID, item total, order date, status, return eligibility
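The filtering step can be sketched as a simple allow-list applied to the raw tool result before it enters context (field names are illustrative):

```python
# Fields the agent's current role actually needs (illustrative)
RELEVANT_FIELDS = ["order_id", "item_total", "order_date", "status", "return_eligible"]

def filter_tool_result(raw_result, keep=RELEVANT_FIELDS):
    """Strip a verbose tool result down to the fields relevant to the
    agent's role; everything else never enters context."""
    return {k: raw_result[k] for k in keep if k in raw_result}
```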
Counter the lost-in-the-middle effect with deliberate placement. Key findings summaries go at the beginning of aggregated inputs. Detailed results use explicit section headers. The case facts block sits at the top of the prompt.
Upstream subagents return structured data (key facts, citations, relevance scores) instead of verbose content and reasoning chains. Downstream agents with limited context budgets receive clean, compact inputs — not paragraphs of reasoning to parse.
Subagents must include metadata in structured outputs: dates, source locations, and methodological context. This isn't optional — it supports accurate downstream synthesis (see Task 5.6).
Every technique here is about the same thing — being intentional about what enters context, what format it's in, and where it's positioned.
Escalation isn't about difficulty — it's about authority boundaries. The curriculum defines three specific triggers.
| Trigger | When It Fires | Key Nuance |
|---|---|---|
| Customer requests human | Explicit demand for a person | Non-negotiable hard rule — honor immediately |
| Policy exception / gap | Policy is silent or ambiguous on the request | Don't guess by analogy — silence = gap = escalate |
| No meaningful progress | Agent has tried and is genuinely stuck | Not "this is complex" — but "I cannot move forward" |
Complexity alone. A case can be complex but fully within the agent's capability. Escalating because something has multiple steps wastes human capacity.
Soft request: Customer says "I want to talk to a manager." The issue is a simple return. You get one offer to resolve. If they reiterate, escalate immediately. No second attempts.
Hard request: Customer says "Get me a human NOW." Escalate immediately. No investigation, no "let me just check." Emphatic demands are honored without delay.
When a customer explicitly demands a human, this is a non-negotiable hard rule. Always honor it immediately. This is NOT conditional on the agent's assessment of the case.
"If customer sounds angry, escalate." Angry customers may have simple issues. Calm customers may have impossible ones. Tone ≠ complexity.
"If model confidence < 0.7, escalate." Claude isn't well-calibrated at rating its own certainty. High confidence can coexist with wrong answers.
Your policy covers price matching against your own site's pricing errors. Customer asks for a competitor price match. The policy says nothing about competitors. The agent should not apply existing policy by analogy — silence in the policy is a gap, and gaps belong to humans.
Agent searches and finds two "Jane Martinez" accounts. The correct response is to ask for an additional identifier (zip code, email), not to pick one based on heuristics like "most recent order."
When uncertain — whether about policy, customer identity, or case facts — ask for clarification rather than making a heuristic selection. This applies to ambiguous tool results, ambiguous customer requests, and ambiguous policy interpretation.
Escalate when you lack the authority (policy gaps), the ability (no progress), or the permission (customer demands it). Everything else, resolve.
When a subagent fails, the coordinator needs enough context to make a smart decision — retry, fall back, proceed with partial results, or surface the gap to the user.
Inventory subagent times out → returns empty result as if it succeeded → coordinator tells customer "item out of stock" → but the item isn't out of stock, the system just didn't respond. False information delivered.
Inventory subagent times out → entire workflow halts → order lookup and return check both succeeded but their results are discarded → customer gets "something went wrong." Useful work wasted.
This is the foundational distinction. A subagent gets nothing back — but why?
| Type | What Happened | Coordinator Action |
|---|---|---|
| Access failure | Timeout, 500 error, connection dropped — query never completed | Retry, use fallback, inform customer of temp issue |
| Valid empty result | Query succeeded normally, no matching records exist | Confidently tell customer no record exists |
If the subagent returns "no results" for both cases, the coordinator can't distinguish them. Generic error statuses like "service unavailable" strip away decision-critical context.
```json
{
  "status": "error",
  "failure_type": "timeout",
  "what_was_attempted": "inventory check for SKU #4419",
  "partial_results": null,
  "retry_recommended": true,
  "alternatives": "check warehouse B system directly"
}
```
Subagents handle transient failures locally (retry once or twice). Only propagate errors they cannot resolve — including what was already attempted and any partial results. This prevents redundant coordinator retries.
When the coordinator assembles a response and one subagent failed, the output flags the gap explicitly: "Your order #8812 is eligible for return. Note: I wasn't able to check current inventory for a replacement — I can check again or connect you with a team member."
On failure, return structured error objects with partial results and coverage annotations — not silent failures. This is a recurring principle across the certification: always preserve work that already succeeded.
Errors are data, not just failures. A well-designed system treats every error as a decision point for the coordinator — with enough structured context to decide intelligently.
Same degradation problems, different domain. Claude Code explores a massive codebase — thousands of files, complex dependency chains. The context window fills with verbose discovery output, and early findings get lost.
Early in the session, the agent references specific classes it discovered. Later, it starts talking about "typical patterns" instead — generic training knowledge replacing its own findings. The context filled up and the discoveries were pushed out.
The agent maintains a file on disk recording key findings as it goes. When it needs to answer a question later, it reads the scratchpad first. Findings are on disk — not dependent on surviving in the context window.
```
=== SCRATCHPAD: Project Architecture ===
Entry point: src/main/App.java
Refund logic: src/services/RefundCalculator.java
Dependencies: PricingEngine, TaxService, InventoryClient
Tests: tests/services/RefundCalculatorTest.java (47 cases)
Config: refund thresholds in config/policies.yaml
```
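The scratchpad pattern reduces to two operations: append a finding when it's discovered, and read the file back before answering. A minimal sketch (the file path is an assumption, not a fixed convention):

```python
from pathlib import Path

def record_finding(scratchpad: Path, finding: str) -> None:
    """Append a discovery to disk so it survives context-window eviction."""
    with scratchpad.open("a") as f:
        f.write(finding + "\n")

def recall_findings(scratchpad: Path) -> str:
    """Read the scratchpad first when answering, instead of trusting context."""
    return scratchpad.read_text() if scratchpad.exists() else ""
```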
Instead of the main agent grepping through hundreds of files (filling its own context), it spawns subagents with isolated context windows and receives only their condensed findings.
Summarize findings from Phase 1 before spawning subagents for Phase 2. Inject those summaries into the new subagents' initial context. They start with knowledge, not from scratch.
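Because subagent context is isolated, the injection has to be explicit, typically by building the Phase 2 prompt from the Phase 1 summaries. A sketch with illustrative wording:

```python
def build_phase2_prompt(phase1_summaries: list[str], task: str) -> str:
    """Build a subagent prompt that starts with Phase 1 knowledge.

    Subagents inherit nothing, so prior findings are passed in the prompt itself.
    """
    summary_block = "\n".join(f"- {s}" for s in phase1_summaries)
    return (
        "Known findings from Phase 1:\n"
        f"{summary_block}\n\n"
        f"Your task: {task}"
    )
```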
Each agent periodically exports its state to a known file location. A coordinator-level manifest tracks completion status across all agents. On resume, the coordinator loads the manifest and injects it into the restarted agent's prompt.
```
=== RECOVERY MANIFEST ===
Phase: dependency_mapping
Completed: [architecture_scan, test_discovery]
In Progress: refund_flow_trace (agent_3)
Pending: [api_endpoint_mapping, config_analysis]
Findings: /scratchpad/findings.md
```
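On resume, the coordinator's job is mechanical: skip completed phases, restart anything that was in flight, and point the restarted agents at the saved findings. A sketch that assumes the manifest is stored as JSON with the fields above (the exact serialization is an assumption):

```python
import json
from pathlib import Path

def plan_resume(manifest_path: Path) -> dict:
    """Load the recovery manifest and decide what to restart."""
    manifest = json.loads(manifest_path.read_text())
    # Work in progress at crash time is restarted alongside pending work
    return {
        "skip": manifest["completed"],        # never redo finished phases
        "run": manifest["in_progress"] + manifest["pending"],
        "inject": manifest["findings_file"],  # prior findings go into the prompt
    }
```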
A practical Claude Code command. When context fills with verbose discovery output, /compact compresses it — reducing context usage so the session can continue. A pragmatic escape valve, not a replacement for the other techniques.
Scratchpads, subagent isolation, phase summaries, and manifests are all ways of externalizing knowledge so it doesn't depend on surviving inside the context window.
At scale, you need to know when to trust the model's output and when to route it to a human reviewer. Aggregate accuracy metrics can mask critical failures.
97% overall accuracy sounds great. But: 99.5% on standard invoices (80% of volume), 82% on handwritten purchase orders, 74% on international documents. Automating based on the aggregate number lets 1 in 4 international documents through with errors.
You must validate accuracy by document type and field segment — not just overall. A single number can't tell you where the system is reliable and where it isn't.
Pull samples from each document type and field category, including high-confidence extractions. This catches calibration drift (model becomes overconfident on new formats) and detects novel error patterns not in the original validation set. This is ongoing, not one-time.
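Per-segment validation is a small computation; the danger is skipping it. A sketch, where the segment labels are illustrative:

```python
def stratified_accuracy(samples):
    """Accuracy per segment, not just the misleading aggregate.

    `samples` is an iterable of (segment, correct: bool) pairs.
    """
    by_segment = {}
    for segment, correct in samples:
        hits, total = by_segment.get(segment, (0, 0))
        by_segment[segment] = (hits + correct, total + 1)
    return {seg: hits / total for seg, (hits, total) in by_segment.items()}
```

Run this on the labeled validation set and the 97% aggregate decomposes into the per-type numbers that actually drive routing decisions.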
Instead of one score per document, the model outputs a score per field. A low score on "total amount" routes just that field to human review, even if the rest looks fine.
Confidence thresholds must be calibrated using labeled validation sets. You don't pick 0.85 as a cutoff because it "feels right." You measure where the model's scores actually correlate with accuracy and set thresholds from that empirical evidence.
| Condition | Route |
|---|---|
| Low model confidence | → Human review |
| Ambiguous / contradictory source | → Human review |
| High confidence, calibrated segment | → Automated + stratified sampling safety net |
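The routing logic itself stays simple once the threshold is calibrated. A per-field sketch, where the threshold value is assumed to come from a labeled validation set rather than being hand-picked:

```python
def route_field(field_confidence: float, is_ambiguous: bool, threshold: float) -> str:
    """Route one extracted field using an empirically calibrated threshold."""
    if is_ambiguous or field_confidence < threshold:
        return "human_review"
    return "automated"  # still covered by stratified sampling downstream
```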
Without calibration, a confidence score is just a number. With calibration, it's a routing decision. Aggregate metrics can lie — stratify by segment to find the truth.
When multiple subagents gather information from different sources, the synthesis agent must combine it without losing track of where things came from.
Subagent A: market size is $4.2B (industry report, 2023). Subagent B: market size is $3.8B (government dataset, 2021). Synthesis agent writes: "The market size is approximately $4 billion."
Three failures: source attribution gone, temporal context gone, conflict silently resolved by averaging.
Each subagent returns findings tied to their source — URL, document name, publication date, methodology. The synthesis agent must preserve these mappings through the final output.
```json
{
  "claim": "Market size is $4.2B",
  "source": "Gartner Industry Report 2023",
  "source_url": "https://...",
  "publication_date": "2023-06-15",
  "methodology": "Survey of 500 enterprises"
}
```
When credible sources disagree, include both values with explicit annotation. The subagent does not pick a winner. The coordinator or end user decides how to reconcile. Conflicts are preserved, never hidden.
The $3.8B (2021) and $4.2B (2023) figures might not conflict at all — they could reflect market growth. Without publication dates in the structured output, the synthesis agent treats them as contradictory claims about the same point in time. Requiring dates in subagent outputs prevents this specific class of misinterpretation.
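A synthesis step that honors these rules refuses to average. A sketch, assuming claims arrive as dicts shaped like the structured output above:

```python
def merge_claims(claims):
    """Synthesize claims without averaging away conflicts.

    If values differ, both are kept with source and date so the
    coordinator or end user can reconcile them.
    """
    values = {c["claim"] for c in claims}
    if len(values) == 1:
        return {"resolved": claims[0]["claim"], "conflict": False}
    return {
        "conflict": True,
        "candidates": [
            {"claim": c["claim"], "source": c["source"], "date": c["publication_date"]}
            for c in claims
        ],
    }
```

Applied to the market-size example, this surfaces both the $4.2B (2023) and $3.8B (2021) figures with their dates, which is exactly what lets a reader notice they may describe growth rather than disagreement.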
The synthesis output should match the nature of the content: financial data as tables, news as prose, technical findings as structured lists. Don't flatten everything into one uniform format — that loses information and readability.
Structure reports with explicit sections distinguishing well-established findings from contested ones. Preserve original source characterizations and methodological context. The reader should always know what's settled and what's debated.
Provenance is the chain of custody for every fact. If you can't trace a claim back to its source, with its date and methodology, the synthesis is unreliable — no matter how polished the final output looks.