Domain 1 — seven core concepts from agentic loops to session forking, distilled for exam readiness.
The agentic loop is a while loop where Claude decides when to stop — not your code. You send a message, Claude responds, and you check stop_reason to decide what happens next.
If stop_reason == "tool_use" → execute the tool, append the result, loop again. If stop_reason == "end_turn" → Claude is done, exit.
```python
messages = [{"role": "user", "content": user_query}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-5-20250514",
        max_tokens=4096,  # required by the Messages API
        tools=my_tools,
        messages=messages
    )
    # Append Claude's response to conversation history
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason == "end_turn":
        break
    # Execute tools, append results as a user message
    tool_results = execute_tools(response.content)
    messages.append({"role": "user", "content": tool_results})
```
Tool results are appended as a user message. This is how Claude "sees" what happened — it's all conversation history.
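The `execute_tools` step can be sketched as a small helper that runs each `tool_use` block and packages the outputs as `tool_result` blocks. A minimal sketch, assuming dict-shaped content blocks and an explicitly passed tool registry (the real SDK returns typed objects, not dicts):

```python
def execute_tools(content_blocks, tool_registry):
    """Run each tool_use block and package outputs as tool_result blocks."""
    results = []
    for block in content_blocks:
        if block["type"] != "tool_use":
            continue  # skip plain text blocks
        handler = tool_registry[block["name"]]
        output = handler(**block["input"])
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],  # must match the tool_use block's id
            "content": str(output),
        })
    return results
```

The `tool_use_id` linkage is what lets Claude match each result back to the call it made.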
- Parsing Claude's text for "I'm done" to decide loop termination. Use stop_reason, not natural-language signals.
- Hardcoding iteration caps as the primary stop mechanism. A safety cap is fine, but it shouldn't be the design.
- Checking for assistant text content as a completion indicator. Claude can emit text and tool calls in the same response.
Always use stop_reason as the single source of truth for loop control. Model-driven termination is the correct pattern.
Claude is the reasoning engine. Your code is the executor. The loop continues until Claude says it's done via stop_reason.
Instead of one agent doing everything, a coordinator delegates to subagents in a hub-and-spoke pattern. The coordinator is the project manager — it never does the actual work itself.
Analyze the request → Decompose into subtasks → Delegate to subagents → Aggregate results → Evaluate completeness → loop if gaps remain.
Subagents have isolated context. They do NOT inherit the coordinator's conversation, other subagents' findings, or the user's original query. You must pass everything explicitly.
Anti-pattern: every query runs through ALL subagents in sequence regardless of complexity. "What are your hours?" hits search → analysis → synthesis.
Better: the coordinator analyzes the query and picks only the subagents needed. Simple lookups skip analysis; complex queries use the full pipeline.
After synthesis, the coordinator asks: "Is this complete?" If not, it re-delegates with targeted queries and re-invokes synthesis. This iterative refinement is what separates a good system from a basic one.
Over-decomposition trap: Slicing a topic too narrowly means each subagent covers a tiny slice well but misses the connections between them. Decompose at the right granularity.
The coordinator is the brain. Subagents are the hands. All communication flows through the coordinator — giving you observability, error handling, and control.
The hands-on mechanics of spawning subagents, passing context, and configuring agent definitions.
Subagents are spawned via the Task tool. The coordinator's allowedTools must include "Task" — otherwise it cannot spawn subagents. Each Task call creates a fresh agent with its own context window.
```python
{
    "description": "Searches the web for current information",
    "system_prompt": "You are a focused research agent...",
    "allowed_tools": ["web_search", "web_fetch"]
}
```
Each subagent gets constrained tools. The description field is how the coordinator decides which subagent to invoke — it's decision logic, not documentation.
Bad, no context: `Task("search_agent", "Find more about the topic")` — the subagent has no idea what "the topic" is.
Good, explicit context: `Task("search_agent", "Find recent research on lithium-ion battery recycling. Prior analysis found: [hydro, pyro]. Focus on newer methods.")`
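Because subagents inherit nothing, the prompt itself has to carry everything. A hypothetical helper for assembling context-complete subagent prompts (function and parameter names are illustrative):

```python
def build_subagent_prompt(task, prior_findings=None, focus=None):
    """Assemble a self-contained prompt for an isolated subagent.

    The subagent sees nothing but this string, so the task, any prior
    findings, and the focus area all go in explicitly."""
    parts = [f"Task: {task}"]
    if prior_findings:
        parts.append("Prior findings: " + "; ".join(prior_findings))
    if focus:
        parts.append(f"Focus: {focus}")
    return "\n".join(parts)
```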
```json
{
  "findings": [{
    "content": "Recycling rates improved 30%",
    "source_url": "https://example.com/study",
    "document": "2024 Battery Report",
    "page": 14
  }],
  "gaps_identified": ["Cost comparison data missing"]
}
```
Separate content from metadata so downstream agents can write summaries and provide attribution.
Emit multiple Task calls in a single coordinator response to run subagents concurrently. Massively faster than sequential delegation when subtasks are independent.
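Claude Code handles parallel Task calls natively. When orchestrating subagents yourself over the API, independent subtasks can be dispatched concurrently. A minimal sketch using a thread pool, where `run_subagent` is an assumed callable that invokes one subagent and returns its result:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagents_parallel(subtasks, run_subagent):
    """Dispatch independent subtasks concurrently.

    Results come back in input order, so aggregation stays simple."""
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        return list(pool.map(run_subagent, subtasks))
```

Only use this when subtasks are genuinely independent; sequential delegation is correct when one subtask's output feeds the next.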
Procedural (weaker): "Step 1: Call search agent. Step 2: Call analysis. Step 3: Call synthesis."
Outcome-oriented (better): "Produce a research report with 5+ credible sources covering X, Y, Z. Delegate to specialists as needed. Verify coverage before delivering."
Context doesn't flow automatically — you pump it explicitly. Use structured formats to preserve attribution. Spawn in parallel when independent. Design coordinator prompts around outcomes, not procedures.
Where theory meets production. Real workflows have rules that cannot be broken and sometimes require escalation to humans.
Prompts are suggestions; hooks are guarantees. If you'd get fired for the 1% failure, use a hook. If it's a preference, a prompt is fine.
```python
def pre_tool_hook(tool_name, tool_input):
    if tool_name == "process_refund":
        if not workflow_state["customer_verified"]:
            return BLOCK("Cannot process refund: customer not verified")
        if not workflow_state["order_retrieved"]:
            return BLOCK("Cannot process refund: order not retrieved")
    return ALLOW
```
Claude doesn't even know this gating exists. It tries to call the tool, gets blocked with an explanation, and naturally adjusts within the agentic loop.
Customer says: "Wrong item, promo code didn't work, and update my address." — that's three issues. A good agent decomposes them, investigates all three in parallel using shared context, then synthesizes one unified response.
```json
{
  "customer_id": "CUST-4521",
  "issue_summary": "Defective product, requests full refund",
  "root_cause": "Shipped from damaged batch #B-229",
  "recommended_action": "Full refund $189.99 + 15% credit",
  "reason_for_escalation": "Exceeds $100 agent authority"
}
```
Escalation isn't failure — it's a designed part of the system. The agent's job is to minimize the work the human has to do by handing off a complete picture.
Programmatic gates for rules that can't fail. Parallel investigation for multi-concern requests. Structured summaries for human handoffs.
Hooks sit between Claude and the outside world, intercepting in two directions.
Different backends return dates as Unix epochs, ISO 8601, or human strings. Claude usually handles this, but sometimes makes comparison errors. A PostToolUse hook fixes this deterministically.
```python
def post_tool_normalize(tool_name, tool_result):
    normalized = tool_result.copy()
    # Normalize all dates to ISO 8601
    for key in ["created_at", "ship_date", "due_date"]:
        if key in normalized:
            normalized[key] = to_iso8601(normalized[key])
    # Normalize status codes to readable strings
    if "status" in normalized and isinstance(normalized["status"], int):
        normalized["status"] = status_map.get(normalized["status"])
    return normalized
```
Claude should reason about your domain, not about data format inconsistencies. Normalize field names, null handling, error formats, and date representations before Claude ever sees them.
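The `to_iso8601` helper referenced above is left undefined in the snippet. One possible sketch, with an illustrative (not exhaustive) list of input formats; unknown formats pass through unchanged rather than raising:

```python
from datetime import datetime, timezone

def to_iso8601(value):
    """Normalize Unix epochs, ISO date strings, and common human
    formats to an ISO 8601 date string (YYYY-MM-DD)."""
    if isinstance(value, (int, float)):  # Unix epoch seconds
        return datetime.fromtimestamp(value, tz=timezone.utc).strftime("%Y-%m-%d")
    for fmt in ("%Y-%m-%d", "%B %d, %Y", "%m/%d/%Y"):  # illustrative format list
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value  # unknown format: pass through unchanged
```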
```python
def pre_tool_policy_check(tool_name, tool_input):
    if tool_name == "process_refund":
        amount = float(tool_input.get("amount", 0))
        if amount > 500.00:
            return {
                "action": "BLOCK",
                "message": "Exceeds $500 limit. Escalate to human."
            }
    return {"action": "ALLOW"}
```
The block message is instructive — it tells Claude what to do instead. Claude receives this as the tool result and naturally adjusts.
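Wiring a pre-tool hook into the executor can be sketched as follows. The hook is passed in explicitly, and a blocked call is returned as an error-style tool result rather than raised, so Claude sees the message and adjusts (function names and the result shape are illustrative):

```python
def run_tool_with_hooks(tool_name, tool_input, tool_registry, pre_hook):
    """Gate a tool call through a pre-hook before executing it.

    A BLOCK decision becomes an error tool result containing the
    instructive message; an ALLOW decision runs the tool normally."""
    decision = pre_hook(tool_name, tool_input)
    if decision["action"] == "BLOCK":
        # Claude receives this as the tool result and naturally adjusts
        return {"is_error": True, "content": decision["message"]}
    output = tool_registry[tool_name](**tool_input)
    return {"is_error": False, "content": output}
```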
| Requirement | Hook | Prompt |
|---|---|---|
| Refund limits | ✓ | ✗ |
| Data format consistency | ✓ | ✗ |
| Tone of voice | ✗ | ✓ |
| PII redaction before logging | ✓ | ✗ |
| Suggested upsells | ✗ | ✓ |
| Blocking admin tools | ✓ | ✗ |
| Preferred resolution order | ✗ | ✓ |
Post-hooks clean incoming data so Claude reasons clearly. Pre-hooks enforce rules so Claude can't break policy. Together they separate what Claude thinks about from what it's allowed to do.
How to break big problems into small ones — and knowing which pattern fits which situation.
Prompt chaining: steps are known in advance. Each feeds into the next. Best for predictable workflows like code review, report generation, multi-language translation.
Pass 1 (per-file): Each file gets focused individual analysis — full attention, no dilution. Pass 2 (cross-file): Integration pass to catch systemic patterns spanning multiple files. You'd miss things doing either pass alone.
Dynamic decomposition: steps emerge as you go. What you discover in step 1 determines what step 2 even is. Best for investigative tasks like debugging, legacy test coverage, open-ended research.
| Question | Yes → | No → |
|---|---|---|
| Do I know all steps before starting? | Prompt chaining | Dynamic |
| Does step N change what step N+1 is? | Dynamic | Prompt chaining |
| Task | Pattern |
|---|---|
| Summarize a doc in 3 languages | Prompt chaining |
| Debug why the app is slow | Dynamic decomposition |
| Review a PR against style guidelines | Prompt chaining |
| Migrate REST to GraphQL | Dynamic decomposition |
| Monthly report from 5 data sources | Prompt chaining |
| Figure out why customers churn | Dynamic decomposition |
In practice, many workflows use a fixed outer pipeline where individual steps use dynamic decomposition internally. The coordinator runs the pipeline; subagents adapt within each step.
Prompt chaining when predictable. Dynamic decomposition when investigative. Split big reviews into per-unit passes + cross-unit integration. Give agents permission to adapt via goal-oriented prompts.
How agents maintain continuity across time and explore branching paths.
```bash
# Start a named session
claude --session "auth-refactor" "Analyze the auth module for security issues"

# Resume later
claude --resume "auth-refactor" "Now suggest fixes for the top 3 issues"
```
If you edited files between sessions, the agent's context contains old tool results. It doesn't know things changed. Either tell it exactly what changed (small edits) or start fresh with an injected summary (extensive changes).
"I updated auth.py lines 45-60 and fixed the token bug in middleware.py. Re-analyze these two files." — good for small, targeted changes.
Start a new session seeded with distilled findings, status, and a list of changes. More reliable when files have changed extensively.
Fork use cases: comparing two testing strategies; trying two refactoring approaches before committing; exploring optimistic vs. conservative solutions; A/B testing prompt strategies for a subagent.
| Situation | Strategy |
|---|---|
| Files mostly unchanged | Resume the session |
| Files heavily modified | Fresh session + injected summary |
| Want to explore 2+ approaches | Fork from shared baseline |
Resume when context is fresh. Fork when exploring alternatives. Start fresh with injected summaries when the world has changed. The judgment call is always about context freshness.
Domain 2 — covering tool descriptions, error handling, tool distribution, tool_choice, MCP servers, and built-in tools.
Claude doesn't see your code. It reads tool descriptions to decide which tool to use. When descriptions are vague or overlapping, Claude guesses — and guesses wrong.
Vague, overlapping (bad):
- `analyze_content` — "Analyzes content and returns results"
- `analyze_document` — "Analyzes a document and provides analysis"

Specific (good):
- `extract_web_results` — "Extracts structured data from search result pages. Input: raw HTML. Output: JSON array of {title, url, snippet}. Use instead of summarize_content for search engine output."
Five ingredients separate great descriptions from bad ones:
1. Purpose — one clear sentence on what the tool does.
2. Input — what does Claude need to feed this tool?
3. Output — what comes back?
4. Edge cases — what happens with weird or empty input?
5. Boundaries — when to use this tool instead of another.
```
verify_claim_against_source
PURPOSE:  "Checks whether a factual claim is supported by a given source document."
INPUT:    "JSON with 'claim' (string, one sentence) and 'source_text' (string, full text)."
OUTPUT:   "JSON with 'supported' (bool), 'confidence' (0.0-1.0), and 'relevant_excerpt' (string)."
EDGE:     "Returns supported=false, confidence=0.0 if source is empty or claim not addressed."
BOUNDARY: "Use this for fact-checking a single claim. Use summarize_content for overviews. Use extract_data_points for structured data."
```
Bad — `analyze_document`: does everything vaguely; Claude has to guess which behavior you want.
Good — split by job: `extract_data_points` → structured fields; `summarize_content` → overviews; `verify_claim` → fact-checking.
Hiring three specialists (plumber, electrician, painter) instead of one generalist. Each knows their lane — results are better across the board.
Perfect tool descriptions can still fail if your system prompt contains keywords that accidentally create tool associations.
```python
# System prompt contains: "Always analyze content before responding."
# Tool name happens to be: analyze_content
# Result: Claude gravitates to this tool regardless of context,
# because the keyword match is strongest.
```
Find words like "analyze," "search," "check," "extract," "look up" in your system prompt. These are the magnets.
If any of those verbs match a tool name or description, you have a potential collision. Reword the system prompt.
Collision: system prompt says "Always analyze content carefully..." and a tool is named analyze_content.
No collision: system prompt says "Be thorough when processing requests..." alongside a tool named extract_web_results.
When a tool fails, does it give Claude enough information to recover intelligently? Generic errors lead to generic (bad) recovery.
- Transient: server timeout, temporary unavailability. Retryable.
- Input: malformed input, wrong data type. Not retryable without fixing the input.
- Business: policy violation, rule blocked. Not retryable.
- Permission: no access to this resource. Not retryable without escalation.
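A small factory for category-tagged error results, following the taxonomy above (the retryability mapping restates the categories as described; function and field names mirror the examples below but are otherwise illustrative):

```python
# Retryability per category, as the taxonomy above describes it
RETRYABLE = {
    "transient": True,    # timeouts, temporary unavailability
    "input": False,       # must fix the input first
    "business": False,    # rule blocked the action
    "permission": False,  # needs escalation, not retries
}

def make_error(category, message, **metadata):
    """Build a structured error result tagged with its category
    and retryability, plus any extra context for recovery."""
    return {
        "isError": True,
        "content": message,
        "metadata": {
            "errorCategory": category,
            "isRetryable": RETRYABLE[category],
            **metadata,
        },
    }
```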
Generic error (unhelpful):

```json
{
  "isError": true,
  "content": "Operation failed"
}
```

Structured error (actionable):

```json
{
  "isError": true,
  "content": "Order below $50 minimum",
  "metadata": {
    "errorCategory": "business",
    "isRetryable": false,
    "customerMessage": "Discounts require orders of $50+",
    "currentTotal": 32.99
  }
}
```
Empty result reported as an error:

```json
{
  "isError": true,
  "content": "No customers found"
}
```

Claude may retry or apologize for a "failure".

Empty result reported as data:

```json
{
  "isError": false,
  "content": "0 customers matched",
  "results": []
}
```

Claude treats this as useful information.
Think of error responses like a doctor's diagnosis. "Something's wrong" isn't helpful. "Sprained ankle, no surgery needed, rest for two weeks" — now you know exactly what to do.
Who handles the error — the subagent or the coordinator? It depends on the error type.
- Subagent handles: transient failures. Retry once or twice locally; only escalate if still failing.
- Coordinator handles: errors the subagent can't resolve, workflow-affecting failures, and skip/reroute/abort decisions.
When local recovery fails, the subagent sends the coordinator everything it knows:
```json
{
  "isError": true,
  "content": "Customer API timed out after 2 retries",
  "metadata": {
    "errorCategory": "transient",
    "isRetryable": true,
    "attemptsMade": 2,
    "partialResults": null,
    "whatWasAttempted": "GET /customers/{id}"
  }
}
```
If 2 of 3 lookups succeed, send back what you got. Never discard good work because of one failure.
```
{
  "partialResults": {
    "customerProfile": { "name": "Jane", "tier": "gold" },   ✓
    "orderHistory": [{ "id": "ORD-123" }],                   ✓
    "supportTickets": null                                   ✗ failed
  },
  "failedStep": "supportTickets"
}
```
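The keep-partial-results pattern can be sketched as a gather step that records failures instead of discarding completed work (a minimal sketch; names are illustrative):

```python
def gather_lookups(lookups):
    """Run independent lookups; keep every success and record which
    steps failed rather than aborting the whole batch."""
    partial, failed = {}, []
    for name, fn in lookups.items():
        try:
            partial[name] = fn()
        except Exception:
            partial[name] = None  # mark the gap explicitly
            failed.append(name)
    return {"partialResults": partial, "failedSteps": failed}
```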
More tools means worse decisions. Each agent should only have tools relevant to its role.
Overloaded (bad):

```python
synthesis_agent = Agent(
    tools=[
        "merge_findings",
        "generate_summary",
        "web_search",   # not its job!
        "fetch_url",    # not its job!
        # ... 14 more ...
    ]
)
```

Scoped (good):

```python
synthesis_agent = Agent(
    tools=[
        "merge_findings",
        "generate_summary",
        "verify_fact",  # scoped cross-role
    ]
)
```
| Agent | Tools |
|---|---|
| research | web_search, fetch_url, extract_data |
| analysis | run_query, compute_stats, generate_chart |
| synthesis | merge_findings, generate_summary |
| coordinator | delegate_to_agent (no domain tools) |
When a specialized agent needs limited cross-role access, replace the generic tool with a constrained version:
Generic: `tools=["fetch_url"]` — the agent can fetch ANY URL. Dangerous.
Constrained: `tools=["load_document"]` — only accepts validated document URLs.
A kitchen: the pastry chef has whisks and piping bags. The sushi chef has knives and a bamboo mat. You could give both every tool — but the pastry chef doesn't need a sushi knife, and handing them one invites accidents.
Three modes that control whether and which tool Claude calls.
- `auto`: Claude decides whether to call a tool and which one. Default mode; best for open-ended tasks.
- `any`: Claude MUST call a tool, but picks which. Prevents conversational responses when action is needed.
- `tool` (forced): Claude MUST call THIS specific tool. Guarantees a step runs; used for enforcing pipeline order.
```python
# Turn 1: Force metadata extraction first
response = client.messages.create(
    tools=[extract_metadata, enrich, validate],
    tool_choice={"type": "tool", "name": "extract_metadata"}
)

# Turn 2: Now let Claude choose freely
response = client.messages.create(
    tools=[enrich, validate],
    tool_choice={"type": "auto"}
)
```
| Question | Answer |
|---|---|
| Need Claude to always act? | tool_choice: "any" |
| Specific tool must run first? | Forced, then switch to "auto" |
| Claude choosing well on its own? | tool_choice: "auto" (default) |
| Claude choosing poorly? | Fix tool descriptions first — force as last resort |
How external tools actually connect to Claude through MCP servers — configuration, scoping, and best practices.
An MCP server is a translator between Claude and external services. Tools are discovered at connection time; no manual registration is needed.
`.mcp.json` — project-level. Committed to version control and shared with the whole team. Use for Jira, GitHub, shared DBs.
`~/.claude.json` — user-level. Personal to you. Use for experiments and tools you're testing before proposing to the team.
```
// .mcp.json — safe to commit (no actual secrets)
{
  "mcpServers": {
    "github-server": {
      "command": "npx",
      "args": ["-y", "@mcp/server-github"],
      "env": {
        "GITHUB_TOKEN": "${GITHUB_TOKEN}"   // expanded from .env
      }
    }
  }
}
```
| Need | Approach |
|---|---|
| Jira / GitHub / Slack | Use existing community MCP server ✓ |
| Standard integration | Community server — don't reinvent it ✓ |
| Proprietary internal tool | Build a custom MCP server ✓ |
Claude prefers its built-in tools (like Grep) when MCP tool descriptions are vague. The fix: tell Claude explicitly when the MCP tool is better.
```
Vague — Claude ignores it, uses Grep instead:
"Searches code"

Detailed — Claude understands the advantage:
"Semantic search across the full codebase using embeddings. Finds conceptually
related code even without exact keyword matches. Use over Grep when searching
by concept."
```
Inefficient — 10 exploratory tool calls just to understand what data is available:

```
list_issues() → 50 results
get_issue(1) → read
get_issue(2) → read
...
```

Efficient — browse the catalog first, then go straight to the relevant item resource:

```
issue_summaries → scan
get_issue(47) → the one that matters
```
Six built-in tools, each with a specific sweet spot. Using the wrong one wastes time and context.
| Tool | Purpose |
|---|---|
| Grep | Search inside files — find function names, error messages, import statements |
| Glob | Search for files — match by name or extension pattern (e.g., **/*.test.tsx) |
| Read | Load a file's full contents into context |
| Write | Create or fully overwrite a file |
| Edit | Surgically modify a specific part of a file (requires unique text match) |
| Bash | Run any shell command (tests, installs, scripts) |
"Where is calculateTotal defined?" → `grep -r "calculateTotal" src/`
"Find all test files" → `glob("**/*.test.tsx")`
Edit works by matching unique text. If the target text appears multiple times, Edit fails. Fall back to Read+Write:
```python
# Edit works — "MAX_RETRIES = 3" is unique
Edit(file="config.py", old="MAX_RETRIES = 3", new="MAX_RETRIES = 5")

# Edit fails — "return None" appears 6 times
# Fallback: Read → modify programmatically → Write
content = Read("handlers.py")
updated = content.replace("return None # fallback", "return default", 1)
Write("handlers.py", updated)
```
Start narrow, expand as needed. Never read every file upfront — most will be irrelevant and waste your context window.
| I need to… | Use |
|---|---|
| Find files containing "TODO" | Grep |
| Find all .yaml config files | Glob |
| Understand what a file does | Read |
| Create a new file | Write |
| Change one line in a big file | Edit |
| Edit fails (non-unique match) | Read + Write |
| Run tests or install packages | Bash |
| Map function usage | Grep → Read → Grep |
Tasks 4.1 & 4.2 — Explicit criteria, false positive management, and few-shot prompting for consistent, reliable output.
Vague instructions like "be conservative" or "only report high-confidence findings" give Claude no actionable framework. It still has to decide what conservative means for every finding — and it will decide inconsistently. Explicit categorical criteria always outperform confidence-based filtering.
Vague: "Review this code and flag issues. Be conservative — only report high-confidence findings."
Explicit: "Flag comments only when claimed behavior contradicts actual code behavior. Do NOT flag: minor style differences, local naming conventions, or TODO comments."
False positives are poisonous — not just locally, but globally. If your code review agent has a 60% false positive rate on style suggestions, developers start ignoring everything, including accurate bug reports. High false-positive categories undermine confidence in accurate categories.
The fix for a high false-positive category isn't tweaking confidence thresholds. It's either making the criteria razor-sharp with concrete code examples, or temporarily disabling that category entirely while you improve the prompts. Restoring trust comes first.
Don't just say "flag bugs." Provide concrete examples of what constitutes a bug in your context.
Each severity level should have its own concrete examples. Abstract definitions like "high = significant impact" produce inconsistent classifications.
Telling Claude what not to flag is where most false positive reduction comes from:
If a category (e.g., "style suggestions") consistently produces noise, the pragmatic move is to disable it entirely while you refine the prompts.
Why? Because every false positive trains developers to click "dismiss" reflexively. You need to break that habit before re-introducing the category with better criteria. Trust is rebuilt incrementally.
If an exam question offers "lower the confidence threshold" vs. "add explicit categorical criteria" — always pick explicit criteria. Confidence-based filtering is the distractor pattern in this domain.
You can write the most detailed instructions imaginable — "provide location, issue, severity, and suggested fix" — and across 50 runs, you'll still get wildly different interpretations of severity, varying detail levels, and inconsistent formatting.
Instructions describe the shape of the output. Few-shot examples demonstrate the judgment.
Instructions are the rules. Few-shot examples are the case law. Rules tell you what's legal. Case law shows you how ambiguous situations were actually decided.
The "skip" example is where the real value lives. It teaches Claude what not to flag — which directly reduces false positives. Without it, Claude will over-report acceptable patterns as issues.
When extracting data from messy documents — informal measurements, varied structures, missing fields — showing Claude how to handle different document formats prevents it from fabricating values when the format doesn't match expectations.
Good few-shot examples teach Claude to generalize judgment to novel patterns, not just match the specific cases you showed. That's why showing reasoning alongside each example matters — Claude learns the decision-making process, not just the outcome.
This domain may offer distractors that collapse prompt engineering into a single API parameter (e.g., confidence_threshold: 0.9 in a constructor). Real precision comes from prompt design, not configuration flags. Same principle — different domain.
Domain 5 — six task statements covering context degradation, escalation logic, error propagation, codebase exploration, human review, and information provenance.
Context is a finite, degrading resource. Every long interaction is a fight against three forces that erode the information your agent needs to do its job.
When context grows long, the system summarizes earlier turns. Concrete transactional facts — dollar amounts, dates, order numbers — get smoothed into vague descriptions. The exact data the agent needs to resolve an issue disappears.
"Customer disputes a pricing discrepancy on a recent order."
"Charged $47.99 on March 3rd for order #8812 — website showed $39.99."
Models process information at the beginning and end of long inputs most reliably. Information buried in the middle of a large context window is more likely to be overlooked. If the customer's key complaint was stated in turn 14 of a 30-turn conversation, it's in the danger zone.
Beginning of context = high reliability. End of context = high reliability. Middle of context = degraded attention. Place critical data at the top or bottom, never buried in the middle.
A lookup_order call returns 40+ fields — shipping carrier tracking events, warehouse codes, internal SKU data, tax breakdowns. The agent only needed 5 fields. The rest now sit in context, consuming tokens and diluting signal.
Across a multi-turn session with several lookups, this compounds: 200 irrelevant fields vs. 25 relevant ones.
Context isn't just "the conversation so far." It's an engineered input that you actively curate — deciding what enters, what format it takes, and where it's positioned.
Four techniques to counter the three threats. Each targets a specific failure mode.
Extract critical transactional data into a separate structured block included in every prompt, outside of summarized history. The conversation can be condensed — the case facts block always stays intact.
```
=== CASE FACTS ===
Customer: Jane Martinez, ID #4421
Issue 1: Order #8812 — charged $47.99, expected $39.99, placed March 3
Issue 2: Order #8840 — return requested, status: delivered March 7
Escalation: None requested
=== END CASE FACTS ===
```
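Position-aware prompt assembly can be sketched as a function that pins the case facts block to the top and the live request to the end, leaving the compressible history summary in the middle (a minimal sketch; field layout is illustrative):

```python
def build_prompt(case_facts, summarized_history, current_message):
    """Assemble a prompt with protected facts at the top and the live
    request at the end — the two positions models attend to best."""
    facts = "\n".join(f"{k}: {v}" for k, v in case_facts.items())
    return (
        f"=== CASE FACTS ===\n{facts}\n=== END CASE FACTS ===\n\n"
        f"Conversation summary:\n{summarized_history}\n\n"
        f"Current message: {current_message}"
    )
```

The history string can be summarized as aggressively as needed; the facts block is rebuilt verbatim on every turn.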
For multi-issue sessions, this becomes a structured context layer — a living document that grows as new issues surface.
Structured issue data (order IDs, amounts, statuses) lives in a separate layer from conversation history. History can be summarized aggressively because the facts are protected externally.
Post-process tool results before they enter context. Keep only the fields relevant to the agent's current role. A 40-field order object becomes 5 fields of signal.
40 fields: warehouse routing codes, internal flags, tax line items, carrier tracking events, SKU metadata…
5 fields: order ID, item total, order date, status, return eligibility
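The filtering step can be sketched as a simple allow-list applied to the raw tool result before it enters context (field names are illustrative):

```python
# Fields the agent's current role actually needs (illustrative)
RELEVANT_FIELDS = ["order_id", "item_total", "order_date", "status", "return_eligible"]

def filter_tool_result(raw_result, keep=RELEVANT_FIELDS):
    """Strip a verbose tool result down to the fields relevant to the
    agent's role; everything else never enters context."""
    return {k: raw_result[k] for k in keep if k in raw_result}
```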
Counter the lost-in-the-middle effect with deliberate placement. Key findings summaries go at the beginning of aggregated inputs. Detailed results use explicit section headers. The case facts block sits at the top of the prompt.
Upstream subagents return structured data (key facts, citations, relevance scores) instead of verbose content and reasoning chains. Downstream agents with limited context budgets receive clean, compact inputs — not paragraphs of reasoning to parse.
Subagents must include metadata in structured outputs: dates, source locations, and methodological context. This isn't optional — it supports accurate downstream synthesis (see Task 5.6).
Every technique here is about the same thing — being intentional about what enters context, what format it's in, and where it's positioned.
Escalation isn't about difficulty — it's about authority boundaries. The curriculum defines three specific triggers.
| Trigger | When It Fires | Key Nuance |
|---|---|---|
| Customer requests human | Explicit demand for a person | Non-negotiable hard rule — honor immediately |
| Policy exception / gap | Policy is silent or ambiguous on the request | Don't guess by analogy — silence = gap = escalate |
| No meaningful progress | Agent has tried and is genuinely stuck | Not "this is complex" — but "I cannot move forward" |
Complexity alone. A case can be complex but fully within the agent's capability. Escalating because something has multiple steps wastes human capacity.
Soft request: Customer says "I want to talk to a manager." The issue is a simple return. You get one offer to resolve. If they reiterate, escalate immediately. No second attempts.
Hard request: Customer says "Get me a human NOW." Escalate immediately. No investigation, no "let me just check." Emphatic demands are honored without delay.
When a customer explicitly demands a human, this is a non-negotiable hard rule. Always honor it immediately. This is NOT conditional on the agent's assessment of the case.
"If customer sounds angry, escalate." Angry customers may have simple issues. Calm customers may have impossible ones. Tone ≠ complexity.
"If model confidence < 0.7, escalate." Claude isn't well-calibrated at rating its own certainty. High confidence can coexist with wrong answers.
Your policy covers price matching against your own site's pricing errors. Customer asks for a competitor price match. The policy says nothing about competitors. The agent should not apply existing policy by analogy — silence in the policy is a gap, and gaps belong to humans.
Agent searches and finds two "Jane Martinez" accounts. The correct response is to ask for an additional identifier (zip code, email), not to pick one based on heuristics like "most recent order."
When uncertain — whether about policy, customer identity, or case facts — ask for clarification rather than making a heuristic selection. This applies to ambiguous tool results, ambiguous customer requests, and ambiguous policy interpretation.
Escalate when you lack the authority (policy gaps), the ability (no progress), or the permission (customer demands it). Everything else, resolve.
When a subagent fails, the coordinator needs enough context to make a smart decision — retry, fall back, proceed with partial results, or surface the gap to the user.
Inventory subagent times out → returns empty result as if it succeeded → coordinator tells customer "item out of stock" → but the item isn't out of stock, the system just didn't respond. False information delivered.
Inventory subagent times out → entire workflow halts → order lookup and return check both succeeded but their results are discarded → customer gets "something went wrong." Useful work wasted.
This is the foundational distinction. A subagent gets nothing back — but why?
| Type | What Happened | Coordinator Action |
|---|---|---|
| Access failure | Timeout, 500 error, connection dropped — query never completed | Retry, use fallback, inform customer of temp issue |
| Valid empty result | Query succeeded normally, no matching records exist | Confidently tell customer no record exists |
If the subagent returns "no results" for both cases, the coordinator can't distinguish them. Generic error statuses like "service unavailable" strip away decision-critical context.
```json
{
  "status": "error",
  "failure_type": "timeout",
  "what_was_attempted": "inventory check for SKU #4419",
  "partial_results": null,
  "retry_recommended": true,
  "alternatives": "check warehouse B system directly"
}
```
Subagents handle transient failures locally (retry once or twice). Only propagate errors they cannot resolve — including what was already attempted and any partial results. This prevents redundant coordinator retries.
When the coordinator assembles a response and one subagent failed, the output flags the gap explicitly: "Your order #8812 is eligible for return. Note: I wasn't able to check current inventory for a replacement — I can check again or connect you with a team member."
On failure, return structured error objects with partial results and coverage annotations — not silent failures. This is a recurring principle across the certification: always preserve work that already succeeded.
Errors are data, not just failures. A well-designed system treats every error as a decision point for the coordinator — with enough structured context to decide intelligently.
Same degradation problems, different domain. Claude Code explores a massive codebase — thousands of files, complex dependency chains. The context window fills with verbose discovery output, and early findings get lost.
Early in the session, the agent references specific classes it discovered. Later, it starts talking about "typical patterns" instead — generic training knowledge replacing its own findings. The context filled up and the discoveries were pushed out.
The agent maintains a file on disk recording key findings as it goes. When it needs to answer a question later, it reads the scratchpad first. Findings are on disk — not dependent on surviving in the context window.
```
=== SCRATCHPAD: Project Architecture ===
Entry point: src/main/App.java
Refund logic: src/services/RefundCalculator.java
Dependencies: PricingEngine, TaxService, InventoryClient
Tests: tests/services/RefundCalculatorTest.java (47 cases)
Config: refund thresholds in config/policies.yaml
```
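The scratchpad pattern reduces to two operations: append a finding when it's discovered, and read the file back before answering. A minimal sketch (the file path is an assumption, not a fixed convention):

```python
from pathlib import Path

def record_finding(scratchpad: Path, finding: str) -> None:
    """Append a discovery to disk so it survives context-window eviction."""
    with scratchpad.open("a") as f:
        f.write(finding + "\n")

def recall_findings(scratchpad: Path) -> str:
    """Read the scratchpad first when answering, instead of trusting context."""
    return scratchpad.read_text() if scratchpad.exists() else ""
```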
Instead of the main agent grepping through hundreds of files (filling its own context), it spawns subagents with isolated context windows and receives only their condensed findings.
Summarize findings from Phase 1 before spawning subagents for Phase 2. Inject those summaries into the new subagents' initial context. They start with knowledge, not from scratch.
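Because subagent context is isolated, the injection has to be explicit, typically by building the Phase 2 prompt from the Phase 1 summaries. A sketch with illustrative wording:

```python
def build_phase2_prompt(phase1_summaries: list[str], task: str) -> str:
    """Build a subagent prompt that starts with Phase 1 knowledge.

    Subagents inherit nothing, so prior findings are passed in the prompt itself.
    """
    summary_block = "\n".join(f"- {s}" for s in phase1_summaries)
    return (
        "Known findings from Phase 1:\n"
        f"{summary_block}\n\n"
        f"Your task: {task}"
    )
```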
Each agent periodically exports its state to a known file location. A coordinator-level manifest tracks completion status across all agents. On resume, the coordinator loads the manifest and injects it into the restarted agent's prompt.
```
=== RECOVERY MANIFEST ===
Phase: dependency_mapping
Completed: [architecture_scan, test_discovery]
In Progress: refund_flow_trace (agent_3)
Pending: [api_endpoint_mapping, config_analysis]
Findings: /scratchpad/findings.md
```
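On resume, the coordinator's job is mechanical: skip completed phases, restart anything that was in flight, and point the restarted agents at the saved findings. A sketch that assumes the manifest is stored as JSON with the fields above (the exact serialization is an assumption):

```python
import json
from pathlib import Path

def plan_resume(manifest_path: Path) -> dict:
    """Load the recovery manifest and decide what to restart."""
    manifest = json.loads(manifest_path.read_text())
    # Work in progress at crash time is restarted alongside pending work
    return {
        "skip": manifest["completed"],        # never redo finished phases
        "run": manifest["in_progress"] + manifest["pending"],
        "inject": manifest["findings_file"],  # prior findings go into the prompt
    }
```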
A practical Claude Code command. When context fills with verbose discovery output, /compact compresses it — reducing context usage so the session can continue. A pragmatic escape valve, not a replacement for the other techniques.
Scratchpads, subagent isolation, phase summaries, and manifests are all ways of externalizing knowledge so it doesn't depend on surviving inside the context window.
At scale, you need to know when to trust the model's output and when to route it to a human reviewer. Aggregate accuracy metrics can mask critical failures.
97% overall accuracy sounds great. But: 99.5% on standard invoices (80% of volume), 82% on handwritten purchase orders, 74% on international documents. Automating based on the aggregate number lets 1 in 4 international documents through with errors.
You must validate accuracy by document type and field segment — not just overall. A single number can't tell you where the system is reliable and where it isn't.
Pull samples from each document type and field category, including high-confidence extractions. This catches calibration drift (model becomes overconfident on new formats) and detects novel error patterns not in the original validation set. This is ongoing, not one-time.
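Per-segment validation is a small computation; the danger is skipping it. A sketch, where the segment labels are illustrative:

```python
def stratified_accuracy(samples):
    """Accuracy per segment, not just the misleading aggregate.

    `samples` is an iterable of (segment, correct: bool) pairs.
    """
    by_segment = {}
    for segment, correct in samples:
        hits, total = by_segment.get(segment, (0, 0))
        by_segment[segment] = (hits + correct, total + 1)
    return {seg: hits / total for seg, (hits, total) in by_segment.items()}
```

Run this on the labeled validation set and the 97% aggregate decomposes into the per-type numbers that actually drive routing decisions.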
Instead of one score per document, the model outputs a score per field. A low score on "total amount" routes just that field to human review, even if the rest looks fine.
Confidence thresholds must be calibrated using labeled validation sets. You don't pick 0.85 as a cutoff because it "feels right." You measure where the model's scores actually correlate with accuracy and set thresholds from that empirical evidence.
| Condition | Route |
|---|---|
| Low model confidence | → Human review |
| Ambiguous / contradictory source | → Human review |
| High confidence, calibrated segment | → Automated + stratified sampling safety net |
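The routing logic itself stays simple once the threshold is calibrated. A per-field sketch, where the threshold value is assumed to come from a labeled validation set rather than being hand-picked:

```python
def route_field(field_confidence: float, is_ambiguous: bool, threshold: float) -> str:
    """Route one extracted field using an empirically calibrated threshold."""
    if is_ambiguous or field_confidence < threshold:
        return "human_review"
    return "automated"  # still covered by stratified sampling downstream
```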
Without calibration, a confidence score is just a number. With calibration, it's a routing decision. Aggregate metrics can lie — stratify by segment to find the truth.
When multiple subagents gather information from different sources, the synthesis agent must combine it without losing track of where things came from.
Subagent A: market size is $4.2B (industry report, 2023). Subagent B: market size is $3.8B (government dataset, 2021). Synthesis agent writes: "The market size is approximately $4 billion."
Three failures: source attribution gone, temporal context gone, conflict silently resolved by averaging.
Each subagent returns findings tied to their source — URL, document name, publication date, methodology. The synthesis agent must preserve these mappings through the final output.
```json
{
  "claim": "Market size is $4.2B",
  "source": "Gartner Industry Report 2023",
  "source_url": "https://...",
  "publication_date": "2023-06-15",
  "methodology": "Survey of 500 enterprises"
}
```
When credible sources disagree, include both values with explicit annotation. The subagent does not pick a winner. The coordinator or end user decides how to reconcile. Conflicts are preserved, never hidden.
The $3.8B (2021) and $4.2B (2023) figures might not conflict at all — they could reflect market growth. Without publication dates in the structured output, the synthesis agent treats them as contradictory claims about the same point in time. Requiring dates in subagent outputs prevents this specific class of misinterpretation.
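A synthesis step that honors these rules refuses to average. A sketch, assuming claims arrive as dicts shaped like the structured output above:

```python
def merge_claims(claims):
    """Synthesize claims without averaging away conflicts.

    If values differ, both are kept with source and date so the
    coordinator or end user can reconcile them.
    """
    values = {c["claim"] for c in claims}
    if len(values) == 1:
        return {"resolved": claims[0]["claim"], "conflict": False}
    return {
        "conflict": True,
        "candidates": [
            {"claim": c["claim"], "source": c["source"], "date": c["publication_date"]}
            for c in claims
        ],
    }
```

Applied to the market-size example, this surfaces both the $4.2B (2023) and $3.8B (2021) figures with their dates, which is exactly what lets a reader notice they may describe growth rather than disagreement.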
The synthesis output should match the nature of the content: financial data as tables, news as prose, technical findings as structured lists. Don't flatten everything into one uniform format — that loses information and readability.
Structure reports with explicit sections distinguishing well-established findings from contested ones. Preserve original source characterizations and methodological context. The reader should always know what's settled and what's debated.
Provenance is the chain of custody for every fact. If you can't trace a claim back to its source, with its date and methodology, the synthesis is unreliable — no matter how polished the final output looks.