Domain 1 — Agentic Architecture & Orchestration
Claude Certified Architect — Foundation

Agentic Architecture
& Orchestration

Domain 1 — seven core concepts from agentic loops to session forking, distilled for exam readiness.

01

The Agentic Loop

The agentic loop is a while loop where Claude decides when to stop — not your code. You send a message, Claude responds, and you check stop_reason to decide what happens next.

Core Rule

If stop_reason == "tool_use" → execute the tool, append the result, loop again. If stop_reason == "end_turn" → Claude is done, exit.

Loop Pseudocode

messages = [{"role": "user", "content": user_query}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        tools=my_tools,
        messages=messages
    )
    # Append Claude's response to conversation history
    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason == "end_turn":
        break

    # Execute tools, append results as a user message
    tool_results = execute_tools(response.content)
    messages.append({"role": "user", "content": tool_results})

Tool results are appended as a user message. This is how Claude "sees" what happened — it's all conversation history.
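The execute_tools helper from the loop above might be sketched like this. Dict-shaped blocks stand in for the SDK's content block objects, and tool_registry is an assumed mapping from tool names to your own implementations:

```python
def execute_tools(content_blocks, tool_registry):
    """Run each tool_use block; return tool_result blocks for the next user message."""
    results = []
    for block in content_blocks:
        if block["type"] != "tool_use":
            continue  # text blocks need no execution
        fn = tool_registry[block["name"]]      # your implementation (assumption)
        output = fn(**block["input"])
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],        # must match the tool_use block's id
            "content": str(output),
        })
    return results
```

The tool_use_id linkage is what lets Claude pair each result with the call it made.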

Three Anti-Patterns to Avoid

✕ Don't do this

Parsing Claude's text for "I'm done" to decide loop termination. Use stop_reason, not natural language signals.

✕ Don't do this

Hardcoding iteration caps as the primary stop mechanism. A safety cap is fine, but shouldn't be the design.

✕ Don't do this

Checking for assistant text content as a completion indicator. Claude can emit text and tool calls in the same response.

✓ Do this

Always use stop_reason as the single source of truth for loop control. Model-driven termination is the correct pattern.

Key Takeaway

Claude is the reasoning engine. Your code is the executor. The loop continues until Claude says it's done via stop_reason.

02

Multi-Agent Orchestration

Instead of one agent doing everything, a coordinator delegates to subagents in a hub-and-spoke pattern. The coordinator is the project manager — it never does the actual work itself.

Coordinator Responsibilities

Pattern

Analyze the request → Decompose into subtasks → Delegate to subagents → Aggregate results → Evaluate completeness → loop if gaps remain.

Hub-and-Spoke Architecture

              User Query
                  │
                  ▼
           ┌─────────────┐
           │ Coordinator │ ◄── Evaluates completeness
           └─────────────┘
            │     │     │
            ▼     ▼     ▼
            S1    S2    S3   ◄── Subagents (isolated context)
            │     │     │
            └─────┴─────┘
                  │
                  ▼
          Aggregated Result
Critical Rule

Subagents have isolated context. They do NOT inherit the coordinator's conversation, other subagents' findings, or the user's original query. You must pass everything explicitly.

Smart vs. Dumb Coordination

✕ Pipeline (dumb)

Every query runs through ALL subagents in sequence regardless of complexity. "What are your hours?" hits search → analysis → synthesis.

✓ Dynamic routing (smart)

Coordinator analyzes query, picks only the subagents needed. Simple lookups skip analysis. Complex queries use the full pipeline.

The Gap-Checking Loop

After synthesis, the coordinator asks: "Is this complete?" If not, it re-delegates with targeted queries and re-invokes synthesis. This iterative refinement is what separates a good system from a basic one.
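The gap-checking loop can be sketched as follows. delegate, synthesize, and check_gaps are assumed caller-supplied callables standing in for subagent invocations:

```python
def coordinate(query, delegate, synthesize, check_gaps, max_rounds=3):
    """Iterative refinement: delegate, synthesize, re-delegate while gaps remain.

    Assumed interfaces:
      delegate(queries)    -> list of findings
      synthesize(findings) -> draft answer
      check_gaps(draft)    -> list of follow-up queries ([] when complete)
    max_rounds is a safety cap, not the primary stop mechanism.
    """
    findings = delegate([query])
    for _ in range(max_rounds):
        draft = synthesize(findings)
        gaps = check_gaps(draft)
        if not gaps:
            return draft                 # coordinator judged the answer complete
        findings += delegate(gaps)       # targeted re-delegation
    return synthesize(findings)          # deliver best effort at the cap
```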

Watch Out

Over-decomposition trap: Slicing a topic too narrowly means each subagent covers a tiny slice well but misses the connections between them. Decompose at the right granularity.

Key Takeaway

The coordinator is the brain. Subagents are the hands. All communication flows through the coordinator — giving you observability, error handling, and control.

03

Subagent Mechanics

The hands-on mechanics of spawning subagents, passing context, and configuring agent definitions.

The Task Tool

Rule

Subagents are spawned via the Task tool. The coordinator's allowedTools must include "Task" — otherwise it cannot spawn subagents. Each Task call creates a fresh agent with its own context window.

AgentDefinition

{
    "description": "Searches the web for current information",
    "system_prompt": "You are a focused research agent...",
    "allowed_tools": ["web_search", "web_fetch"]
}

Each subagent gets constrained tools. The description field is how the coordinator decides which subagent to invoke — it's decision logic, not documentation.

Explicit Context Passing

✕ Broken — assumes inheritance

Task("search_agent", "Find more about the topic") — the subagent has no idea what "the topic" is.

✓ Working — explicit context

Task("search_agent", "Find recent research on lithium-ion battery recycling. Prior analysis found: [hydro, pyro]. Focus on newer methods.")

Structured Data for Attribution

{
  "findings": [{
    "content": "Recycling rates improved 30%",
    "source_url": "https://example.com/study",
    "document": "2024 Battery Report",
    "page": 14
  }],
  "gaps_identified": ["Cost comparison data missing"]
}

Separate content from metadata so downstream agents can write summaries and provide attribution.

Parallel Spawning

Power Move

Emit multiple Task calls in a single coordinator response to run subagents concurrently. Massively faster than sequential delegation when subtasks are independent.
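In plain Python, concurrent subagent execution can be sketched with a thread pool. spawn stands in for emitting a Task call; all names here are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagents_parallel(tasks, spawn):
    """Run independent subagent tasks concurrently.

    tasks: list of (agent_name, full_context_prompt) tuples -- note each
           prompt must carry its own context, since subagents inherit nothing.
    spawn: callable that runs one subagent and returns its result (assumption).
    """
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(spawn, name, prompt) for name, prompt in tasks]
        return [f.result() for f in futures]   # results in task order
```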

Coordinator Prompt Philosophy

✕ Procedural

"Step 1: Call search agent. Step 2: Call analysis. Step 3: Call synthesis."

✓ Goal-oriented

"Produce a research report with 5+ credible sources covering X, Y, Z. Delegate to specialists as needed. Verify coverage before delivering."

Key Takeaway

Context doesn't flow automatically — you pump it explicitly. Use structured formats to preserve attribution. Spawn in parallel when independent. Design coordinator prompts around outcomes, not procedures.

04

Workflow Enforcement & Handoffs

Where theory meets production. Real workflows have rules that cannot be broken and sometimes require escalation to humans.

Prompts vs. Programmatic Enforcement

Foundational Principle

Prompts are suggestions; hooks are guarantees. If you'd get fired for the 1% failure, use a hook. If it's a preference, a prompt is fine.

Prerequisite Gates

def pre_tool_hook(tool_name, tool_input):
    if tool_name == "process_refund":
        if not workflow_state["customer_verified"]:
            return BLOCK("Cannot process refund: customer not verified")
        if not workflow_state["order_retrieved"]:
            return BLOCK("Cannot process refund: order not retrieved")
    return ALLOW

Claude doesn't even know this gating exists. It tries to call the tool, gets blocked with an explanation, and naturally adjusts within the agentic loop.
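One hedged sketch of how workflow_state might get populated: a PostToolUse hook records completed prerequisites as tools succeed. The tool names and result fields here are assumptions for illustration:

```python
workflow_state = {"customer_verified": False, "order_retrieved": False}

def post_tool_hook(tool_name, tool_result):
    """Record completed prerequisites so the pre-hook gate can check them.

    Tool names ("verify_customer", "get_order") and result fields
    ("verified", "found") are illustrative assumptions.
    """
    if tool_name == "verify_customer" and tool_result.get("verified"):
        workflow_state["customer_verified"] = True
    if tool_name == "get_order" and tool_result.get("found"):
        workflow_state["order_retrieved"] = True
    return tool_result   # pass the result through unchanged
```

Together the two hooks form the gate: the post-hook records state, the pre-hook enforces it.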

Multi-Concern Decomposition

Customer says: "Wrong item, promo code didn't work, and update my address." — that's three issues. A good agent decomposes them, investigates all three in parallel using shared context, then synthesizes one unified response.

Structured Handoff Summaries

json{
  "customer_id": "CUST-4521",
  "issue_summary": "Defective product, requests full refund",
  "root_cause": "Shipped from damaged batch #B-229",
  "recommended_action": "Full refund $189.99 + 15% credit",
  "reason_for_escalation": "Exceeds $100 agent authority"
}
Key Insight

Escalation isn't failure — it's a designed part of the system. The agent's job is to minimize the work the human has to do by handing off a complete picture.

Key Takeaway

Programmatic gates for rules that can't fail. Parallel investigation for multi-concern requests. Structured summaries for human handoffs.

05

SDK Hooks Deep Dive

Hooks sit between Claude and the outside world, intercepting in two directions.

Claude ──call───► ┌──────────┐ ───► Tool Execution
                  │ PRE-HOOK │
                  └──────────┘

Claude ◄─result── ┌───────────┐ ◄── Tool Execution
                  │ POST-HOOK │
                  └───────────┘

Post-Hooks: Data Normalization

Different backends return dates as Unix epochs, ISO 8601, or human strings. Claude usually handles this, but sometimes makes comparison errors. A PostToolUse hook fixes this deterministically.

def post_tool_normalize(tool_name, tool_result):
    normalized = tool_result.copy()
    # Normalize all dates to ISO 8601
    for key in ["created_at", "ship_date", "due_date"]:
        if key in normalized:
            normalized[key] = to_iso8601(normalized[key])
    # Normalize status codes to readable strings
    if "status" in normalized and isinstance(normalized["status"], int):
        normalized["status"] = status_map.get(normalized["status"])
    return normalized
Design Principle

Claude should reason about your domain, not about data format inconsistencies. Normalize field names, null handling, error formats, and date representations before Claude ever sees them.
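A to_iso8601 helper like the one referenced above might look like this: a best-effort sketch, with a format list you would extend for your own backends:

```python
from datetime import datetime, timezone

def to_iso8601(value):
    """Best-effort date normalization (a sketch; extend the formats as needed)."""
    if isinstance(value, (int, float)):                 # Unix epoch seconds
        return datetime.fromtimestamp(value, tz=timezone.utc).isoformat()
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"):   # common human formats
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            pass
    return datetime.fromisoformat(value).isoformat()    # already ISO-ish
```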

Pre-Hooks: Policy Enforcement

def pre_tool_policy_check(tool_name, tool_input):
    if tool_name == "process_refund":
        amount = float(tool_input.get("amount", 0))
        if amount > 500.00:
            return {
                "action": "BLOCK",
                "message": "Exceeds $500 limit. Escalate to human."
            }
    return {"action": "ALLOW"}

The block message is instructive — it tells Claude what to do instead. Claude receives this as the tool result and naturally adjusts.

Hook vs. Prompt Decision Framework

Requirement                     Hook   Prompt
Refund limits                    ✓
Data format consistency          ✓
Tone of voice                           ✓
PII redaction before logging     ✓
Suggested upsells                       ✓
Blocking admin tools             ✓
Preferred resolution order              ✓
Key Takeaway

Post-hooks clean incoming data so Claude reasons clearly. Pre-hooks enforce rules so Claude can't break policy. Together they separate what Claude thinks about from what it's allowed to do.

06

Task Decomposition Strategies

How to break big problems into small ones — and knowing which pattern fits which situation.

Pattern A: Prompt Chaining (Fixed Pipeline)

Steps are known in advance. Each feeds into the next. Best for predictable workflows like code review, report generation, multi-language translation.

Input ──► Step 1 ──► Step 2 ──► Step 3 ──► Output
Code Review Pattern

Pass 1 (per-file): Each file gets focused individual analysis — full attention, no dilution.
Pass 2 (cross-file): Integration pass to catch systemic patterns spanning multiple files.
You'd miss things doing either pass alone.

Pattern B: Dynamic Adaptive Decomposition

Steps emerge as you go. What you discover in step 1 determines what step 2 even is. Best for investigative tasks like debugging, legacy test coverage, open-ended research.

Input ──► Discover ──► Plan ──► Execute ──► Adapt ──► ...
              ▲                               │
              └────── new subtasks emerge ────┘

Decision Framework

Question                                 Yes →             No →
Do I know all steps before starting?     Prompt chaining   Dynamic
Does step N change what step N+1 is?     Dynamic           Prompt chaining

Real-World Mappings

Task                                   Pattern
Summarize a doc in 3 languages         Prompt chaining
Debug why the app is slow              Dynamic decomposition
Review a PR against style guidelines   Prompt chaining
Migrate REST to GraphQL                Dynamic decomposition
Monthly report from 5 data sources     Prompt chaining
Figure out why customers churn         Dynamic decomposition
Hybrid Reality

In practice, many workflows use a fixed outer pipeline where individual steps use dynamic decomposition internally. The coordinator runs the pipeline; subagents adapt within each step.
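The hybrid shape can be sketched as a fixed three-step pipeline whose middle step loops on newly discovered leads. All four callables are assumptions standing in for subagent calls:

```python
def hybrid_pipeline(sources, gather, analyze_step, synthesize):
    """Fixed outer pipeline (gather -> analyze -> synthesize); the analyze
    step adapts dynamically, chasing follow-up leads it discovers.

    Assumed interfaces:
      gather(source)        -> finding
      analyze_step(findings)-> (analysis, list_of_new_leads)
      synthesize(analysis)  -> final output
    """
    findings = [gather(s) for s in sources]        # outer step 1: fixed
    analysis, leads = analyze_step(findings)       # outer step 2: dynamic inside
    while leads:                                   # subtasks that emerged mid-step
        findings += [gather(lead) for lead in leads]
        analysis, leads = analyze_step(findings)
    return synthesize(analysis)                    # outer step 3: fixed
```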

Key Takeaway

Prompt chaining when predictable. Dynamic decomposition when investigative. Split big reviews into per-unit passes + cross-unit integration. Give agents permission to adapt via goal-oriented prompts.

07

Session State & Forking

How agents maintain continuity across time and explore branching paths.

Named Session Resumption

# Start a named session
claude --session "auth-refactor" "Analyze the auth module for security issues"

# Resume later
claude --resume "auth-refactor" "Now suggest fixes for the top 3 issues"

When Resumption Breaks

The Stale Context Trap

If you edited files between sessions, the agent's context contains old tool results. It doesn't know things changed. Either tell it exactly what changed (small edits) or start fresh with an injected summary (extensive changes).

Option A: Inform the agent

"I updated auth.py lines 45-60 and fixed the token bug in middleware.py. Re-analyze these two files." — good for small, targeted changes.

Option B: Fresh + injected summary

Start a new session seeded with distilled findings, status, and a list of changes. More reliable when files have changed extensively.

Fork-Based Session Management

Shared baseline analysis
    │
    ├── Fork A: "Fix using JWT with refresh rotation"
    │
    └── Fork B: "Fix using session-based auth with Redis"

Both forks see the full analysis. Neither contaminates the other.

Where Forking Shines

Use Cases

Comparing two testing strategies. Trying two refactoring approaches before committing. Exploring optimistic vs. conservative solutions. A/B testing prompt strategies for a subagent.

Decision Framework

Situation                       Strategy
Files mostly unchanged          Resume the session
Files heavily modified          Fresh session + injected summary
Want to explore 2+ approaches   Fork from shared baseline
Linear:     S1 ──► S2 ──► S3                    (resume)

Branching:  S1 ──► S2 ──┬── S3a                 (fork)
                        └── S3b

Restart:    S1 ──► S2         S3(summary)       (fresh + inject)
                   ✕           ▲
                   stale       condensed knowledge
Key Takeaway

Resume when context is fresh. Fork when exploring alternatives. Start fresh with injected summaries when the world has changed. The judgment call is always about context freshness.

Domain 1 of 5 · Claude Certified Architect — Foundation

Tool Design &
MCP Integration

Domain 2 — covering tool descriptions, error handling, tool distribution, tool_choice, MCP servers, and built-in tools.

Tool Descriptions — Claude's Decision-Making Compass

Claude doesn't see your code. It reads tool descriptions to decide which tool to use. When descriptions are vague or overlapping, Claude guesses — and guesses wrong.

✗ Vague & Overlapping
"analyze_content"
"Analyzes content and returns results"

"analyze_document"
"Analyzes a document and provides analysis"
✓ Clear & Differentiated
"extract_web_results"
"Extracts structured data from search
result pages. Input: raw HTML. Output:
JSON array of {title, url, snippet}.
Use instead of summarize_content for
search engine output."
Rule of thumb: If you can't tell two tools apart by reading their descriptions alone, neither can Claude.

Anatomy of a Bulletproof Tool Description

Five ingredients separate great descriptions from bad ones.

🎯

Purpose Statement

One clear sentence on what the tool does.

📥

Input Format

What does Claude need to feed this tool?

📤

Output Format

What comes back?

⚠️

Edge Cases

What happens with weird or empty input?

🚧

Boundary Explanation

When to use this tool instead of another.

All Five in Action

"verify_claim_against_source"

PURPOSE:   "Checks whether a factual claim is supported
           by a given source document."

INPUT:    "JSON with 'claim' (string, one sentence)
           and 'source_text' (string, full text)."

OUTPUT:   "JSON with 'supported' (bool), 'confidence'
           (0.0-1.0), and 'relevant_excerpt' (string)."

EDGE:     "Returns supported=false, confidence=0.0 if
           source is empty or claim not addressed."

BOUNDARY: "Use this for fact-checking a single claim.
           Use summarize_content for overviews.
           Use extract_data_points for structured data."

Split Generic Tools Into Specialists

✗ One Generic Tool
analyze_document
Does everything vaguely — Claude
has to guess which behavior you want
✓ Purpose-Specific Tools
extract_data_points  → structured fields
summarize_content    → overviews
verify_claim         → fact-checking
💡 Mental Model

Hiring three specialists (plumber, electrician, painter) instead of one generalist. Each knows their lane — results are better across the board.


System Prompt Sabotage

Perfect tool descriptions can still fail if your system prompt contains keywords that accidentally create tool associations.

The Keyword Magnet Problem

# System prompt contains:
"Always analyze content before responding."

# Tool name happens to be:
analyze_content

# Result: Claude gravitates to this tool
# regardless of context, because the keyword
# match is strongest.

The Two-Step Audit

Step 1: Scan for Action Verbs

Find words like "analyze," "search," "check," "extract," "look up" in your system prompt. These are the magnets.

Step 2: Cross-Reference Tool Names

If any of those verbs match a tool name or description, you have a potential collision. Reword the system prompt.

✗ Keyword Collision
System prompt:
  "Always analyze content carefully..."
Tool name:
  analyze_content
✓ Neutral Language
System prompt:
  "Be thorough when processing requests..."
Tool name:
  extract_web_results
The Golden Rule: System prompts define behaviors and goals. Tool descriptions handle tool selection. Don't let them bleed into each other.

Structured Error Responses

When a tool fails, does it give Claude enough information to recover intelligently? Generic errors lead to generic (bad) recovery.

Four Error Categories

Transient

Server timeout, temporary unavailability. Retryable.

🚫

Validation

Malformed input, wrong data type. Not retryable without fixing input.

📋

Business

Policy violation, rule blocked. Not retryable.

🔒

Permission

No access to this resource. Not retryable without escalation.
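A sketch of mapping raised exceptions onto these four categories; the exception-to-category mapping is an illustrative assumption, not a standard:

```python
RETRYABLE = {"transient": True, "validation": False,
             "business": False, "permission": False}

def classify_error(exc):
    """Map a raised exception to an error category (illustrative mapping)."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return "transient"                    # retryable
    if isinstance(exc, (ValueError, TypeError)):
        return "validation"                   # fix the input first
    if isinstance(exc, PermissionError):
        return "permission"                   # needs escalation
    return "business"                         # default: a domain rule blocked it

def error_response(exc):
    """Build the structured error payload shown above."""
    cat = classify_error(exc)
    return {"isError": True, "content": str(exc),
            "metadata": {"errorCategory": cat, "isRetryable": RETRYABLE[cat]}}
```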

Generic vs Structured Error

✗ Generic — Useless
{
  "isError": true,
  "content": "Operation failed"
}
✓ Structured — Actionable
{
  "isError": true,
  "content": "Order below $50 minimum",
  "metadata": {
    "errorCategory": "business",
    "isRetryable": false,
    "customerMessage": "Discounts
      require orders of $50+",
    "currentTotal": 32.99
  }
}

Critical Gotcha: Empty Results ≠ Errors

✗ Empty = Error (Wrong)
{
  "isError": true,
  "content": "No customers found"
}
Claude may retry or apologize for a "failure"
✓ Empty = Valid Result (Correct)
{
  "isError": false,
  "content": "0 customers matched",
  "results": []
}
Claude treats this as useful information
💡 Mental Model

Think of error responses like a doctor's diagnosis. "Something's wrong" isn't helpful. "Sprained ankle, no surgery needed, rest for two weeks" — now you know exactly what to do.


Error Recovery in Multi-Agent Systems

Who handles the error — the subagent or the coordinator? It depends on the error type.

The Principle: Handle Locally, Escalate Strategically

🔄

Subagent Handles

Transient failures — retry once or twice locally. Only escalate if still failing.

📡

Coordinator Handles

Errors the subagent can't resolve, workflow-affecting failures, skip/reroute/abort decisions.

Escalation Includes Context

When local recovery fails, the subagent sends the coordinator everything it knows:

{
  "isError": true,
  "content": "Customer API timed out after 2 retries",
  "metadata": {
    "errorCategory": "transient",
    "isRetryable": true,
    "attemptsMade": 2,
    "partialResults": null,
    "whatWasAttempted": "GET /customers/{id}"
  }
}

Preserve Partial Results

If 2 of 3 lookups succeed, send back what you got. Never discard good work because of one failure.

{
  "partialResults": {
    "customerProfile": { "name": "Jane", "tier": "gold" },  ✓
    "orderHistory": [{ "id": "ORD-123" }],                  ✓
    "supportTickets": null                                  ✗ failed
  },
  "failedStep": "supportTickets"
}
The coordinator's decision tree: Enough partial results to answer? → Deliver them. Can reroute to fallback? → Try it. Non-retryable error? → Inform user. Never silently drop information.
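That decision tree can be sketched as follows; enough_to_answer and has_fallback are caller-supplied predicates, assumed for illustration:

```python
def coordinator_decide(result, enough_to_answer, has_fallback):
    """Decision tree for a failed subagent result.

    result: the structured escalation payload (as shown above).
    enough_to_answer / has_fallback: caller-supplied predicates (assumptions).
    """
    partial = result.get("metadata", {}).get("partialResults")
    if partial and enough_to_answer(partial):
        return ("deliver_partial", partial)      # enough good work survived
    if result["metadata"].get("isRetryable") and has_fallback(result):
        return ("reroute", None)                 # a fallback route exists
    return ("inform_user", result["content"])    # never silently drop it
```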

Tool Distribution — Less Is More

More tools means worse decisions. Each agent should only have tools relevant to its role.

✗ Overloaded — 18 Tools
synthesis_agent = Agent(
  tools=[
    "merge_findings",
    "generate_summary",
    "web_search",       # not its job!
    "fetch_url",        # not its job!
    ... 14 more ...
  ]
)
✓ Scoped — 2-3 Tools
synthesis_agent = Agent(
  tools=[
    "merge_findings",
    "generate_summary",
    "verify_fact",  # scoped cross-role
  ]
)

Scoped Tool Access by Role

Agent         Tools
research      web_search, fetch_url, extract_data
analysis      run_query, compute_stats, generate_chart
synthesis     merge_findings, generate_summary
coordinator   delegate_to_agent (no domain tools)

Constrained Alternatives

When a specialized agent needs limited cross-role access, replace the generic tool with a constrained version:

✗ Generic
tools=["fetch_url"]
Agent can fetch ANY URL — dangerous
✓ Constrained
tools=["load_document"]
Only accepts validated document URLs
💡 Mental Model

A kitchen: the pastry chef has whisks and piping bags. The sushi chef has knives and a bamboo mat. You could give both every tool — but the pastry chef doesn't need a sushi knife, and handing them one invites accidents.


tool_choice Configuration

Three modes that control whether and which tool Claude calls.

🤖

"auto"

Claude decides whether to call a tool and which one. Default mode — best for open-ended tasks.

"any"

Claude MUST call a tool, but picks which. Prevents conversational responses when action is needed.

🎯

Forced

Claude MUST call THIS specific tool. Guarantees a step runs — used for enforcing pipeline order.

Forced → Then Release: The Key Pattern

# Turn 1: Force metadata extraction first
response = client.messages.create(
    tools=[extract_metadata, enrich, validate],
    tool_choice={"type": "tool", "name": "extract_metadata"}
)

# Turn 2: Now let Claude choose freely
response = client.messages.create(
    tools=[enrich, validate],
    tool_choice={"type": "auto"}
)
Don't force every step. If you're forcing every single tool call, you've built a script, not an agent. Force only the steps that must happen, then let Claude reason about the rest.

Decision Framework

Question                            Answer
Need Claude to always act?          tool_choice: "any"
Specific tool must run first?       Forced, then switch to "auto"
Claude choosing well on its own?    tool_choice: "auto" (default)
Claude choosing poorly?             Fix tool descriptions first — force as last resort

MCP Server Integration

How external tools actually connect to Claude through MCP servers — configuration, scoping, and best practices.

What's an MCP Server?

Claude Agent ←→ MCP Server ←→ External Service

A translator between Claude and external services. Tools are discovered at connection time — no manual registration needed.

Project vs User Scope

.mcp.json

Project-level. Committed to version control. Shared with the whole team. Use for Jira, GitHub, shared DBs.

~/.claude.json

User-level. Personal to you. Use for experiments and tools you're testing before proposing to the team.

Configuration with Env Var Expansion

// .mcp.json — safe to commit (no actual secrets)
{
  "mcpServers": {
    "github-server": {
      "command": "npx",
      "args": ["-y", "@mcp/server-github"],
      "env": {
        "GITHUB_TOKEN": "${GITHUB_TOKEN}"  ← from .env
      }
    }
  }
}

Community vs Custom Servers

Need                         Approach
Jira / GitHub / Slack        Use existing community MCP server ✓
Standard integration         Community server — don't reinvent it ✓
Proprietary internal tool    Build a custom MCP server ✓

Enhance MCP Tool Descriptions

Claude prefers its built-in tools (like Grep) when MCP tool descriptions are vague. The fix: tell Claude explicitly when the MCP tool is better.

// Vague — Claude ignores it, uses Grep instead
"Searches code"

// Detailed — Claude understands the advantage
"Semantic search across the full codebase using
embeddings. Finds conceptually related code even
without exact keyword matches.
Use over Grep when searching by concept."

MCP Resources: The Table of Contents

✗ Without Resources
10 exploratory tool calls just to
understand what data is available

list_issues() → 50 results
get_issue(1) → read
get_issue(2) → read
...
✓ With Resources
Browse the catalog first, then go
straight to the relevant item

resource: issue_summaries → scan
get_issue(47) → the one that matters

Built-In Tools

Six built-in tools, each with a specific sweet spot. Using the wrong one wastes time and context.

The Lineup

Tool    Purpose
Grep    Search inside files — find function names, error messages, import statements
Glob    Search for files — match by name or extension pattern (e.g., **/*.test.tsx)
Read    Load a file's full contents into context
Write   Create or fully overwrite a file
Edit    Surgically modify a specific part of a file (requires unique text match)
Bash    Run any shell command (tests, installs, scripts)

Grep vs Glob — The Key Distinction

Grep = What's INSIDE files?
"Where is calculateTotal defined?"
grep -r "calculateTotal" src/
Glob = What files EXIST?
"Find all test files"
glob("**/*.test.tsx")

Edit vs Read+Write Fallback

Edit works by matching unique text. If the target text appears multiple times, Edit fails. Fall back to Read+Write:

# Edit works — "MAX_RETRIES = 3" is unique
Edit(file="config.py", old="MAX_RETRIES = 3", new="MAX_RETRIES = 5")

# Edit fails — "return None" appears 6 times
# Fallback: Read → modify programmatically → Write
content = Read("handlers.py")
updated = content.replace("return None  # fallback", "return default", 1)
Write("handlers.py", updated)

Incremental Codebase Understanding

Grep entry point ──► Read that file ──► Follow imports ──► Read next file

Start narrow, expand as needed. Never read every file upfront — most will be irrelevant and waste your context window.

Quick Reference

I need to…                        Use
Find files containing "TODO"      Grep
Find all .yaml config files       Glob
Understand what a file does       Read
Create a new file                 Write
Change one line in a big file     Edit
Edit fails (non-unique match)     Read + Write
Run tests or install packages     Bash
Map function usage                Grep → Read → Grep

Claude Code Configs & Workflows

Real file structures and annotated code examples for every concept


Prompt Engineering
& Structured Output

Tasks 4.1 & 4.2 — Explicit criteria, false positive management, and few-shot prompting for consistent, reliable output.

Design Prompts with Explicit Criteria

The Core Problem

Vague instructions like "be conservative" or "only report high-confidence findings" give Claude no actionable framework. It still has to decide what conservative means for every finding — and it will decide inconsistently. Explicit categorical criteria always outperform confidence-based filtering.

✕ Vague Prompt

"Review this code and flag issues. Be conservative — only report high-confidence findings."

✓ Explicit Prompt

"Flag comments only when claimed behavior contradicts actual code behavior. Do NOT flag: minor style differences, local naming conventions, or TODO comments."

The Trust Cascade

False positives are poisonous — not just locally, but globally. If your code review agent has a 60% false positive rate on style suggestions, developers start ignoring everything, including accurate bug reports. High false-positive categories undermine confidence in accurate categories.

Architecture Insight

The fix for a high false-positive category isn't tweaking confidence thresholds. It's either making the criteria razor-sharp with concrete code examples, or temporarily disabling that category entirely while you improve the prompts. Restoring trust comes first.

Building Precise Criteria
Define what TO report — with code examples

Don't just say "flag bugs." Provide concrete examples of what constitutes a bug in your context:

// FLAG — Assignment in conditional (likely a bug)
if (user.age = 18) { ... }

// FLAG — Unchecked null dereference
const name = response.data.user.name;
// when response.data could be null

Each severity level should have its own concrete examples. Abstract definitions like "high = significant impact" produce inconsistent classifications.

Define what to SKIP — equally important

Telling Claude what not to flag is where most false positive reduction comes from:

// SKIP — Short names in reduce callbacks (idiomatic)
const sum = arr.reduce((a, b) => a + b, 0);

// SKIP — TODO comments (tracked elsewhere)
// TODO: refactor after v2 migration

// SKIP — Team-local naming conventions
const usr_mgr = new UserManager();
Disable high false-positive categories temporarily

If a category (e.g., "style suggestions") consistently produces noise, the pragmatic move is to disable it entirely while you refine the prompts.

Why? Because every false positive trains developers to click "dismiss" reflexively. You need to break that habit before re-introducing the category with better criteria. Trust is rebuilt incrementally.

Exam Heuristic

If an exam question offers "lower the confidence threshold" vs. "add explicit categorical criteria" — always pick explicit criteria. Confidence-based filtering is the distractor pattern in this domain.

Few-Shot Prompting for Consistency

Why Instructions Alone Aren't Enough

You can write the most detailed instructions imaginable — "provide location, issue, severity, and suggested fix" — and across 50 runs, you'll still get wildly different interpretations of severity, varying detail levels, and inconsistent formatting.

Instructions describe the shape of the output. Few-shot examples demonstrate the judgment.

The Mental Model

Instructions are the rules. Few-shot examples are the case law. Rules tell you what's legal. Case law shows you how ambiguous situations were actually decided.

What Makes a Good Few-Shot Example

  • 2–4 examples — enough to establish the pattern, not so many that you waste context
  • Target ambiguous cases — easy cases teach Claude nothing useful
  • Show reasoning for why one action was chosen over a plausible alternative
  • Demonstrate the exact output format you expect
  • Distinguish acceptable patterns vs. genuine issues
Code Review Agent — Two Examples
Example 1 — Flag (genuine bug)
Code: if (user.age = 18)
Finding: {
  "location": "line 12",
  "issue": "Assignment in conditional",
  "severity": "high",
  "fix": "Change = to =="
}
Reasoning: Single = in a conditional is almost never intentional.
Example 2 — Skip (acceptable pattern)
Code: const x = arr.reduce(
  (a, b) => a + b, 0
);
Finding: null
Reasoning: Short variable names in reduce callbacks are idiomatic. Not a style issue worth flagging.

Example 2 Does the Heavy Lifting

The "skip" example is where the real value lives. It teaches Claude what not to flag — which directly reduces false positives. Without it, Claude will over-report acceptable patterns as issues.
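Assembled into a prompt, the two examples above might look like this sketch; the rule text and formatting are illustrative, not a prescribed template:

```python
# Few-shot examples drawn from the code-review cases above:
# one "flag" case and one "skip" case, each with reasoning.
FEW_SHOT = [
    {"code": "if (user.age = 18)",
     "finding": '{"location": "line 12", "issue": "Assignment in conditional", '
                '"severity": "high", "fix": "Change = to =="}',
     "reasoning": "Single = in a conditional is almost never intentional."},
    {"code": "const x = arr.reduce((a, b) => a + b, 0);",
     "finding": "null",
     "reasoning": "Short names in reduce callbacks are idiomatic; skip."},
]

def build_review_prompt(code_to_review):
    """Assemble a few-shot review prompt: rule first, then worked examples
    (including the 'skip' case), then the new input."""
    parts = ["Flag only when claimed behavior contradicts actual behavior.\n"]
    for ex in FEW_SHOT:
        parts.append(f"Code: {ex['code']}\nFinding: {ex['finding']}\n"
                     f"Reasoning: {ex['reasoning']}\n")
    parts.append(f"Code: {code_to_review}\nFinding:")
    return "\n".join(parts)
```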

Few-Shot Examples Kill Hallucination in Extraction

When extracting data from messy documents — informal measurements, varied structures, missing fields — showing Claude how to handle different document formats prevents it from fabricating values when the format doesn't match expectations.

Example use cases:

  • Inline citations vs. bibliographies — show both
  • Methodology sections vs. embedded details — show where to look
  • Informal measurements ("about 3 feet") — show how to normalize
  • Missing fields — show returning null instead of inventing data
Generalization, Not Memorization

Good few-shot examples teach Claude to generalize judgment to novel patterns, not just match the specific cases you showed. That's why showing reasoning alongside each example matters — Claude learns the decision-making process, not just the outcome.

Exam Cheat Sheet — 4.1 & 4.2

Task 4.1 — Key Takeaways

  • Explicit categorical criteria > confidence-based filtering — always
  • "Be conservative" is not a precision strategy — it's a vague instruction
  • Define what to report AND what to skip with concrete code examples
  • Each severity level needs its own concrete examples, not abstract descriptions
  • High false-positive categories poison trust in ALL categories
  • Temporarily disable noisy categories rather than tolerating false positives
  • Think "contract writing" — not "giving advice"

Task 4.2 — Key Takeaways

  • Few-shot examples are THE most effective technique when instructions produce inconsistent output
  • 2–4 examples targeting ambiguous cases — not easy ones
  • Each example shows reasoning for WHY one action was chosen
  • "Skip" examples reduce false positives more than "flag" examples
  • Few-shot enables generalization to novel patterns — not just pattern matching
  • For extraction tasks: show handling of varied document structures to prevent hallucination
  • Instructions = rules. Few-shot examples = case law.

Distractor Patterns to Watch For

  • ❌ "Lower the confidence threshold" — confidence is not the lever
  • ❌ "Add more detailed instructions" when few-shot is the better answer
  • ❌ "Increase the number of examples to 10+" — diminishing returns, wastes context
  • ❌ "Use self-review to catch false positives" — wrong architecture (that's 4.6)
  • ✓ When in doubt: explicit criteria + targeted few-shot examples
Your Known Pattern — Watch for It Here

This domain may offer distractors that collapse prompt engineering into a single API parameter (e.g., confidence_threshold: 0.9 in a constructor). Real precision comes from prompt design, not configuration flags. Same principle — different domain.

Claude Certified Architect — Foundation

Context Management
& Reliability

Domain 5 — six task statements covering context degradation, escalation logic, error propagation, codebase exploration, human review, and information provenance.

5.1a

The Three Context Threats

Context is a finite, degrading resource. Every long interaction is a fight against three forces that erode the information your agent needs to do its job.

Progressive Summarization Risk

When context grows long, the system summarizes earlier turns. Concrete transactional facts — dollar amounts, dates, order numbers — get smoothed into vague descriptions. The exact data the agent needs to resolve an issue disappears.

✗ What the summary retains

"Customer disputes a pricing discrepancy on a recent order."

✓ What was actually said

"Charged $47.99 on March 3rd for order #8812 — website showed $39.99."

The "Lost in the Middle" Effect

Models process information at the beginning and end of long inputs most reliably. Information buried in the middle of a large context window is more likely to be overlooked. If the customer's key complaint was stated in turn 14 of a 30-turn conversation, it's in the danger zone.

Position Effect

Beginning of context = high reliability. End of context = high reliability. Middle of context = degraded attention. Place critical data at the top or bottom, never buried in the middle.

Tool Result Bloat

A lookup_order call returns 40+ fields — shipping carrier tracking events, warehouse codes, internal SKU data, tax breakdowns. The agent only needed 5 fields. Those 40 fields now sit in context, consuming tokens and diluting signal.

Across a multi-turn session with several lookups, this compounds: 200 irrelevant fields vs. 25 relevant ones.

Core Principle

Context isn't just "the conversation so far." It's an engineered input that you actively curate — deciding what enters, what format it takes, and where it's positioned.

5.1b

Fighting Context Degradation

Four techniques to counter the three threats. Each targets a specific failure mode.

Technique 1: The "Case Facts" Block

Extract critical transactional data into a separate structured block included in every prompt, outside of summarized history. The conversation can be condensed — the case facts block always stays intact.

=== CASE FACTS ===
Customer: Jane Martinez, ID #4421
Issue 1: Order #8812 — charged $47.99, expected $39.99, placed March 3
Issue 2: Order #8840 — return requested, status: delivered March 7
Escalation: None requested
=== END CASE FACTS ===

For multi-issue sessions, this becomes a structured context layer — a living document that grows as new issues surface.

Pattern — Persistent Context Layer

Structured issue data (order IDs, amounts, statuses) lives in a separate layer from conversation history. History can be summarized aggressively because the facts are protected externally.
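A minimal sketch of such a layer, assuming a simple in-memory store; the `CaseFacts` class and `build_prompt` helper are hypothetical names, not a library API.

```python
class CaseFacts:
    """Persistent facts layer, kept separate from summarizable history."""

    def __init__(self, customer):
        self.customer = customer
        self.issues = []

    def add_issue(self, description):
        # The layer is a living document: new issues are appended as they surface.
        self.issues.append(description)

    def render(self):
        lines = ["=== CASE FACTS ===", f"Customer: {self.customer}"]
        for i, issue in enumerate(self.issues, 1):
            lines.append(f"Issue {i}: {issue}")
        lines.append("=== END CASE FACTS ===")
        return "\n".join(lines)

def build_prompt(facts, summarized_history, new_message):
    # Facts block at the TOP (a high-attention position); history may be lossy.
    return (f"{facts.render()}\n\nHistory (summarized):\n{summarized_history}"
            f"\n\nCustomer: {new_message}")

facts = CaseFacts("Jane Martinez, ID #4421")
facts.add_issue("Order #8812: charged $47.99, expected $39.99 (placed March 3)")
```

History can then be summarized aggressively; the exact amounts and order numbers are re-rendered verbatim on every turn.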

Technique 2: Trimming Tool Outputs

Post-process tool results before they enter context. Keep only the fields relevant to the agent's current role. A 40-field order object becomes 5 fields of signal.

✗ Raw tool output

40 fields: warehouse routing codes, internal flags, tax line items, carrier tracking events, SKU metadata…

✓ Trimmed output

5 fields: order ID, item total, order date, status, return eligibility
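The trimming step can be a one-line filter applied before the tool result enters context. The field names below are illustrative; keep whatever your agent's current role actually needs.

```python
# Fields relevant to the agent's current role (hypothetical schema).
RELEVANT_FIELDS = ["order_id", "item_total", "order_date", "status", "return_eligible"]

def trim_tool_result(raw: dict, keep=RELEVANT_FIELDS) -> dict:
    """Drop everything except the fields the agent needs right now."""
    return {k: raw[k] for k in keep if k in raw}

raw = {
    "order_id": "8812", "item_total": 47.99, "order_date": "2024-03-03",
    "status": "delivered", "return_eligible": True,
    # ...plus dozens more fields of warehouse codes, tax lines, carrier events:
    "warehouse_code": "W-19", "internal_flag": 0, "sku_meta": {"sku": "K-553"},
}

trimmed = trim_tool_result(raw)  # only the 5 signal fields enter context
```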

Technique 3: Strategic Positioning

Counter the lost-in-the-middle effect with deliberate placement. Key findings summaries go at the beginning of aggregated inputs. Detailed results use explicit section headers. The case facts block sits at the top of the prompt.

Technique 4: Structured Subagent Outputs

Upstream subagents return structured data (key facts, citations, relevance scores) instead of verbose content and reasoning chains. Downstream agents with limited context budgets receive clean, compact inputs — not paragraphs of reasoning to parse.

Exam Rule — Metadata Requirement

Subagents must include metadata in structured outputs: dates, source locations, and methodological context. This isn't optional — it supports accurate downstream synthesis (see Task 5.6).
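A sketch of what a metadata-bearing subagent payload might look like; the `SubagentFinding` dataclass and its field names are assumptions for illustration.

```python
from dataclasses import dataclass, asdict

@dataclass
class SubagentFinding:
    """One structured finding: claim plus the required metadata."""
    claim: str
    source_location: str      # where the fact came from
    publication_date: str     # temporal context for downstream synthesis
    methodology: str          # how the source arrived at the claim
    relevance: float

def package_findings(findings):
    """Return compact structured output instead of verbose reasoning chains."""
    return {"findings": [asdict(f) for f in findings]}

out = package_findings([
    SubagentFinding(
        claim="Market size is $4.2B",
        source_location="Gartner Industry Report 2023, p. 12",
        publication_date="2023-06-15",
        methodology="Survey of 500 enterprises",
        relevance=0.9,
    )
])
```

A downstream agent with a tight context budget receives this compact payload, with the provenance fields Task 5.6 depends on already attached.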

Throughline

Every technique here is about the same thing — being intentional about what enters context, what format it's in, and where it's positioned.

5.2

Escalation & Ambiguity Resolution

Escalation isn't about difficulty — it's about authority boundaries. The curriculum defines three specific triggers.

The Three Legitimate Triggers

Trigger                 | When It Fires                                 | Key Nuance
Customer requests human | Explicit demand for a person                  | Non-negotiable hard rule — honor immediately
Policy exception / gap  | Policy is silent or ambiguous on the request  | Don't guess by analogy — silence = gap = escalate
No meaningful progress  | Agent has tried and is genuinely stuck        | Not "this is complex" — but "I cannot move forward"
What's NOT a Trigger

Complexity alone. A case can be complex but fully within the agent's capability. Escalating because something has multiple steps wastes human capacity.

The "Let Me Help" vs. "Right Away" Distinction

Soft request: Customer says "I want to talk to a manager." The issue is a simple return. You get one offer to resolve. If they reiterate, escalate immediately. No second attempts.

Hard request: Customer says "Get me a human NOW." Escalate immediately. No investigation, no "let me just check." Emphatic demands are honored without delay.

Hard Rule — Explicit Escalation

When a customer explicitly demands a human, this is a non-negotiable hard rule. Always honor it immediately. This is NOT conditional on the agent's assessment of the case.
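The three triggers, plus the soft-vs-hard nuance, can be sketched as an explicit decision function. The boolean signal names are hypothetical; how you derive each signal is up to your system.

```python
def should_escalate(explicit_human_request: bool,
                    request_is_emphatic: bool,
                    already_offered_resolution: bool,
                    policy_covers_request: bool,
                    making_progress: bool) -> bool:
    # Trigger 1: customer demands a human.
    if explicit_human_request:
        if request_is_emphatic:          # "Get me a human NOW": no delay, no checks
            return True
        if already_offered_resolution:   # soft request, one offer already made
            return True
        return False                     # soft request: one resolution attempt allowed
    # Trigger 2: policy gap. Silence is a gap; don't reason by analogy.
    if not policy_covers_request:
        return True
    # Trigger 3: genuinely stuck, not merely complex.
    if not making_progress:
        return True
    return False
```

Note what is absent: no sentiment signal and no self-reported confidence score, for the reasons given below.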

Unreliable Escalation Proxies

✗ Sentiment-based escalation

"If customer sounds angry, escalate." Angry customers may have simple issues. Calm customers may have impossible ones. Tone ≠ complexity.

✗ Self-reported confidence scores

"If model confidence < 0.7, escalate." Claude isn't well-calibrated at rating its own certainty. High confidence can coexist with wrong answers.

The Policy Gap Pattern

Your policy covers price matching against your own site's pricing errors. Customer asks for a competitor price match. The policy says nothing about competitors. The agent should not apply existing policy by analogy — silence in the policy is a gap, and gaps belong to humans.

Multiple Customer Matches

Agent searches and finds two "Jane Martinez" accounts. The correct response is to ask for an additional identifier (zip code, email), not to pick one based on heuristics like "most recent order."

Pattern — Clarify, Don't Guess

When uncertain — whether about policy, customer identity, or case facts — ask for clarification rather than making a heuristic selection. This applies to ambiguous tool results, ambiguous customer requests, and ambiguous policy interpretation.

Throughline

Escalate when you lack the authority (policy gaps), the ability (no progress), or the permission (customer demands it). Everything else, resolve.

5.3

Error Propagation in Multi-Agent Systems

When a subagent fails, the coordinator needs enough context to make a smart decision — retry, fall back, proceed with partial results, or surface the gap to the user.

The Two Anti-Patterns

✗ Silent Suppression

Inventory subagent times out → returns empty result as if it succeeded → coordinator tells customer "item out of stock" → but the item isn't out of stock, the system just didn't respond. False information delivered.

✗ Total Termination

Inventory subagent times out → entire workflow halts → order lookup and return check both succeeded but their results are discarded → customer gets "something went wrong." Useful work wasted.

Access Failure vs. Valid Empty Result

This is the foundational distinction. A subagent gets nothing back — but why?

Type               | What Happened                                                   | Coordinator Action
Access failure     | Timeout, 500 error, connection dropped — query never completed  | Retry, use fallback, inform customer of temp issue
Valid empty result | Query succeeded normally, no matching records exist             | Confidently tell customer no record exists

If the subagent returns "no results" for both cases, the coordinator can't distinguish them. Generic error statuses like "service unavailable" strip away decision-critical context.

Structured Error Context

{
  "status": "error",
  "failure_type": "timeout",
  "what_was_attempted": "inventory check for SKU #4419",
  "partial_results": null,
  "retry_recommended": true,
  "alternatives": "check warehouse B system directly"
}
Pattern — Local Recovery Before Propagation

Subagents handle transient failures locally (retry once or twice). Only propagate errors they cannot resolve — including what was already attempted and any partial results. This prevents redundant coordinator retries.
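A sketch of that local-recovery pattern, assuming transient failures surface as `TimeoutError`; the field names mirror the structured error object above but are still an illustration, not a fixed schema.

```python
import time

def run_with_local_recovery(query, attempt_fn, max_retries=2, delay=0.0):
    """Retry transient failures locally; propagate structured context otherwise."""
    for attempt in range(max_retries + 1):
        try:
            return {"status": "ok", "data": attempt_fn(query)}
        except TimeoutError:
            if attempt < max_retries:
                time.sleep(delay)  # brief backoff, then retry locally
    # Could not recover: propagate a structured error, not a silent empty result.
    return {
        "status": "error",
        "failure_type": "timeout",
        "what_was_attempted": query,
        "attempts_made": max_retries + 1,   # tells the coordinator not to re-retry
        "partial_results": None,
        "retry_recommended": False,         # local retries already exhausted
    }

def flaky(query):
    raise TimeoutError  # simulated always-failing backend for the demo

result = run_with_local_recovery("inventory check for SKU #4419", flaky)
```

Because `attempts_made` travels with the error, the coordinator can skip redundant retries and move straight to a fallback or a coverage annotation.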

Coverage Annotations in Synthesis

When the coordinator assembles a response and one subagent failed, the output flags the gap explicitly: "Your order #8812 is eligible for return. Note: I wasn't able to check current inventory for a replacement — I can check again or connect you with a team member."

Exam Tip — Preserve Partial Results

On failure, return structured error objects with partial results and coverage annotations — not silent failures. This is a recurring principle across the certification: always preserve work that already succeeded.

Throughline

Errors are data, not just failures. A well-designed system treats every error as a decision point for the coordinator — with enough structured context to decide intelligently.

5.4

Context in Codebase Exploration

Same degradation problems, different domain. Claude Code explores a massive codebase — thousands of files, complex dependency chains. The context window fills with verbose discovery output, and early findings get lost.

Symptom — Context Degradation

Early in the session, the agent references specific classes it discovered. Later, it starts talking about "typical patterns" instead — generic training knowledge replacing its own findings. The context filled up and the discoveries were pushed out.

Technique 1: Scratchpad Files

The agent maintains a file on disk recording key findings as it goes. When it needs to answer a question later, it reads the scratchpad first. Findings are on disk — not dependent on surviving in the context window.

=== SCRATCHPAD: Project Architecture ===
Entry point:      src/main/App.java
Refund logic:     src/services/RefundCalculator.java
Dependencies:     PricingEngine, TaxService, InventoryClient
Tests:            tests/services/RefundCalculatorTest.java (47 cases)
Config:           refund thresholds in config/policies.yaml
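The mechanics are deliberately simple: append findings as they are discovered, read the file back before answering. A minimal sketch (paths are illustrative; the demo writes to a temp directory):

```python
import os
import tempfile

def append_finding(path, key, value):
    """Record a finding on disk the moment it is discovered."""
    with open(path, "a") as f:
        f.write(f"{key}: {value}\n")

def load_scratchpad(path):
    """Read findings back before answering a later question."""
    if not os.path.exists(path):
        return ""
    with open(path) as f:
        return f.read()

pad = os.path.join(tempfile.mkdtemp(), "scratchpad.md")
append_finding(pad, "Refund logic", "src/services/RefundCalculator.java")
append_finding(pad, "Tests", "tests/services/RefundCalculatorTest.java (47 cases)")
notes = load_scratchpad(pad)  # findings survive regardless of context pressure
```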

Technique 2: Subagent Delegation

Instead of the main agent grepping through hundreds of files (filling its own context), it spawns subagents with isolated context windows:

Coordinator (clean context — high-level map)
 ├─→ Subagent A: "Find all test files related to refunds"
 ├─→ Subagent B: "Trace RefundCalculator dependency chain"
 └─→ Subagent C: "Identify API endpoints that trigger refunds"
       └─→ Each returns a concise structured summary (verbose exploration stays contained)

Technique 3: Phase-Based Summarization

Summarize findings from Phase 1 before spawning subagents for Phase 2. Inject those summaries into the new subagents' initial context. They start with knowledge, not from scratch.

Technique 4: Crash Recovery with State Manifests

Each agent periodically exports its state to a known file location. A coordinator-level manifest tracks completion status across all agents. On resume, the coordinator loads the manifest and injects it into the restarted agent's prompt.

=== RECOVERY MANIFEST ===
Phase:       dependency_mapping
Completed:   [architecture_scan, test_discovery]
In Progress: refund_flow_trace (agent_3)
Pending:     [api_endpoint_mapping, config_analysis]
Findings:    /scratchpad/findings.md
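Persisting and restoring that manifest is a plain JSON round trip. A sketch, with field names mirroring the manifest above (the demo writes to a temp directory):

```python
import json
import os
import tempfile

def save_manifest(path, manifest):
    """Periodically export coordinator state to a known file location."""
    with open(path, "w") as f:
        json.dump(manifest, f)

def load_manifest(path):
    """On resume, load state to inject into the restarted agent's prompt."""
    with open(path) as f:
        return json.load(f)

manifest = {
    "phase": "dependency_mapping",
    "completed": ["architecture_scan", "test_discovery"],
    "in_progress": {"refund_flow_trace": "agent_3"},
    "pending": ["api_endpoint_mapping", "config_analysis"],
    "findings_file": "/scratchpad/findings.md",
}

path = os.path.join(tempfile.mkdtemp(), "manifest.json")
save_manifest(path, manifest)
restored = load_manifest(path)  # restarted agents skip completed phases
```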

Technique 5: /compact

A practical Claude Code command. When context fills with verbose discovery output, /compact compresses it — reducing context usage so the session can continue. A pragmatic escape valve, not a replacement for the other techniques.

Throughline

Scratchpads, subagent isolation, phase summaries, and manifests are all ways of externalizing knowledge so it doesn't depend on surviving inside the context window.

5.5

Human Review & Confidence Calibration

At scale, you need to know when to trust the model's output and when to route it to a human reviewer. Aggregate accuracy metrics can mask critical failures.

The Aggregate Metric Trap

Hidden Failure

97% overall accuracy sounds great. But: 99.5% on standard invoices (80% of volume), 82% on handwritten purchase orders, 74% on international documents. Automating based on the aggregate number lets 1 in 4 international documents through with errors.

You must validate accuracy by document type and field segment — not just overall. A single number can't tell you where the system is reliable and where it isn't.

Stratified Random Sampling

Pull samples from each document type and field category, including high-confidence extractions. This catches calibration drift (model becomes overconfident on new formats) and detects novel error patterns not in the original validation set. This is ongoing, not one-time.

Field-Level Confidence Scores

Instead of one score per document, the model outputs a score per field. A low score on "total amount" routes just that field to human review, even if the rest looks fine.

Critical Step — Calibration

Confidence thresholds must be calibrated using labeled validation sets. You don't pick 0.85 as a cutoff because it "feels right." You measure where the model's scores actually correlate with accuracy and set thresholds from that empirical evidence.

Routing Review Attention

Condition                           | Route
Low model confidence                | Human review
Ambiguous / contradictory source    | Human review
High confidence, calibrated segment | Automated + stratified sampling safety net
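Calibration plus routing can be sketched in a few lines. The target accuracy and the labeled data below are invented for illustration; in practice the labeled set comes from your validation process, per segment.

```python
def calibrate_threshold(labeled, target_accuracy=0.98):
    """labeled: list of (confidence, was_correct) pairs from a validation set.
    Return the lowest threshold at which auto-accepted items meet the target."""
    for threshold in sorted({c for c, _ in labeled}):
        kept = [ok for c, ok in labeled if c >= threshold]
        if kept and sum(kept) / len(kept) >= target_accuracy:
            return threshold
    return 1.01  # nothing qualifies: route everything to human review

def route(confidence, threshold, source_ambiguous=False):
    """Field-level routing: ambiguity or low confidence goes to a human."""
    if source_ambiguous or confidence < threshold:
        return "human_review"
    return "automated"  # still subject to stratified sampling

# Toy validation data for one document-type segment.
labeled = [(0.95, True), (0.9, True), (0.85, True), (0.8, False), (0.6, False)]
threshold = calibrate_threshold(labeled)  # empirical, not "feels right"
```

The same calibration runs per segment (standard invoices, handwritten POs, international documents), so each gets its own empirically justified cutoff.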
Throughline

Without calibration, a confidence score is just a number. With calibration, it's a routing decision. Aggregate metrics can lie — stratify by segment to find the truth.

5.6

Provenance & Multi-Source Synthesis

When multiple subagents gather information from different sources, the synthesis agent must combine it without losing track of where things came from.

The Provenance Problem

Subagent A: market size is $4.2B (industry report, 2023). Subagent B: market size is $3.8B (government dataset, 2021). Synthesis agent writes: "The market size is approximately $4 billion."

Three failures: source attribution gone, temporal context gone, conflict silently resolved by averaging.

Structured Claim-Source Mappings

Each subagent returns findings tied to their source — URL, document name, publication date, methodology. The synthesis agent must preserve these mappings through the final output.

{
  "claim": "Market size is $4.2B",
  "source": "Gartner Industry Report 2023",
  "source_url": "https://...",
  "publication_date": "2023-06-15",
  "methodology": "Survey of 500 enterprises"
}

Conflicts: Annotate, Don't Arbitrate

Hard Rule — Conflict Handling

When credible sources disagree, include both values with explicit annotation. The subagent does not pick a winner. The coordinator or end user decides how to reconcile. Conflicts are preserved, never hidden.

Temporal Context Prevents False Contradictions

The $3.8B (2021) and $4.2B (2023) figures might not conflict at all — they could reflect market growth. Without publication dates in the structured output, the synthesis agent treats them as contradictory claims about the same point in time. Requiring dates in subagent outputs prevents this specific class of misinterpretation.
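A synthesis step that annotates rather than arbitrates might look like the sketch below; the grouping logic and claim schema are assumptions for illustration.

```python
def synthesize_metric(claims):
    """claims: dicts with value, source, publication_date.
    Returns all values with provenance; never collapses a conflict."""
    if len({c["value"] for c in claims}) == 1:
        # Sources agree: one value, provenance still preserved.
        return {"value": claims[0]["value"], "sources": claims}
    return {
        "conflict": True,  # annotate; the coordinator or reader reconciles
        "values": [
            {"value": c["value"], "source": c["source"], "date": c["publication_date"]}
            for c in sorted(claims, key=lambda c: c["publication_date"])
        ],
        "note": "Sources disagree; dates may indicate growth rather than error.",
    }

claims = [
    {"value": "$4.2B", "source": "Industry report", "publication_date": "2023-06-15"},
    {"value": "$3.8B", "source": "Government dataset", "publication_date": "2021-02-01"},
]
result = synthesize_metric(claims)
```

Sorting by publication date surfaces the temporal framing directly: the 2021 and 2023 figures appear in order, so a reader can see possible growth instead of an unexplained "$4 billion approximately."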

Rendering by Content Type

The synthesis output should match the nature of the content: financial data as tables, news as prose, technical findings as structured lists. Don't flatten everything into one uniform format — that loses information and readability.

Report Structure

Pattern — Well-Established vs. Contested

Structure reports with explicit sections distinguishing well-established findings from contested ones. Preserve original source characterizations and methodological context. The reader should always know what's settled and what's debated.

Throughline

Provenance is the chain of custody for every fact. If you can't trace a claim back to its source, with its date and methodology, the synthesis is unreliable — no matter how polished the final output looks.