Anthropic Production Best Practices

The Prompting Playbook

A complete engineering framework to transition LLM prompts safely from prototypes to hardened production systems. Learn how to systematically diagnose degradations, leverage external tools, handle hidden trade-offs, and construct multi-agent self-repair systems.

Presenter: Margot Vanlar (Applied AI Engineer, Anthropic London)

Framework Core: Rigorous Evaluation & Agentic Design

Context: Code with Claude Masterclass

Playbook Architecture Modules

1. Evaluation Rigor

The core scientific compass. Establishing a multi-tier test suite to evaluate regressions, control variables, edge behaviors, and systemic limits before code changes.

2. Prompt Hygiene

Structural cleanups using XML formatting contracts to split roles, policies, guidelines, data inputs, and stop tokens into isolated blocks contextually parsable by LLMs.

3. Deflection Mapping

Eliminating overfitted legacy patches and defensive prompt hacks. Replacing hidden vulnerabilities with explicit trade-off instructions and runtime validation.

4. Agentic Self-Repair

Decomposing monolithic instructions into independent loops. Leveraging a modular Generate-Evaluate-Repair pattern to scale complex reasoning tasks.

Note

Treat every prompt change like a production code change: isolate the hypothesis, run the eval suite, inspect failures, and only then decide whether to patch the prompt, add a tool, or decompose the workflow.

The Foundation: Systematic Evaluation Suites

Requirement

Moving prompts into production or migrating architectures without structured telemetry leads to complete failure blindness. When a model migration causes production degradation, evaluations determine whether the new model behaves differently but is tuneable, or if it lacks the inherent baseline capability required.

The Three Pillars of an Effective Eval Suite

1. Control Cases

Unambiguous, standard requests that the system should pass consistently. Used to maintain baseline stability and detect global performance regressions.

2. Edge Cases

Scenarios historically verified to trigger failures. Including explicit historical failure modes in your test matrices ensures past bugs do not slip back into active builds.

3. Capability Boundaries

Hard checkpoints built to verify if the LLM precisely understands the extent of its capabilities. Teaches the system when to refuse answers or hand off workflows to a human agent.

Case Setup: Meridian Mobile Customer Support Bot

A miniature real-world production evaluation script utilizing 5 targeted test assertions to debug an enterprise customer relations agent:

Test ID	Case Category	User Scenario / Inquiry	Expected Guardrail/Assertion
TC-001	Control Case	“What's the data limit in the basic plan?”	Accurately read static database entries without formatting variance.
TC-002	Edge Case (Math)	“What if I switch my plan halfway through the month? What will my bill look like?”	Perform explicit proration calculations down to deterministic line items.
TC-003	Policy Guardrail	Complex internal account policies.	Adhere directly to corporate legal policies without improvising instructions.
TC-004	Boundary Handoff	Direct, irresolvable billing conflict or customer system discrepancy.	Stop automated diagnosis and cleanly escalate the ticket to a human manager.
TC-005	Information Withholding	Legacy subscriber query: “How much hotspot data is on my unlimited plan?”	Check grandfathered user criteria rather than applying generic active rules.

The Production Playbook: Debugging and Optimization

Strategic execution

When running a multi-collaborator prompt long-term, prompts grow complex, full of legacy fixes, unowned patches, and messy paragraphs. Approaching optimizations sequentially ensures systematic issue diagnosis.

Phase 1: General Hygiene and Structural XML ScaffoldingV0 Eval: Fails 4/5 cases

Identified Prompt Anti-Patterns:

Persona Lies: Instructing the bot that it is a physical living human support agent (untrue, causes logic breaks).
Redundant Technical Bloat: Raw text scraped directly from web source code including metadata, hero images, cookies, and tracking pixel text.
Monolithic Text Blobs: Shuffling corporate role definitions, guidelines, technical constraints, data payloads, and tone criteria into a single paragraph.

Engineering Best Practice:

If you cannot cleanly isolate guidelines from active runtime data when skimming the text, the model cannot separate them either. Enforce semantic structure with strict XML tagging blocks.

<role> You are an automated helper... </role> <general_guidelines> ... </general_guidelines> <policy> ... </policy> <tone_of_voice> ... </tone_of_voice>

Systematic Impact: Instantly resolves baseline classification and prepaid plan test variants.

Phase 2: Output Contracts & Harness ConstraintsStructural Security

Enforcing consistent output constraints requires structural reinforcement at the prompt boundary and the runtime pipeline. Forcing structured arrays or JSON shapes via prompting alone can lead to syntax bugs downstream.

Prompt Boundaries: Mandate precise opening and closing wrap tokens (e.g., instructing the system to dump all outputs inside custom <response> tags).
API Harness Controls: Inject a literal Stop Sequence array parameter directly into the API request payload (e.g., stopping the generation the moment </response> is generated) to prevent token streaming waste.
Guaranteed Enforcement: For highly complex structural configurations, prioritize developer-centric native API features like Structured Outputs to guarantee compliance.

Phase 3: Deep-Diving Isolated Deflection Failures

Applying tactical remedies to the three critical design deadlocks discovered inside Meridian Mobile's evaluation framework:

Failure Mode A

The Hotspot Paradox (Withholding Information)

The system completely refused to answer a grandfathered user's inquiries regarding account metrics, redirecting them to an external URL link instead.

- "Never give a customer the wrong details; point them to the URL instead." [Legacy Patch]+ "Grandfathered users have unique terms. Analyze the ingested context payload block directly as your singular absolute source of truth."

Root Cause: Overfitting an aggressive defensive patch built for an older, less capable model. Newer generation LLMs possess superior instruction-following tracking, meaning legacy patches often cause them to withhold valid data.

Fix Strategy: Use prompt version control to document why defensive instructions are added so they can be pruned as models improve.

Failure Mode B

Instructions ≠ Capability (Proration Calculation)

The agent responded with vague estimates or incorrect values when asked to calculate mid-cycle subscription change charges.

- "It is critical that you always calculate all prorated statements correctly!"+ [Inject tool definition schema via API invocation]+ "Whenever calculating bill items, utilize the calculate_proration tool."

Root Cause: Telling an LLM to “do a good job” or “be accurate” does not magically give it new capabilities like mental math. Instructions cannot overcome foundational reasoning limits.

Fix Strategy: Offload complex operations to deterministic code functions by registering structured tool schemas.

Failure Mode C

One-Sided Trade-offs (Billing Conflict Refusals)

The bot actively avoided escalating clear billing errors to humans, instead arguing with customers and attempting to diagnose errors itself.

- "Avoid transferring to support specialists. Every escalation costs our team $8 and hurts our resolution metrics."+ "Escalations cost $8. However, failing to resolve a valid system billing error costs us an automatic refund and permanently destroys customer trust. Balance both factors."

Root Cause: The prompt only outlined the penalty of escalating, causing the agent to overfit and avoid it at all costs. Advanced models need to understand both sides of a business trade-off to make intelligent contextual balancing decisions.

Building New Agentic Workflows from Scratch

Architecture 0-to-1

When engineering highly complex, constraint-heavy automated use cases (e.g., generating an enterprise-grade retail workforce schedule for 8 distinct staff members governed by strict availability, fatigue rules, and shift demand requirements), a single monolithic prompt will often fail. Optimizing the combination of prompt design, model size, and systemic architecture helps find the right balance between cost and latency.

The Optimization Journey: Five Systematic Experiments

The engineering team ran iterative tests to evaluate performance across 5 different implementation approaches, tracking constraint violations via a programmatic Python validation script:

Trial	Architecture Setup	Model Target	Reliability Outcome	Token / Latency Trade-offs
1	Baseline Simple Prompt Clean structure, basic schema constraints.	Sonnet 4.6	100% Failure Rate High constraint violations across all runs.	Low relative token draw; minimal useful output.
2	Baseline Simple Prompt Identical prompt configuration.	Opus 4.7	100% Failure Rate Total constraint violations dropped significantly, but still failed to pass.	Increased internal reasoning overhead.
3	Adaptive Thinking Run Allow model to scale its internal runtime reasoning.	Opus 4.7	100% Success Rate Reliably generated valid, compliant schedules.	Severe Cost Penalty: Tripled token consumption and tripled response latency (~100s).
4	Advanced “Check Work” Prompt Prompt instruction heavily padded with self-checking commands.	Sonnet 4.6	60% Failure Rate Passed 2/5 trials. Hit generation max-token ceilings before completing tasks.	High token waste due to repetitive internal generation cycles.
5	Agentic Loop (Decomposed Flow) Split into 3 simple, independent, targeted prompts.	Sonnet 4.6	100% Success Rate Passed all test cases consistently.	Optimal Mix: Lower combined token volume and significantly faster execution than Trial 3 or 4.

The Decomposed Agentic Architecture: Generate-Evaluate-Repair Loop

Instead of forcing a single prompt to balance layout creation, rules checking, and error correction simultaneously, isolate individual responsibilities into an autonomous state-machine loop:

The Generator Specialist

Ingests worker availability inputs and quickly constructs an initial baseline draft of the 7-day employee schedule shift array.

The Evaluator Specialist (LLM Judge)

An independent prompt that audits the generated draft line-by-line against a checklist of operational rules. It logs any errors found and compiles specific contextual evidence for each violation.

The Repairer Specialist

Receives the original draft along with the Evaluator's structured error log. It applies targeted corrections to resolve the noted violations before passing the updated draft back to the Evaluator.

The Operational Advantage of LLM-Based Loops

While strict rules can be verified using rigid backend code scripts (e.g., a native Python function checking matching variables), migrating evaluation to an LLM-driven loop offers a major business advantage:

Dynamic Soft Constraints Handling: Enables non-technical staff to inject changing ad-hoc rules at runtime (e.g., “Harry and Sally shouldn't work the same shift this week,” or “Add an extra backup shift on Wednesday night”) without requiring a software engineer to rewrite the core application source code.