AI Engineering O'REILLY 2025

Masterclass Synthesis & Guided Infographics by Chip Huyen

Executive Summary

The Architectural Transition to System-Centric AI

In the modern era of foundation models, traditional “model-centric” engineering has been superseded by system-centric AI engineering. AI systems are now built by dynamically linking, wrapping, and augmenting pre-trained foundation models.

The ultimate challenge in production is not model capability, but building deterministic guarantees around highly probabilistic text predictors. This requires structured evaluation-driven frameworks, real-time memory optimizations, and careful design of context architectures.

3 Layers: Modern Stack

10 Chapters: Operational Guides

Evaluation: Core Paradigm

The Core Thesis

“FMs are a programmable software platform. However, because they are probabilistic rather than deterministic, standard integration methods fail. Building reliable applications requires wrapping models in robust validation architecture — including routers, dual guardrails, prompt monitors, and semantic caching.”

Deterministic Input Preprocessing
Probabilistic Model Evaluation
Guardrailed Structured Outputs
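
The three stages above can be sketched as a single wrapper around a model call. This is a minimal illustration, not the book's implementation: `guarded_call`, the `model` callable, the required `"answer"` key, and the exact-match dictionary (standing in for a real semantic cache) are all hypothetical names and simplifications.

```python
import json
from typing import Callable, Dict


def guarded_call(
    prompt: str,
    model: Callable[[str], str],  # hypothetical: any function mapping prompt -> raw text
    cache: Dict[str, dict],       # exact-match stand-in for a semantic cache
    max_retries: int = 2,
) -> dict:
    # 1. Deterministic input preprocessing: collapse whitespace, reject empty input.
    cleaned = " ".join(prompt.split())
    if not cleaned:
        raise ValueError("empty prompt")

    # Cache lookup on a normalized key (a real semantic cache would compare embeddings).
    key = cleaned.lower()
    if key in cache:
        return cache[key]

    # 2. Probabilistic model call, retried when the output fails validation.
    for _ in range(max_retries + 1):
        raw = model(cleaned)
        # 3. Output guardrail: accept only a JSON object with an "answer" field.
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and "answer" in parsed:
            cache[key] = parsed
            return parsed

    raise RuntimeError("model output failed validation")
```

The deterministic steps (normalization, cache lookup, schema check) sit on both sides of the one probabilistic step, so a malformed generation is retried or rejected rather than passed downstream.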
AUTHOR BIOGRAPHY

Chip Huyen (Co-founder, Claypot AI & Instructor at Stanford)

System Architecture & Pipeline Mindmap


3-Layer Stack Model
Core FM, Prompting, RAG, Finetune, Continuous Eval, Databases, Safety Gateway


DESIGN ADVICE

Maximize systemic modularity before changing weights.

Interactive Architecture Design

The Model Adaptation Wizard

Determine whether to use Prompting, RAG, or Fine-tuning based on your operational constraints.

RECOMMENDED PATTERN

Prompt Engineering

Based on your constraints, starting with pure Prompt Engineering is the most logical path. It avoids complex infrastructure while proving system viability.

Alternative Pathway

RAG if dynamic context is needed later.

Primary Bottleneck Risk

Fragility across base model weight revisions.

Huyen's Advice: Start simple!

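
The wizard's actual decision rules are not shown on this page, so the following is only a hedged sketch of how such a recommendation could be encoded. The `Constraints` fields and the ordering of the checks are assumptions chosen to match the advice above: default to prompting, move to RAG when dynamic context is needed, and change weights last.

```python
from dataclasses import dataclass


@dataclass
class Constraints:
    # All three fields are hypothetical inputs, not the wizard's real questions.
    needs_dynamic_context: bool        # answers depend on fresh or private data
    prompting_meets_quality_bar: bool  # prompting alone already passes evaluation
    has_labeled_examples: bool         # training data exists for fine-tuning


def recommend(c: Constraints) -> str:
    # Dynamic context is a knowledge problem, so retrieval comes first.
    if c.needs_dynamic_context:
        return "RAG"
    # Change weights only when prompting falls short and data is available.
    if not c.prompting_meets_quality_bar and c.has_labeled_examples:
        return "Fine-tuning"
    # Default: the simplest option that proves system viability.
    return "Prompt Engineering"


print(recommend(Constraints(False, True, False)))
```

Because fine-tuning is reached only when the cheaper options are ruled out, the function mirrors the "start simple" guidance rather than treating the three patterns as equals.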
Compute & Hardware Mathematics

Inference Memory & KV Cache Calculator

Calculate the GPU VRAM footprint required to run your foundation models in production.

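Model size: 8B parameters (slider range 1B to 180B)
Batch size: 1 (slider range 1 to 128)
Context length: 4,096 tokens (slider range 512 to 32k)
KV heads: typically 8 for Llama-3-8B (Grouped Query Attention).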
HARDWARE FOOTPRINT
Model Weights VRAM

4.00 GB

KV Cache Memory (Peak)

0.54 GB

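KV cache = 2 × B × L × H × 128 × S × 2 bytes, where the leading 2 covers keys and values, B is the batch size, L the number of layers, H the number of KV heads, 128 the head dimension, S the context length in tokens, and the trailing 2 the bytes per stored value (16-bit precision).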
Total VRAM Needed

4.54 GB

Can run on consumer hardware (RTX 3060/4060 or MacBook M-series).
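
The figures above can be reproduced with a short calculation. This is a sketch under stated assumptions: 32 layers for Llama-3-8B, 0.5 bytes per parameter (4-bit weights, inferred from the 4.00 GB shown for an 8B model), and decimal gigabytes (1 GB = 10^9 bytes). The function names are illustrative only.

```python
GB = 1e9  # decimal gigabytes, which matches the rounded figures on the page


def weights_gb(params_billion: float, bytes_per_param: float = 0.5) -> float:
    # Assumption: 0.5 bytes per parameter, i.e. 4-bit quantized weights.
    return params_billion * 1e9 * bytes_per_param / GB


def kv_cache_gb(
    batch: int,
    layers: int,
    kv_heads: int,
    seq_len: int,
    head_dim: int = 128,
    bytes_per_value: int = 2,
) -> float:
    # 2 (keys and values) x B x L x H x head_dim x S x bytes per value
    total_bytes = 2 * batch * layers * kv_heads * head_dim * seq_len * bytes_per_value
    return total_bytes / GB


weights = weights_gb(8)
kv = kv_cache_gb(batch=1, layers=32, kv_heads=8, seq_len=4096)
print(f"weights={weights:.2f} GB  kv_cache={kv:.2f} GB  total={weights + kv:.2f} GB")
# prints: weights=4.00 GB  kv_cache=0.54 GB  total=4.54 GB
```

The KV cache grows linearly with both batch size and context length, so raising the batch to 128 at the same context would multiply the 0.54 GB figure by 128 while the weight footprint stays fixed.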

Masterclass Resource

Chapter-by-Chapter Explorer

Click to expand any chapter and study concepts, analogies, and system formulas.