Mada Tools

Executive Summary

The Architectural Transition to System-Centric AI

In the modern era of foundation models, traditional “model-centric” engineering has been superseded by system-centric AI engineering. AI systems are now built by dynamically linking, wrapping, and augmenting pre-trained foundation models.

The ultimate challenge in production is not model capability, but building deterministic guarantees around highly probabilistic text predictors. This requires structured evaluation-driven frameworks, real-time memory optimizations, and careful design of context architectures.

3 Layers: Modern Stack

10 Chapters: Operational Guides

Evaluation: Core Paradigm

The Core Thesis

“FMs are a programmable software platform. However, because they are probabilistic rather than deterministic, standard integration methods fail. Building reliable applications requires wrapping models in robust validation architecture — including routers, dual guardrails, prompt monitors, and semantic caching.”
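Semantic caching, one of the wrappers named in the thesis, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `SemanticCache` class, the bag-of-words `embed` function (a real system would use a learned embedding model), and the 0.8 similarity threshold are hypothetical and not taken from the source.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical cache that returns a stored response for a similar query."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

    def get(self, query: str):
        """Return the best cached response, or None on a cache miss."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(q, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache()
cache.put("what is the capital of france", "Paris")
hit = cache.get("What is the capital of France")
miss = cache.get("how tall is mount everest")
```

On a hit the model call is skipped entirely; on a miss the caller would invoke the model and store the result with `put`. The threshold trades hit rate against the risk of serving an answer to a question that only looks similar.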

Deterministic Input Preprocessing

Probabilistic Model Evaluation

Guardrailed Structured Outputs
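The three stages above can be sketched as a small pipeline: a deterministic preprocessing step, a model call, and an output guardrail that rejects anything not matching an expected schema. This is a minimal sketch under stated assumptions: `fake_model` is a stand-in for a real foundation model call, and the schema (`answer`, `confidence`) and retry count are hypothetical choices, not part of the source.

```python
import json
import re


def preprocess(user_input: str, max_chars: int = 2000) -> str:
    """Deterministic input stage: collapse whitespace and cap the length."""
    cleaned = re.sub(r"\s+", " ", user_input).strip()
    if not cleaned:
        raise ValueError("empty input")
    return cleaned[:max_chars]


def fake_model(prompt: str) -> str:
    """Stand-in for a probabilistic model call (assumption); returns JSON text."""
    return json.dumps({"answer": prompt.upper(), "confidence": 0.9})


def validate_output(raw: str) -> dict:
    """Output guardrail: parse the JSON and check the hypothetical schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise ValueError("missing or non-string 'answer'")
    confidence = data.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0 <= confidence <= 1:
        raise ValueError("'confidence' must be a number in [0, 1]")
    return data


def run_pipeline(user_input: str, model=fake_model, retries: int = 2) -> dict:
    """Preprocess once, then retry the model until its output validates."""
    prompt = preprocess(user_input)
    last_error = None
    for _ in range(retries + 1):
        try:
            return validate_output(model(prompt))
        except ValueError as err:
            last_error = err
    raise RuntimeError(f"model output failed validation: {last_error}")


result = run_pipeline("  hello   world ")
```

The design choice worth noting is that only the middle stage is probabilistic; the stages on either side are ordinary deterministic code that can be unit tested.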

AUTHOR BIOGRAPHY

Chip Huyen (Co-founder, Claypot AI & Instructor at Stanford)

System Architecture & Pipeline Mindmap


3-Layer Stack Model


DESIGN ADVICE

Maximize systemic modularity before changing weights.

Interactive Architecture Design

The Model Adaptation Wizard

Determine whether to use Prompting, RAG, or Fine-tuning based on your operational constraints.

1. What are your real-time knowledge/data requirements?

2. What are your requirements for domain-specific formatting and output control?

3. What is your engineering budget and hardware capacity?

4. What are your end-to-end latency constraints?

RECOMMENDED PATTERN

Prompt Engineering

Based on your constraints, starting with pure Prompt Engineering is the most logical path. It avoids complex infrastructure while proving system viability.

Alternative Pathway

RAG if dynamic context is needed later.

Primary Bottleneck Risk

Fragility across base model weight revisions.

Huyen's Advice: Start simple!
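The wizard's four questions can be read as a small decision function. The sketch below is illustrative only: the `Constraints` fields and the ordering of the rules are assumptions, since the source shows just one outcome (Prompt Engineering, with RAG as the alternative). The ordering follows the design advice to prefer modular changes before changing weights.

```python
from dataclasses import dataclass


@dataclass
class Constraints:
    needs_fresh_knowledge: bool      # question 1: real-time knowledge/data
    needs_strict_output_style: bool  # question 2: domain formatting and output control
    has_training_budget: bool        # question 3: engineering budget and hardware
    low_latency_required: bool       # question 4: end-to-end latency


def recommend(c: Constraints) -> tuple[str, list[str]]:
    """Return a (pattern, caveats) pair using hypothetical rules."""
    caveats: list[str] = []
    if c.needs_fresh_knowledge:
        pattern = "RAG"
        if c.low_latency_required:
            caveats.append("Retrieval adds a step to every request.")
    elif c.needs_strict_output_style and c.has_training_budget:
        pattern = "Fine-tuning"
        caveats.append("Try prompting first; change weights last.")
    else:
        pattern = "Prompt Engineering"
        caveats.append("Fragility across base model weight revisions.")
        caveats.append("RAG if dynamic context is needed later.")
    return pattern, caveats


pattern, caveats = recommend(Constraints(False, False, False, False))
```

With no special requirements the function falls through to Prompt Engineering, matching the recommendation shown above.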

Compute & Hardware Mathematics

Inference Memory & KV Cache Calculator

Calculate the GPU VRAM footprint required to run your foundation models in production.

Model Parameters (Billion): 8B (range 1B to 180B)

Precision (Bits per weight)

Concurrent Batch Size (B): 1 (range 1 to 128)

Max Sequence Context (S): 4,096 tokens (range 512 to 32k)

Transformer Layers (L)

KV Heads, GQA / MQA (H): typically 8 for Llama-3-8B (Grouped Query Attention).

HARDWARE FOOTPRINT

Model Weights VRAM

4.00 GB

KV Cache Memory (Peak)

0.54 GB

KV cache bytes = 2 (keys and values) × B × L × H × 128 (head dimension) × S × 2 bytes per element (16-bit)

Total VRAM Needed

4.54 GB

Can run on consumer hardware (RTX 3060/4060 or MacBook M-series).
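The footprint figures above can be reproduced with a few lines of arithmetic. This is a sketch, not the page's actual calculator: the function names are hypothetical, and the 4-bit precision and 32 transformer layers are assumed values chosen because, together with an 8B model, batch size 1, 4,096 tokens, and 8 KV heads, they yield the displayed results. Gigabytes are taken as 10^9 bytes.

```python
GB = 1e9  # decimal gigabytes (assumption that matches the displayed figures)


def weights_bytes(params_billion: float, bits_per_weight: int) -> float:
    """Memory for model weights: parameter count times bytes per weight."""
    return params_billion * 1e9 * bits_per_weight / 8


def kv_cache_bytes(batch: int, layers: int, kv_heads: int, seq_len: int,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Peak KV cache: 2 (K and V) x B x L x H x head_dim x S x bytes per element."""
    return 2 * batch * layers * kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed inputs: 8B parameters at 4-bit, B=1, L=32, H=8, S=4096.
weights_gb = weights_bytes(8, 4) / GB
kv_gb = kv_cache_bytes(batch=1, layers=32, kv_heads=8, seq_len=4096) / GB
total_gb = weights_gb + kv_gb

print(f"weights {weights_gb:.2f} GB, KV cache {kv_gb:.2f} GB, total {total_gb:.2f} GB")
# prints: weights 4.00 GB, KV cache 0.54 GB, total 4.54 GB
```

Because the KV cache scales linearly with both batch size and sequence length, raising either one can make the cache, rather than the weights, the dominant term.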

Masterclass Resource

Chapter-by-Chapter Explorer

Each chapter entry covers its concepts, analogies, and system formulas.