Understanding Transformers: From First Principles to Production GPTs
Richard Feynman once said, "What I cannot create, I do not understand." Andrej Karpathy took that literally — spending a decade distilling GPTs down to their bare essence, ultimately fitting the entire algorithm into 200 lines of pure Python. No PyTorch. No NumPy. Just math.
This post follows that same philosophy. We start from the simplest possible idea — predicting the next thing in a sequence — and build, layer by layer, until we have a working Transformer. Along the way, we trace the legendary Karpathy progression: micrograd → makemore → minGPT → nanoGPT → microGPT → nanochat, where each project peels away one more layer of abstraction.
By the end, you won't just understand Transformers. You'll understand them the way Feynman would have wanted — by building them yourself.
Part 1: The Simplest Possible Idea
What Are We Actually Doing?
Forget "artificial intelligence" for a moment. Here's the actual problem:
Given some sequence of symbols, predict what comes next.
That's it. That's the entire game. When you type "The capital of France is" and GPT responds "Paris", the model has simply learned that in the ocean of text it trained on, the token "Paris" tends to follow that particular sequence of tokens.
The Feynman Reframe
A GPT is not a "thinking machine." It's a function that maps a sequence of tokens to a probability distribution over the next token. The "intelligence" is a byproduct of learning statistical patterns across billions of documents. The mechanism is fully contained in the math below.
Starting Point: The Bigram Model
The absolute simplest language model counts pairs. Given the character "h", how often does "e" come next? How often does "a" come next? The bigram model is just a lookup table of these frequencies.
```python
# The simplest possible language model: bigrams
import random
from collections import Counter

text = "hello world hello there"
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}  # char -> int
itos = {i: c for c, i in stoi.items()}      # int -> char

# Count every pair of consecutive characters
bigrams = Counter()
for ch1, ch2 in zip(text, text[1:]):
    bigrams[(ch1, ch2)] += 1

# Now we can "generate" by sampling from the distribution
# after each character. This is a language model!
```

This is dumb, but it's already the core idea. Every improvement from here — attention, multi-head attention, residual connections, layer norm — is just a better way to model the conditional probability P(xₜ | x₁, …, xₜ₋₁).
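To actually generate from the table, here is a minimal dependency-free sketch. It rebuilds the counts from the snippet above so it runs standalone; `sample_next` is our own illustrative helper, not part of any library.

```python
# Sampling from bigram counts: a minimal, dependency-free sketch.
# (Rebuilds the counts from the snippet above so it runs standalone.)
import random
from collections import Counter

text = "hello world hello there"
bigrams = Counter(zip(text, text[1:]))  # count consecutive character pairs

def sample_next(ch, rng=random):
    """Sample the next character given the current one, weighted by counts."""
    candidates = [(b, n) for (a, b), n in bigrams.items() if a == ch]
    chars, counts = zip(*candidates)
    return rng.choices(chars, weights=counts, k=1)[0]

random.seed(0)
generated = "h"
for _ in range(10):
    generated += sample_next(generated[-1])
print(generated)  # statistically plausible gibberish built one pair at a time
```

Every character is chosen using only the single previous character, which is exactly the bigram model's limitation.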
The bigram model has a fatal flaw: it only looks at one previous token. Real language has long-range dependencies. "The cat that sat on the mat was sleeping" — the verb "was" depends on "cat", not "mat". We need a model that can look at the entire context.
Part 2: Teaching Machines to Look Back — The Attention Mechanism
The Problem With Looking at Everything
Suppose we have a sequence of tokens. We want each token to "gather information" from all previous tokens. But how much information should it gather from each one?
This is the question self-attention answers.
The Simplest Attention: Averaging
The crudest version: just average all previous token representations.
```python
import torch

# Suppose we have 4 tokens, each represented by a 2D vector
x = torch.tensor([
    [1.0, 0.0],  # token 0
    [0.0, 1.0],  # token 1
    [1.0, 1.0],  # token 2
    [0.5, 0.5],  # token 3
])
T = x.shape[0]  # sequence length

# Version 1: naive averaging with a loop
out = torch.zeros_like(x)
for i in range(T):
    out[i] = x[:i+1].mean(dim=0)  # average all tokens up to position i
print(out)
# token 0 sees only itself
# token 1 sees the average of tokens 0,1
# token 2 sees the average of tokens 0,1,2
# token 3 sees the average of all four
```

This works but is terrible — every past token gets equal weight. The word "cat" should matter more than "the" when predicting the verb. We need weighted averaging, where the model learns which tokens to attend to.
The Matrix Multiply Trick
Here's a beautiful insight Karpathy emphasizes in his Zero to Hero lectures. That averaging loop above? It's secretly a matrix multiply:
```python
# Version 2: the same thing as a matrix multiply
import torch.nn.functional as F

# Create a lower-triangular matrix of ones
tril = torch.tril(torch.ones(T, T))
# Normalize each row to sum to 1
weights = tril / tril.sum(dim=1, keepdim=True)
print(weights)
# tensor([[1.0000, 0.0000, 0.0000, 0.0000],
#         [0.5000, 0.5000, 0.0000, 0.0000],
#         [0.3333, 0.3333, 0.3333, 0.0000],
#         [0.2500, 0.2500, 0.2500, 0.2500]])
out2 = weights @ x  # matrix multiply!
# out2 is identical to out from the loop version
```

The lower-triangular structure enforces causality — token 2 can only attend to tokens 0, 1, and 2 (not the future token 3). This is the "masked" in masked self-attention.
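The equivalence is easy to verify without any framework at all. A pure-Python check of the same trick, using toy numbers matching the example above:

```python
# Checking the averaging-as-matmul trick in pure Python (no torch needed).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
T = len(x)

# Loop version: out[i] = mean of x[0..i]
loop_out = [[sum(row[c] for row in x[:i + 1]) / (i + 1) for c in range(2)]
            for i in range(T)]

# Matrix version: row-normalized lower-triangular weights, then weights @ x
weights = [[(1.0 / (i + 1) if j <= i else 0.0) for j in range(T)]
           for i in range(T)]
mat_out = [[sum(weights[i][j] * x[j][c] for j in range(T)) for c in range(2)]
           for i in range(T)]

assert all(abs(a - b) < 1e-9 for ra, rb in zip(loop_out, mat_out)
           for a, b in zip(ra, rb))
print(mat_out[3])  # [0.625, 0.625], the average of all four tokens
```

Same numbers, two views: a per-position loop, or one triangular matrix multiply.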
Now the key leap: what if those weights weren't uniform? What if the model could learn which tokens to pay attention to?
Version 3: Learned Attention with Q, K, V
This is where the magic happens. Each token produces three vectors:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information should I pass along?"
The attention score between tokens i and j is just the dot product of token i's query with token j's key. High dot product = high relevance = pay more attention.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

torch.manual_seed(42)
B, T, C = 1, 4, 8  # batch, time (sequence length), channels (embedding dim)
x = torch.randn(B, T, C)

head_size = 4
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x)    # (B, T, head_size)
q = query(x)  # (B, T, head_size)
v = value(x)  # (B, T, head_size)

# Compute attention scores: how much should each token attend to each other token?
# This is Q @ K^T — a dot product between every pair of query and key vectors
weights = q @ k.transpose(-2, -1)  # (B, T, T)

# Scale to prevent large values (which would make softmax too "peaky")
weights = weights / math.sqrt(head_size)

# Mask: prevent attending to future tokens (causal / autoregressive)
tril = torch.tril(torch.ones(T, T))
weights = weights.masked_fill(tril == 0, float('-inf'))

# Softmax: convert scores to probabilities (each row sums to 1)
weights = F.softmax(weights, dim=-1)

# Weighted sum of values
out = weights @ v  # (B, T, head_size)
```

Feynman Analogy: The Library
Imagine a library. You walk in with a question (your Query). Every book on the shelf has a title (its Key) and contents (its Value). You compare your question against every title — the ones that match best, you read more carefully. The final answer is a weighted blend of the contents of all relevant books.
The Attention Formula
Everything above collapses into one beautiful equation:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Let's unpack each piece:
| Component | What It Does | Analogy |
|---|---|---|
| QKᵀ | Compute relevance scores between all query-key pairs | "How well does my question match each book title?" |
| √dₖ | Scale factor to prevent dot products from exploding | "Normalize the scoring so it doesn't get too extreme" |
| softmax | Convert scores to probabilities (0 to 1, summing to 1) | "Turn raw scores into a proper attention distribution" |
| × V | Weighted sum of value vectors using those probabilities | "Blend the book contents according to relevance" |
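To see the formula produce numbers, here is a dependency-free evaluation on two toy tokens (all values invented purely for illustration):

```python
# Scaled dot-product attention by hand, for two tokens with d_k = 2.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Q, K, V rows are per-token vectors (toy numbers)
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
d_k = 2

# scores = Q Kᵀ / √d_k
scores = [[sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d_k) for k in K]
          for q in Q]
attn = [softmax(row) for row in scores]  # each row sums to 1
result = [[sum(a * v[c] for a, v in zip(row, V)) for c in range(2)]
          for row in attn]

print(result[0])  # token 0's output: mostly V[0], with a little V[1] blended in
```

Token 0's query matches key 0 best, so its output is dominated by V[0] — the "blend the book contents according to relevance" step, in raw arithmetic.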
Why the √dₖ Scaling?
This is one of those details that seems arbitrary but matters a lot. When the dimension dₖ is large, dot products tend to grow proportionally (roughly like √dₖ in magnitude). Large dot products push softmax into regions where its gradients are tiny — the model can barely learn. Dividing by √dₖ keeps the variance of the dot products at approximately 1, keeping softmax in its "useful" region.
```python
# Demonstrating why scaling matters
d_k = 64
q = torch.randn(1, d_k)
k = torch.randn(100, d_k)

# Without scaling: dot products have high variance
unscaled = q @ k.T
print(f"Unscaled std: {unscaled.std():.2f}")  # ~8.0 for d_k=64

# With scaling: variance ≈ 1
scaled = unscaled / math.sqrt(d_k)
print(f"Scaled std: {scaled.std():.2f}")  # ~1.0

# Softmax on unscaled: almost one-hot (extreme)
print(F.softmax(unscaled, dim=-1).max())  # ~0.99
# Softmax on scaled: smooth distribution (learnable)
print(F.softmax(scaled, dim=-1).max())  # ~0.05
```

Part 3: Multi-Head Attention — Looking at Multiple Things at Once
A single attention head learns one type of relationship. But language has many simultaneous patterns: syntactic structure, semantic meaning, positional proximity, coreference. Multi-head attention runs several attention heads in parallel, each learning different patterns.
MultiHead(Q, K, V) = Concat(head₁, …, headₕ) W^O

where each head computes:

headᵢ = Attention(Q W^Q_i, K W^K_i, V W^V_i)
In code, this is elegant — we reshape the embedding into multiple "heads" and run attention on each independently:
```python
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        assert self.head_dim * num_heads == embed_dim, "embed_dim must be divisible by num_heads"
        # Single linear projection for Q, K, V (more efficient than three separate ones)
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim, bias=False)
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        # Project to Q, K, V and reshape into multiple heads
        qkv = self.qkv(x)  # (B, T, 3*C)
        qkv = qkv.reshape(B, T, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, B, num_heads, T, head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]
        # Scaled dot-product attention (per head, in parallel)
        scale = math.sqrt(self.head_dim)
        attn_scores = (q @ k.transpose(-2, -1)) / scale  # (B, H, T, T)
        # Causal mask: prevent attending to future tokens
        mask = torch.tril(torch.ones(T, T, device=x.device))
        attn_scores = attn_scores.masked_fill(mask == 0, float('-inf'))
        attn_weights = F.softmax(attn_scores, dim=-1)  # (B, H, T, T)
        # Apply attention to values
        out = attn_weights @ v  # (B, H, T, head_dim)
        # Concatenate heads and project
        out = out.transpose(1, 2).reshape(B, T, C)  # (B, T, C)
        return self.out_proj(out)
```

Why Multiple Heads Work
Think of it like reading a sentence with different "lenses":
- Head 1 might learn syntactic dependencies (subject ↔ verb agreement)
- Head 2 might track positional proximity (nearby words)
- Head 3 might capture semantic relationships (synonyms, topics)
- Head 4 might learn coreference ("it" → "the cat")
Each head has its own Q, K, V projections, so each develops its own specialization. The final linear projection learns to combine their outputs.
Part 4: The Full Transformer Block
Attention alone isn't enough. We also need:
- A feed-forward network (MLP) — for per-token computation and "thinking"
- Residual connections — so information can flow through unchanged
- Layer normalization — to keep activations stable during training
Here's the complete block:
```python
class TransformerBlock(nn.Module):
    """
    A single Transformer block: the fundamental repeating unit.
    Stack N of these and you have GPT.
    """
    def __init__(self, embed_dim: int, num_heads: int, ff_dim: int = None):
        super().__init__()
        ff_dim = ff_dim or 4 * embed_dim  # convention: FFN is 4x wider
        # Layer norms (pre-norm architecture — more stable than post-norm)
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)
        # Multi-head self-attention
        self.attn = MultiHeadSelfAttention(embed_dim, num_heads)
        # Feed-forward network (the "thinking" part)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),
            nn.Linear(ff_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention with residual connection.
        # "x + ..." means the original signal passes through unchanged,
        # and attention only adds its contribution on top
        x = x + self.attn(self.ln1(x))
        # MLP with residual connection
        x = x + self.mlp(self.ln2(x))
        return x
```

Residual Connections: The Highway
Without residual connections, signals must pass through every layer to reach the output. This makes deep networks hard to train — gradients either explode or vanish. The residual connection x = x + f(x) creates a "highway" that lets the original signal pass through untouched. The layer only needs to learn what to add.
```python
# Without residuals: the signal must survive every transformation
x = layer_3(layer_2(layer_1(x)))  # x is deeply transformed, gradients struggle

# With residuals: the original signal is preserved, layers learn corrections
x = x + layer_1(x)  # layer 1 adds its contribution
x = x + layer_2(x)  # layer 2 adds its contribution
x = x + layer_3(x)  # layer 3 adds its contribution
```

Feynman Analogy: The Editor
Think of a document passing through editors. Without residuals, each editor rewrites the entire document — by the 50th editor, nothing of the original remains. With residuals, each editor only writes suggestions in the margins. The original document survives intact, and the suggestions accumulate.
The Feed-Forward Network: Where "Thinking" Happens
Attention lets tokens communicate — it's the "gathering information" step. The MLP (feed-forward network) lets each token compute — it's the "processing information" step. It projects to a higher dimension (4× by convention), applies a non-linearity, and projects back:

FFN(x) = W₂ · GELU(W₁ x)
The expansion to 4× width gives the network more "room to think" before compressing back. Modern architectures often use SwiGLU instead of GELU (as used in Llama and nanochat):
```python
class SwiGLU(nn.Module):
    """SwiGLU activation: a modern replacement for GELU in the FFN"""
    def __init__(self, embed_dim, ff_dim):
        super().__init__()
        self.w1 = nn.Linear(embed_dim, ff_dim, bias=False)
        self.w2 = nn.Linear(embed_dim, ff_dim, bias=False)  # gate
        self.w3 = nn.Linear(ff_dim, embed_dim, bias=False)

    def forward(self, x):
        return self.w3(F.silu(self.w1(x)) * self.w2(x))
```

Layer Normalization: Keeping Things Stable
Without normalization, activations can grow or shrink as they pass through layers, making training unstable. LayerNorm normalizes each token's activation vector to have zero mean and unit variance:

LayerNorm(x) = (x − μ) / √(σ² + ε) · γ + β

where μ and σ² are the mean and variance over the embedding dimension, and γ, β are learned scale and shift parameters.
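A quick dependency-free check of what that normalization does to one token's vector (a toy example with γ = 1, β = 0, and ε omitted for clarity):

```python
# LayerNorm by hand on a single toy activation vector.
import math

x = [2.0, 4.0, 6.0, 8.0]
mu = sum(x) / len(x)                          # mean
var = sum((v - mu) ** 2 for v in x) / len(x)  # variance
y = [(v - mu) / math.sqrt(var) for v in x]

print([round(v, 3) for v in y])  # centered around 0, rescaled
```

Whatever scale the activations arrive at, each token leaves with zero mean and unit variance.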
A simpler variant, RMSNorm (used by Llama, Karpathy's microGPT, and most modern models), skips the mean centering:
```python
# RMSNorm: simpler and slightly faster than LayerNorm
class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight
```

Part 5: Positional Encoding — Teaching Order to an Orderless System
Here's a subtle but critical problem: attention has no built-in notion of order. If you shuffle the tokens in a sequence, the attention outputs shuffle in exactly the same way — the mechanism never notices that anything moved. It treats the input as a set, not a sequence.
We need to inject positional information. There are three major approaches.
Approach 1: Sinusoidal Encoding (Original Transformer, 2017)
The original paper used fixed sinusoidal functions:
```python
def sinusoidal_positional_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe
```

Approach 2: Learned Positional Embeddings (GPT-2, microGPT)
Just learn a separate embedding for each position, the same way we learn embeddings for tokens:
```python
# Simple learned position embeddings (GPT-2 style)
position_embedding = nn.Embedding(block_size, embed_dim)
# Usage: pos_emb = position_embedding(torch.arange(T))
```

This is what Karpathy uses in microGPT — the wpe (position embedding) matrix. Simple and effective.
Approach 3: RoPE — Rotary Position Embeddings (Modern Standard)
RoPE (used by Llama, Mistral, DeepSeek, and most 2025–2026 models) is elegant: instead of adding position information, it rotates the query and key vectors by an angle proportional to their position. This means relative position is encoded directly in the dot product between Q and K.
```python
def precompute_rope_freqs(dim, max_len, theta=10000.0):
    """Precompute the complex exponentials for RoPE"""
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_len)
    freqs = torch.outer(t, freqs)  # (max_len, dim/2)
    return torch.polar(torch.ones_like(freqs), freqs)  # complex exponentials

def apply_rope(x, freqs):
    """Apply rotary embeddings to an input tensor"""
    # Reshape x into pairs and treat them as complex numbers
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    # Rotate by the position-dependent frequencies
    x_rotated = x_complex * freqs
    # Convert back to real
    return torch.view_as_real(x_rotated).reshape(*x.shape)
```

Why RoPE Won
RoPE has two key advantages over learned embeddings: (1) it generalizes to sequence lengths not seen during training, and (2) the attention score between positions i and j depends only on i − j (their relative position), not on their absolute positions. This is more natural for language — "the word 3 positions ago" is more useful information than "the word at position 47."
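The relative-position property can be checked in pure Python with plain 2-D rotations — a toy sketch of the underlying idea, not the full RoPE implementation:

```python
# RoPE's key property: rotate 2-D query/key vectors by position-dependent
# angles; their dot product then depends only on the offset m - n,
# not on the absolute positions m and n.
import math

def rotate(vec, angle):
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 0.5), (0.3, 2.0)
theta = 0.1  # rotation per position (toy value)

# Same offset (3) at two very different absolute positions:
s1 = dot(rotate(q, 5 * theta), rotate(k, 2 * theta))
s2 = dot(rotate(q, 103 * theta), rotate(k, 100 * theta))
print(round(s1, 6), round(s2, 6))  # identical scores
```

Rotating both vectors by the same extra angle leaves their dot product unchanged, so only the angle difference — the relative position — survives into the attention score.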
Part 6: The Complete GPT — Putting It All Together
Now we stack everything into a full language model:
```python
class GPT(nn.Module):
    """
    A complete GPT language model.
    This is essentially what Karpathy builds in nanoGPT, simplified.
    """
    def __init__(
        self,
        vocab_size: int,
        embed_dim: int = 768,
        num_heads: int = 12,
        num_layers: int = 12,
        block_size: int = 1024,
        dropout: float = 0.0,
    ):
        super().__init__()
        self.block_size = block_size
        # Token and position embeddings
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.position_embedding = nn.Embedding(block_size, embed_dim)
        self.drop = nn.Dropout(dropout)
        # Stack of Transformer blocks
        self.blocks = nn.Sequential(*[
            TransformerBlock(embed_dim, num_heads)
            for _ in range(num_layers)
        ])
        # Final layer norm and output projection
        self.ln_f = nn.LayerNorm(embed_dim)
        self.lm_head = nn.Linear(embed_dim, vocab_size, bias=False)
        # Weight tying: share weights between token embedding and output projection
        # This is a key insight from the "Using the Output Embedding" paper
        self.token_embedding.weight = self.lm_head.weight
        # Initialize weights (important for stable training!)
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        B, T = idx.shape
        assert T <= self.block_size, f"Sequence length {T} exceeds block size {self.block_size}"
        # Create token + position embeddings
        tok_emb = self.token_embedding(idx)  # (B, T, C)
        pos_emb = self.position_embedding(torch.arange(T, device=idx.device))  # (T, C)
        x = self.drop(tok_emb + pos_emb)  # (B, T, C)
        # Pass through all transformer blocks
        x = self.blocks(x)  # (B, T, C)
        # Final layer norm + projection to the vocabulary
        x = self.ln_f(x)  # (B, T, C)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        return logits

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """Autoregressive generation: predict one token at a time"""
        for _ in range(max_new_tokens):
            # Crop to block_size if needed
            idx_cond = idx[:, -self.block_size:]
            # Forward pass
            logits = self(idx_cond)
            # Take the logits at the last position
            logits = logits[:, -1, :] / temperature
            # Optionally apply top-k filtering
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')
            # Sample from the distribution
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            # Append to the sequence
            idx = torch.cat([idx, idx_next], dim=1)
        return idx
```

The Training Loop
Training a GPT is conceptually simple: show it sequences, have it predict each next token, and adjust weights to make better predictions.
```python
def train_gpt(model, train_data, steps=1000, batch_size=32, lr=3e-4, block_size=128):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, betas=(0.9, 0.95), weight_decay=0.1)
    for step in range(steps):
        # Sample a random batch of training sequences
        ix = torch.randint(len(train_data) - block_size, (batch_size,))
        x = torch.stack([train_data[i:i+block_size] for i in ix])      # inputs
        y = torch.stack([train_data[i+1:i+block_size+1] for i in ix])  # targets (shifted by 1)
        # Forward pass: compute logits
        logits = model(x)  # (B, T, vocab_size)
        # Cross-entropy loss: how wrong are our predictions?
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),  # flatten to (B*T, vocab_size)
            y.view(-1),                        # flatten to (B*T,)
        )
        # Backward pass: compute gradients
        optimizer.zero_grad()
        loss.backward()
        # Gradient clipping: prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Update parameters
        optimizer.step()
        if step % 100 == 0:
            print(f"step {step}: loss {loss.item():.4f}")
```

What Is Cross-Entropy Loss?
Cross-entropy measures how "surprised" the model is by the correct answer. If the model assigns high probability to the correct next token, loss is low. If it assigns low probability, loss is high. Training = minimize surprise.
Concretely: if the model says the token "Paris" has probability 0.9 of being next, and that's correct, the loss is −ln(0.9) ≈ 0.105 (low). If it says 0.01, the loss is −ln(0.01) ≈ 4.6 (high).
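The single-token arithmetic, spelled out:

```python
# Cross-entropy for one correct token is just -log(p_correct).
import math

loss_confident = -math.log(0.9)   # model was right and confident
loss_surprised = -math.log(0.01)  # model gave the right answer a 1% chance
print(round(loss_confident, 3))   # 0.105
print(round(loss_surprised, 3))   # 4.605
```

The training loss above is simply this quantity averaged over every position in every sequence of the batch.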
Part 7: The Karpathy Progression — From Scalar Autograd to $100 ChatGPT
Karpathy's projects form a masterclass in progressive complexity. Each one builds on the last:
Stage 1: micrograd (2020) — Autograd from Scratch
A tiny autograd engine operating on individual scalars. This is the foundation — understanding backpropagation by building it yourself. Only ~150 lines of code.
```python
# The core idea: wrap scalars to track gradients
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    # ... more operations, then backward() traverses the graph
    # in reverse topological order, applying the chain rule:
    def backward(self):
        order, seen = [], set()
        def build(node):
            if id(node) not in seen:
                seen.add(id(node))
                for child in node._children:
                    build(child)
                order.append(node)
        build(self)
        self.grad = 1
        for node in reversed(order):
            for child, local in zip(node._children, node._local_grads):
                child.grad += node.grad * local
```

Stage 2: makemore (2022) — Character-Level Language Models
A series of increasingly sophisticated character-level name generators: bigrams → MLPs → RNNs → Transformers. Teaches the progression from counting to neural networks.
Stage 3: minGPT (2020) → nanoGPT (2023) — Reproducing GPT-2
minGPT was the educational version (~300 lines for the model). nanoGPT was the "has teeth" rewrite — still ~300 lines for the model, but optimized to actually reproduce GPT-2 (124M parameters) on OpenWebText. The entire codebase is two files: model.py and train.py.
Stage 4: build-nanogpt (2024) — The 4-Hour Video Lecture
A complete step-by-step reproduction of GPT-2, starting from an empty file. The accompanying 4-hour YouTube lecture walks through every line. Key optimizations covered:
- Mixed precision training (bfloat16) — halves memory, speeds up compute
- Flash Attention — IO-aware attention that avoids materializing the N×N attention matrix
- torch.compile — fuses Python operations into optimized GPU kernels
- Gradient accumulation — simulate large batch sizes on limited GPU memory
- Distributed Data Parallel (DDP) — train across multiple GPUs
Stage 5: microGPT (Feb 2026) — 200 Lines, Zero Dependencies
The culmination. A complete GPT in pure Python — autograd, tokenizer, architecture, optimizer, training, and inference. No libraries whatsoever. As Karpathy puts it: "This file is the complete algorithm. Everything else is just efficiency."
The key insight: the entire algorithmic content of a GPT fits on a single printed page. Production GPTs differ only in scale, not in kind.
Stage 6: nanochat (Oct 2025, actively developed) — The $100 ChatGPT
The full stack — pretraining, supervised fine-tuning, reinforcement learning, and inference with a web UI. About 8,000 lines of clean code. Run one script on an 8×H100 node and 4 hours later you have a working ChatGPT clone.
Key architectural choices in nanochat that reflect the modern standard:
- Llama-style architecture with RMSNorm, SwiGLU, and RoPE
- Depth as the single complexity dial — all hyperparameters auto-scale from --depth
- Muon + AdamW optimizer combination
- BPE tokenizer trained in Rust for speed
- GRPO for reinforcement learning on math tasks (GSM8K)
Part 8: Modern Optimizations — What Makes Production GPTs Fast
Flash Attention
Standard attention materializes a full T×T attention matrix (for sequence length T) in GPU high-bandwidth memory (HBM). For sequence length 8,192 with 96 heads, that's gigabytes of memory — and the bottleneck isn't computation but memory bandwidth.
Flash Attention (Tri Dao, 2022) solves this by tiling the computation into blocks that fit in fast on-chip SRAM (~20MB), never materializing the full attention matrix. It's mathematically exact — same result, dramatically faster.
```python
# In PyTorch 2.0+, Flash Attention is a one-liner:
from torch.nn.functional import scaled_dot_product_attention

# This automatically uses Flash Attention when available
out = scaled_dot_product_attention(q, k, v, is_causal=True)
```

The progression: Flash Attention 1 (2022) → 2 (2023, better parallelism) → 3 (2024, Hopper/H100 optimized) → 4 (March 2026, achieving 1,605 TFLOPs/s on B200 GPUs with 71% hardware utilization).
Attention Variants: MHA → MQA → GQA → MLA
The KV cache during inference is a major memory bottleneck. Different attention variants trade off quality for cache efficiency:
- Multi-Head Attention (MHA): Every head has its own K and V. Full quality, large cache.
- Multi-Query Attention (MQA): All heads share one K and V. Small cache, some quality loss.
- Grouped-Query Attention (GQA): Groups of heads share K and V. The sweet spot — used by Llama 2 70B, Mistral 7B.
- Multi-head Latent Attention (MLA): Compresses K and V into a low-rank latent space. Used by DeepSeek-V3 — the frontier approach as of 2025–2026.
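A back-of-envelope comparison makes the trade-off concrete. The layer, head, and context numbers below are illustrative assumptions, not any specific model's configuration:

```python
# Rough KV-cache sizes for a hypothetical 32-layer model with 32 query
# heads, head_dim 128, an 8192-token context, and bf16 (2 bytes) values.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per=2):
    # factor of 2: one K tensor and one V tensor per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per

mha = kv_cache_bytes(32, 32, 128, 8192)  # every head has its own K, V
gqa = kv_cache_bytes(32, 8, 128, 8192)   # 8 KV heads shared by 32 query heads
mqa = kv_cache_bytes(32, 1, 128, 8192)   # a single shared K, V

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.2f} GiB, "
      f"MQA: {mqa / 2**30:.3f} GiB")
```

Cache size scales linearly with the number of KV heads, which is why GQA's 4× reduction (and MQA's 32×) matters so much for serving long contexts.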
```python
# GQA: Grouped-Query Attention (Llama 2 style)
# Instead of n_heads KV pairs, use n_kv_heads (e.g., 8 instead of 32)
class GroupedQueryAttention(nn.Module):
    def __init__(self, embed_dim, n_heads, n_kv_heads):
        super().__init__()
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.n_groups = n_heads // n_kv_heads  # query heads per KV head
        self.head_dim = embed_dim // n_heads
        self.wq = nn.Linear(embed_dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(embed_dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(embed_dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, embed_dim, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat K, V for each group of query heads
        k = k.repeat_interleave(self.n_groups, dim=1)  # (B, n_heads, T, head_dim)
        v = v.repeat_interleave(self.n_groups, dim=1)
        # Standard scaled dot-product attention from here
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.wo(out)
```

KV Cache: Why Inference Is Different From Training
During training, we process the entire sequence at once. During inference, we generate one token at a time. Without caching, we'd recompute the K and V projections for all previous tokens at every step — wasteful.
The KV cache stores previously computed K and V tensors, so we only compute Q, K, V for the new token and reuse the rest:
```python
# Without a KV cache: recompute everything at each step (slow)
for i in range(max_tokens):
    logits = model(full_sequence)  # processes ALL tokens every time
    next_token = sample(logits[:, -1, :])
    full_sequence = torch.cat([full_sequence, next_token], dim=1)

# With a KV cache: only compute the new token (fast)
kv_cache = None
for i in range(max_tokens):
    logits, kv_cache = model(new_token_only, kv_cache=kv_cache)
    next_token = sample(logits)
```

This is exactly what Karpathy implements in microGPT — the keys and values lists accumulate cached K and V vectors across positions.
Part 9: From Pretrained GPT to ChatGPT — The Full Pipeline
This is what nanochat teaches. A ChatGPT-style model isn't just a pretrained GPT — it goes through multiple stages:
Stage 1: Pretraining
Train on trillions of tokens of internet text. The model learns language, facts, reasoning patterns. This is the most expensive phase — $100M+ for frontier models. nanochat uses FineWeb-EDU (100B tokens) and trains in hours.
Stage 2: Mid-training / Continued Pretraining
Further train on curated data: user-assistant conversations (SmolTalk), multiple-choice questions, tool use examples. This teaches the model the format of being helpful.
Stage 3: Supervised Fine-Tuning (SFT)
Train on high-quality conversation examples. This is where the model learns to be a helpful assistant rather than just a text completer.
Stage 4: Reinforcement Learning (RL)
Optimize for specific capabilities. nanochat uses GRPO (Group Relative Policy Optimization) on math problems (GSM8K). Frontier models use RLHF (from human feedback) or RLVR (from verifiable rewards).
Stage 5: Inference
Serve the model efficiently with KV caching, batched prefill/decode, and optionally quantization. nanochat provides both a CLI and web UI.
Part 10: The Architecture Landscape in 2026
The Transformer's core insight — attention + MLP on a residual stream — remains dominant. But the details have evolved significantly:
| Component | Original Transformer (2017) | Modern Standard (2026) |
|---|---|---|
| Normalization | Post-norm LayerNorm | Pre-norm RMSNorm |
| Activation | ReLU | SwiGLU / GeGLU |
| Position encoding | Sinusoidal (fixed) | RoPE (rotary) |
| Attention | Multi-Head (MHA) | GQA or MLA |
| Architecture | Encoder-Decoder | Decoder-only |
| Attention kernel | Naive O(N²) memory | Flash Attention |
| Optimizer | Adam | AdamW + warmup + cosine decay |
| Precision | FP32 | BF16 / FP8 |
Emerging Alternatives
The Transformer isn't the only game in town anymore:
- State Space Models (Mamba): Replace attention with structured state spaces. O(N) instead of O(N²) in sequence length. Mamba-3 (March 2026) shows these can match Transformers at scale.
- Mixture of Experts (MoE): Scale parameters without proportional compute. DeepSeek-V3 has 671B total parameters but only activates ~37B per token.
- Hybrid architectures: Combine attention layers with SSM layers — get the best of both worlds.
But for now, the core insight holds: attention is (mostly) all you need.
Part 11: Build It Yourself — The Feynman Challenge
The best way to understand Transformers is to build one. Here's a recommended progression:
- Start with microGPT (200 lines, pure Python) — Read every line. Run it. Modify the hyperparameters. Break it and fix it. This teaches the algorithm.
- Move to nanoGPT (~600 lines, PyTorch) — Train on Shakespeare. This teaches practical implementation.
- Watch the build-nanogpt lecture (4 hours) — Follow along and reproduce GPT-2. This teaches optimization.
- Run nanochat's speedrun (~8,000 lines, full stack) — Go from empty GPU to working chatbot. This teaches the complete pipeline.
Each step reveals something the previous one hid. As Karpathy wrote about microGPT: "This is the complete algorithm. Everything else is just efficiency." Once you understand the 200-line version, every other GPT is just the same algorithm with engineering added on top.
The Karpathy Philosophy
"What I cannot create, I do not understand." Start from an empty file. Build everything from scratch. Only then do you truly understand what the frameworks are doing for you — and more importantly, what they're hiding from you.
Appendix: Key Resources
- microGPT — karpathy.ai/microgpt.html — The 200-line pure Python GPT
- nanoGPT — github.com/karpathy/nanoGPT — The ~600-line PyTorch GPT (now deprecated in favor of nanochat)
- build-nanogpt — github.com/karpathy/build-nanogpt — Step-by-step GPT-2 reproduction with video lecture
- nanochat — github.com/karpathy/nanochat — Full-stack ChatGPT clone for $100
- "Attention Is All You Need" — arxiv.org/abs/1706.03762 — The original Transformer paper
- Zero to Hero series — Karpathy's YouTube lectures building neural networks from scratch
- Transformer Explainer — poloclub.github.io/transformer-explainer — Interactive visual explanation
"The most atomic way to train and run inference for a GPT in pure, dependency-free Python. This file is the complete algorithm. Everything else is just efficiency." — Andrej Karpathy, microGPT (2026)