Stanford CS229 — Building Large Language Models (LLMs)
- Lecture by: Yann Dubois (PhD student, Stanford)
- Course: Stanford CS229 — Machine Learning
- Published: August 27, 2024
- Views: 2.15M+
- Video: YouTube
- Full notes: GitHub — Mac-007
Overview
This lecture provides a concise, end-to-end overview of how Large Language Models (LLMs) like ChatGPT, Claude, Gemini, and Llama are built. It covers:
- Pretraining — teaching the model language from scratch
- Post-training (Alignment) — making models helpful and safe
- Data collection & processing — the unsung hero of LLM quality
- Evaluation — measuring what matters
- Systems optimization — making training feasible
Core thesis: Architecture matters, but in practice data quality, evaluation methods, and systems optimizations affect LLM performance more than architectural tweaks do.
1. Pretraining
Language Modeling Task
Autoregressive models predict the next token given previous tokens:
P(x₁, x₂, ..., x_L) = ∏_{i=1}^{L} P(x_i | x_{<i})
- Input: “She likely prefers”
- Model predicts: “dogs” (or other plausible continuations)
- Loss: Cross-entropy — compare predicted probabilities against actual tokens
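The chain-rule factorization and the cross-entropy loss can be checked by hand on a toy example. The sketch below is only illustrative: the per-position distributions in `step_probs` are made-up numbers, not the output of any real model, and a real model would condition each distribution on the preceding tokens.

```python
import math

# Hypothetical next-token distributions, one per position.
# Each dict maps a candidate token to P(token | previous tokens).
step_probs = [
    {"She": 0.5, "He": 0.5},
    {"likely": 0.25, "never": 0.75},
    {"prefers": 0.5, "hates": 0.5},
    {"dogs": 0.5, "cats": 0.5},
]
sequence = ["She", "likely", "prefers", "dogs"]

def sequence_log_prob(tokens, dists):
    """Sum of log P(x_i | x_<i), i.e. the log of the chain-rule product."""
    return sum(math.log(dist[tok]) for tok, dist in zip(tokens, dists))

def cross_entropy(tokens, dists):
    """Average negative log-likelihood per token, in nats."""
    return -sequence_log_prob(tokens, dists) / len(tokens)

log_p = sequence_log_prob(sequence, step_probs)
loss = cross_entropy(sequence, step_probs)
print(f"P(sequence) = {math.exp(log_p):.5f}, loss = {loss:.4f} nats")
```

Working in log space turns the product into a sum, which is why training code minimizes summed (or averaged) negative log-probabilities rather than multiplying raw probabilities.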
Tokenization (Byte Pair Encoding)
Why not just words or characters?
| Approach | Problem |
|---|---|
| Words | Typos, rare words, no explicit word boundaries in some languages (e.g., Chinese, Japanese) |
| Characters | Very long sequences → expensive compute |
BPE algorithm:
- Start with all characters as tokens
- Merge the most frequent adjacent pairs iteratively
- Build a vocabulary of subword tokens
Tokenization significantly impacts model performance and efficiency.
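The three BPE steps above can be sketched in a few lines. This is a minimal toy version, assuming a corpus that is already split into words and using a lexicographic tie-break that the notes do not specify; production tokenizers typically work on bytes and add pre-tokenization rules. The name `train_bpe` is hypothetical.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a list of words (a toy corpus)."""
    # Represent each word as a tuple of symbols, starting from characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = train_bpe(
    ["low", "low", "lower", "lowest", "newest", "widest"], num_merges=3
)
print(merges)
```

On this corpus the frequent word "low" ends up as a single token after two merges, while rarer words stay split into smaller pieces, which is the behavior that lets BPE handle typos and rare words gracefully.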
Evaluation During Pretraining
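Perplexity: 2^(average per-token cross-entropy, measured in bits); equivalently exp(loss) when the loss uses natural logarithms. Lower is better. It can be read as the effective number of tokens the model is choosing between at each step. Because the value depends on the tokenizer and the evaluation data, perplexities are only comparable between models that share both.
Academic benchmarks: HELM, Hugging Face Open LLM Leaderboard, MMLU (Massive Multitask Language Understanding).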
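As a small worked example of the perplexity definition, the sketch below computes it from the probabilities a model assigned to the correct tokens. The probability lists are invented for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to the true tokens."""
    # Average cross-entropy in bits per token.
    bits = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** bits

# A model that spreads its mass uniformly over 4 options at every step.
uniform = perplexity([0.25] * 8)

# A more confident (hypothetical) model on a 4-token sequence.
confident = perplexity([0.5, 0.25, 0.5, 0.5])

print(uniform, confident)
```

The uniform case gives exactly 4, matching the "effective number of choices" reading; the more confident model scores lower.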
Key challenges: inconsistencies between evaluation methods, train-test contamination.
Data Collection & Processing Pipeline
- Source — Common Crawl (~250B pages)
- Text extraction — strip HTML
- Content filtering — remove NSFW, harmful, private info; use blacklists
- Deduplication — remove duplicates to avoid overrepresentation
- Heuristic filtering — rules-based (e.g., unusual token distributions)
- Model-based filtering — train classifiers to select high-quality data
- Domain classification & balancing — categorize into code, books, etc.; emphasize high-quality domains
- Final high-quality stage — end pretraining on very high-quality sources such as Wikipedia; slightly overfitting to this data tends to improve the final model
Data quality is paramount. Extensive cleaning is required.
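Two of the pipeline stages, deduplication and heuristic filtering, are easy to sketch. The version below is a deliberately simplified assumption: it does exact deduplication on normalized text (real pipelines usually add fuzzy methods as well), and the thresholds in `passes_heuristics` are hypothetical, not values from the lecture.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Exact deduplication on normalized text, keeping first occurrences."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def passes_heuristics(doc, min_words=5, max_repeat_ratio=0.5):
    """Hypothetical rules: drop very short docs and docs dominated by one word."""
    words = doc.lower().split()
    if len(words) < min_words:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) <= max_repeat_ratio

raw_docs = [
    "The cat sat on the mat near the door.",
    "the cat  sat on the mat near the door.",   # duplicate after normalization
    "buy buy buy buy buy buy now",              # dominated by one word
    "Too short.",
    "Scaling laws relate loss to model size and data.",
]
clean_docs = [d for d in deduplicate(raw_docs) if passes_heuristics(d)]
print(clean_docs)
```

Hashing normalized text keeps memory use proportional to the number of unique documents rather than their total size, which matters at web scale.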
Scaling Laws
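- Loss falls predictably, roughly as a power law, as model size, data, and compute grow
- Chinchilla compute-optimal ratio for training: ~20 tokens per parameter (models intended for heavy inference use are often trained on more)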
- Guides resource allocation: model size ↔ data quantity ↔ compute
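The 20-tokens-per-parameter rule turns into simple arithmetic. The sketch below additionally assumes the common approximation that training costs about 6 FLOPs per parameter per token; that approximation is not stated in these notes and is only a rough estimate.

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb quoted in the notes

def chinchilla_tokens(n_params):
    """Training tokens suggested by the 20:1 rule."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params, n_tokens):
    """Assumed approximation: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def compute_optimal(flops_budget):
    """Solve 6 * N * (20 * N) = C for N, then set D = 20 * N."""
    n_params = (flops_budget / (6 * TOKENS_PER_PARAM)) ** 0.5
    return n_params, chinchilla_tokens(n_params)

tokens_70b = chinchilla_tokens(70e9)          # tokens for a 70B-parameter model
flops_70b = training_flops(70e9, tokens_70b)  # estimated training compute
n, d = compute_optimal(flops_70b)             # recover N and D from the budget

print(f"{tokens_70b:.2e} tokens, {flops_70b:.2e} FLOPs")
```

Read in the other direction, `compute_optimal` shows how a fixed compute budget pins down both model size and dataset size once the ratio is chosen.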
2. Post-Training (Alignment)
Pretrained LLMs model text; they do not reliably follow instructions and may generate undesirable outputs. Alignment addresses both problems.
Supervised Fine-Tuning (SFT)
- Human experts create instruction-response pairs
- Train on this data using the same language modeling loss
- Surprisingly small datasets (e.g., 2,000 examples) can significantly influence behavior
- Limitation: quality is bounded by what human writers can produce, and the model mostly learns to imitate the style of the examples rather than acquire new knowledge
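Because SFT reuses the language modeling loss, the only change is which tokens contribute to it. The sketch below assumes the common practice of scoring only the response tokens and ignoring the instruction tokens; the notes do not say whether the lecture masks the prompt, and the log-probabilities are made up.

```python
import math

def sft_loss(token_log_probs, is_response):
    """Mean negative log-likelihood over response tokens only."""
    picked = [-lp for lp, keep in zip(token_log_probs, is_response) if keep]
    return sum(picked) / len(picked)

# Hypothetical per-token log-probabilities for "instruction + response".
log_probs = [math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.25)]
mask = [False, False, True, True]  # the first two tokens are the instruction

loss = sft_loss(log_probs, mask)
print(f"SFT loss = {loss:.4f} nats")
```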
Reinforcement Learning from Human Feedback (RLHF)
Why RLHF over SFT:
- Humans are better judges than generators
- Mitigates hallucinations
- More cost-effective than writing perfect answers
Process:
- Humans compare multiple model outputs → preferences
- Train a reward model to predict human preferences
- Fine-tune the LLM using PPO (Proximal Policy Optimization) to maximize reward
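Direct Preference Optimization (DPO): A simpler alternative to PPO. It skips the separate reward model and RL loop, and instead trains the LLM directly with a supervised-style loss that raises the likelihood of preferred responses relative to rejected ones. It reaches comparable performance without the complexity of RL.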
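Both the reward model and DPO can be written as a logistic loss on a pair of responses. The sketch below assumes the widely used Bradley-Terry form for the reward model and the standard per-pair DPO objective; the value `beta=0.1` and all log-probabilities are illustrative choices, not numbers from the lecture.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(r_chosen - r_rejected)

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given sequence log-probs under policy and reference."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)

# Policy identical to the reference model: no preference learned yet.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has moved probability toward the chosen response.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(start, improved)
```

The loss starts at log 2 when the policy equals the reference and drops as the policy favors the chosen response more than the reference does, with no sampling or reward model required.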
Data Collection for RLHF
| Source | Pros | Cons |
|---|---|---|
| Human feedback | Gold standard | Expensive, inconsistent |
| Synthetic (LLM-generated) | Scalable, consistent | Biases, over-optimization risk |
Evaluating Aligned LLMs
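Perplexity is no longer useful post-training, because aligned models are not optimized purely to match the data distribution and many different answers can be equally good. Instead: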
- Human evaluation — blind comparisons (expensive but gold standard)
- LLM-as-judge — use models to evaluate models (scalable, but has known biases, such as favoring longer answers or the answer shown in a particular position)
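One common mitigation for position bias is to query the judge twice with the answers swapped and accept a verdict only when both runs agree. The sketch below uses two stand-in judge functions instead of a real LLM call; every name in it is hypothetical.

```python
def debiased_preference(judge, prompt, answer_a, answer_b):
    """Query the judge in both orders; accept a winner only if both runs agree."""
    first = judge(prompt, answer_a, answer_b)    # returns 1 or 2
    second = judge(prompt, answer_b, answer_a)   # same pair, order swapped
    if first == 1 and second == 2:
        return "A"
    if first == 2 and second == 1:
        return "B"
    return "tie"

def position_biased_judge(prompt, first, second):
    """Stand-in judge that always picks whichever answer is shown first."""
    return 1

def length_biased_judge(prompt, first, second):
    """Stand-in judge that picks the longer answer regardless of position."""
    return 1 if len(first) >= len(second) else 2

short, long_ = "short", "a much longer answer"
by_position = debiased_preference(position_biased_judge, "q", short, long_)
by_length = debiased_preference(length_biased_judge, "q", short, long_)
print(by_position, by_length)
```

Swapping neutralizes the position-biased judge (it yields a tie), but the length-biased judge still consistently prefers the longer answer, so length bias needs a separate control.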
3. Systems Optimization
Training LLMs is resource-intensive. Optimizations make it feasible.
Low Precision Training
- Use 16-bit floating point (e.g., fp16 or bf16) instead of 32-bit
- Benefits: faster arithmetic, lower memory → larger batches/models
- Mixed precision: run most computation in 16-bit, but keep the weights and their updates in higher precision so that small updates are not rounded away
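The reason updates need higher precision can be demonstrated without a GPU. The sketch below rounds values to IEEE 754 half precision with the standard library's `struct` module; the update size and step count are hypothetical, and a 64-bit Python float stands in for the 32-bit master copy used in practice.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

update = 1e-4   # hypothetical small weight update
steps = 1000

# Pure 16-bit: the weight is rounded back to fp16 after every update.
w_fp16 = to_fp16(1.0)
for _ in range(steps):
    w_fp16 = to_fp16(w_fp16 + to_fp16(update))

# Mixed precision: a higher-precision master weight accumulates the updates,
# and a 16-bit copy is taken for the forward and backward passes.
w_master = 1.0
for _ in range(steps):
    w_master += to_fp16(update)
w_compute = to_fp16(w_master)

print(w_fp16, w_master, w_compute)
```

Near 1.0 the spacing between fp16 values is about 0.001, so each 0.0001 update is rounded away and the pure 16-bit weight never moves, while the master copy reaches roughly 1.1.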
Operator Fusion
- Problem: separate operations cause unnecessary data movement, underutilize GPUs
- Solution: combine multiple sequential ops into a single fused kernel
- Tools: PyTorch torch.compile, custom CUDA kernels
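The idea behind fusion can be shown with plain Python lists. This is only a conceptual sketch of loop fusion on the CPU, not a GPU kernel: each list comprehension plays the role of one kernel that reads its whole input from memory and writes a whole output buffer.

```python
import math

def unfused(xs):
    """Two separate passes, with a full intermediate buffer in between."""
    a = [math.cos(x) for x in xs]   # pass 1: read xs, write a
    b = [math.cos(v) for v in a]    # pass 2: read a, write b
    return b

def fused(xs):
    """One pass: each element is read once and written once."""
    return [math.cos(math.cos(x)) for x in xs]

data = [0.0, 0.5, 1.0, 2.0]
print(unfused(data) == fused(data))
```

Both versions compute the same values; the fused one simply avoids materializing the intermediate buffer. In PyTorch, wrapping a function in `torch.compile` asks the compiler to look for fusions of this kind automatically.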
Key Takeaways
- Data quality > data quantity — filtering and curation matter more than raw scale
- Scaling laws guide strategy — ~20 tokens per parameter is the compute-optimal rule of thumb for training
- Alignment transforms models — SFT + RLHF/DPO make LLMs usable and safe
- Systems optimization is essential — without it, training is economically infeasible
- Evaluation is still an open problem — especially for open-ended, aligned models
Further Reading
- CS224N — NLP with Deep Learning
- CS324 — Large Language Models
- CS336 — Language Modeling from Scratch