Stanford CS229 — Building Large Language Models (LLMs)

Lecture by: Yann Dubois (PhD scholar, Stanford)
Course: Stanford CS229 — Machine Learning
Published: August 27, 2024
Views: 2.15M+
Video: YouTube
Full notes: GitHub — Mac-007

Overview

This lecture provides a concise, end-to-end overview of how Large Language Models (LLMs) like ChatGPT, Claude, Gemini, and Llama are built. It covers:

Pretraining — teaching the model language from scratch
Post-training (Alignment) — making models helpful and safe
Data collection & processing — the unsung hero of LLM quality
Evaluation — measuring what matters
Systems optimization — making training feasible

Core thesis: While architecture matters, data quality, evaluation methods, and systems optimizations play a more significant role in practical LLM performance than architectural tweaks.

1. Pretraining

Language Modeling Task

Autoregressive models predict the next token given previous tokens:

P(x_1, x_2, ..., x_L) = ∏_{i=1}^{L} P(x_i | x_{<i})

Input: “She likely prefers”
Model predicts: “dogs” (or other plausible continuations)
Loss: Cross-entropy — compare predicted probabilities against actual tokens
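The factorization and the loss can be checked on a toy example. This is a minimal sketch, assuming we already have the probability the model assigned to each token that actually came next; the numbers are made up for illustration and are not from the lecture.

```python
import math

# Hypothetical probabilities the model gave to the true next token
# at each of three positions in a sequence.
true_token_probs = [0.5, 0.25, 0.125]

# Chain rule: the sequence probability is the product of the conditionals.
sequence_prob = math.prod(true_token_probs)

# Cross-entropy loss: mean negative log-probability of the true tokens (in nats).
cross_entropy = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)

print(sequence_prob)   # 0.015625
print(cross_entropy)   # about 1.386, i.e. 2 * ln(2)
```

Minimizing this loss is the same as maximizing the probability the model assigns to the observed text.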

Tokenization (Byte Pair Encoding)

Why not just words or characters?

Approach	Problem
Words	Typos, rare words, no boundaries in CJK
Characters	Very long sequences → expensive compute

BPE algorithm:

Start with all characters as tokens
Merge the most frequent adjacent pairs iteratively
Build a vocabulary of subword tokens
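The three steps above can be sketched in a few lines. This is a toy trainer under simplifying assumptions (whitespace pre-splitting, no byte-level fallback, ties broken by first occurrence); `train_bpe` is a hypothetical name, and production tokenizers differ in many details.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split toy corpus."""
    # Step 1: every word starts as a tuple of single-character tokens.
    words = Counter(tuple(word) for word in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Step 2: merge the most frequent adjacent pair everywhere it occurs.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    # Step 3: the vocabulary is the base characters plus every merged token.
    base = {ch for word in text.split() for ch in word}
    vocab = base | {left + right for left, right in merges}
    return merges, vocab

merges, vocab = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" has become a single token, while rarer suffixes such as "er" and "est" remain split into characters.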

Tokenization significantly impacts model performance and efficiency.

Evaluation During Pretraining

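Perplexity: 2^(cross-entropy loss) when the loss is measured in bits, or equivalently exp(loss) when it is measured in nats; lower is better. It can be read as the effective number of tokens the model is choosing among at each step, ranging from 1 (always certain and correct) to the vocabulary size (uniform guessing). Perplexity depends on the tokenizer and the evaluation data, so values are only comparable between models that share both.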
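A small sketch of the "effective number of choices" reading. It uses natural logarithms, so perplexity is exp(loss); with base-2 logarithms the same quantity is 2^loss. The function name and inputs are illustrative.

```python
import math

def perplexity(true_token_probs):
    """exp of the mean negative log-probability of the true tokens."""
    nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(nll)

# A model that spreads probability uniformly over 4 tokens behaves as if
# it were choosing among 4 options at every step.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that is always certain and always right has perplexity 1.
perfect = perplexity([1.0, 1.0, 1.0])

print(uniform, perfect)
```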

Academic benchmarks: HELM, Hugging Face Open LLM Leaderboard, MMLU (Massive Multitask Language Understanding).

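Key challenges: scores on the same benchmark vary with the evaluation method (for example, how prompts are formatted and how answers are scored), and test data can leak into the training set (train-test contamination).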

Data Collection & Processing Pipeline

Source — Common Crawl (~250B pages)
Text extraction — strip HTML
Content filtering — remove NSFW, harmful, private info; use blacklists
Deduplication — remove duplicates to avoid overrepresentation
Heuristic filtering — rules-based (e.g., unusual token distributions)
Model-based filtering — train classifiers to select high-quality data
Domain classification & balancing — categorize into code, books, etc.; emphasize high-quality domains
Final fine-tuning — finish pretraining on high-quality sources such as Wikipedia; slight overfitting to this data is accepted because it tends to improve the final model

Data quality is paramount. Extensive cleaning is required.
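Two of the pipeline stages, deduplication and heuristic filtering, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names, the exact-match hashing (real pipelines typically also use near-duplicate detection), and the thresholds `min_words` and `max_symbol_ratio`.

```python
import hashlib
import re

def normalize(text):
    # Lowercase and collapse whitespace so trivially different copies match.
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(docs):
    """Keep the first copy of each document, comparing normalized hashes."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def passes_heuristics(doc, min_words=5, max_symbol_ratio=0.3):
    """Rule-based filter: drop very short or symbol-heavy documents."""
    if len(doc.split()) < min_words:
        return False
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    return symbols / len(doc) <= max_symbol_ratio

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalizing
    "$$$ !!! ### @@@ %%% ^^^ &&& ***",                # mostly symbols
    "Too short.",
]
cleaned = [d for d in deduplicate(docs) if passes_heuristics(d)]
print(cleaned)  # only the first document survives
```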

Scaling Laws

Loss decreases as a power law in model size and training data
Chinchilla compute-optimal ratio: ~20 training tokens per parameter
Guides resource allocation: model size ↔ data quantity ↔ compute
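As a worked example of the allocation, the sketch below splits a fixed compute budget using the ~20 tokens-per-parameter ratio. It relies on the common approximation that training costs about 6 FLOPs per parameter per token, which is an assumption not stated in these notes; the budget value is made up.

```python
import math

TOKENS_PER_PARAM = 20  # the Chinchilla ratio quoted above

def compute_optimal(flops_budget, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOPs budget C into (params N, tokens D), assuming C ~= 6 * N * D."""
    # With D = r * N, we get C = 6 * r * N^2, so N = sqrt(C / (6 * r)).
    params = math.sqrt(flops_budget / (6 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

params, tokens = compute_optimal(1.2e23)  # hypothetical budget
print(f"{params / 1e9:.1f}B parameters, {tokens / 1e9:.0f}B tokens")
```

Under these assumptions a budget of 1.2e23 FLOPs comes out at roughly 31.6B parameters trained on roughly 632B tokens.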

2. Post-Training (Alignment)

Pretrained LLMs do not reliably follow instructions and may generate undesirable outputs. Alignment addresses both problems.

Supervised Fine-Tuning (SFT)

Human experts create instruction-response pairs
Train on this data using the same language modeling loss
Surprisingly small datasets (e.g., 2,000 examples) can significantly influence behavior
Limitation: response quality is bounded by what the human writers can produce, and the model may not generalize beyond the kinds of examples it has seen
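SFT reuses the next-token loss from pretraining. One common variant, assumed here and not stated in the notes, averages the loss over the response tokens only, so the model is not trained to reproduce the instruction. The log-probabilities below are made up.

```python
def sft_loss(token_logprobs, is_response):
    """Mean negative log-likelihood over the response tokens only."""
    kept = [-lp for lp, keep in zip(token_logprobs, is_response) if keep]
    return sum(kept) / len(kept)

# Hypothetical per-token log-probabilities for one training example:
# two instruction tokens followed by two response tokens.
token_logprobs = [-3.0, -2.0, -0.5, -1.5]
is_response = [False, False, True, True]

loss = sft_loss(token_logprobs, is_response)
print(loss)  # 1.0, the mean of 0.5 and 1.5
```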

Reinforcement Learning from Human Feedback (RLHF)

Why RLHF over SFT:

Humans are better judges than generators
Mitigates hallucinations
More cost-effective than writing perfect answers

Process:

Humans compare multiple model outputs → preferences
Train a reward model to predict human preferences
Fine-tune the LLM using PPO (Proximal Policy Optimization) to maximize reward
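The reward-model step is often trained with a pairwise (Bradley-Terry style) loss. The notes do not name the loss, so treat this as an assumed, commonly used choice; `preference_loss` and the reward values are illustrative.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a stable form."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss is small when the preferred output gets the higher reward,
# and large when the reward model ranks the pair the wrong way round.
agrees = preference_loss(2.0, 0.0)
disagrees = preference_loss(0.0, 2.0)
undecided = preference_loss(1.0, 1.0)  # equals ln(2)
print(agrees, disagrees, undecided)
```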

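Direct Preference Optimization (DPO): a simpler alternative to PPO. It trains directly on preference pairs with a supervised-style loss that raises the likelihood of the preferred response relative to the rejected one, with no separate reward model or RL loop. Reported performance is comparable to PPO.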
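A minimal sketch of the DPO loss for a single preference pair, assuming we already have the summed log-probability of each response under the model being trained (the policy) and under a frozen reference model. The value `beta=0.1` and the example numbers are illustrative assumptions.

```python
import math

def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    return neg_log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# Policy identical to the reference: the loss is ln(2).
start = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Policy has shifted probability toward the preferred response: lower loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(start, improved)
```

The loss only needs log-probabilities from two forward passes, which is why no sampling or reward model is involved during training.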

Data Collection for RLHF

Source	Pros	Cons
Human feedback	Gold standard	Expensive, inconsistent
Synthetic (LLM-generated)	Scalable, consistent	Biases, over-optimization risk

Evaluating Aligned LLMs

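Perplexity is no longer a useful metric after post-training, because aligned models are optimized to produce preferred responses rather than to match the distribution of text. Instead: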

Human evaluation — blind comparisons (expensive but gold standard)
LLM-as-judge — use models to evaluate models (scalable, but biased: judges tend to favor longer answers and are sensitive to the order in which answers are presented)
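One widely used mitigation for position bias is to query the judge twice with the answers swapped and count a win only when both orderings agree. The sketch below assumes a hypothetical `judge(prompt, first, second)` callable that returns "first" or "second"; the two toy judges stand in for a real model.

```python
def debiased_preference(judge, prompt, answer_a, answer_b):
    """Ask the judge in both orders; return 'a', 'b', or 'tie'."""
    verdict_ab = judge(prompt, answer_a, answer_b)
    verdict_ba = judge(prompt, answer_b, answer_a)
    if verdict_ab == "first" and verdict_ba == "second":
        return "a"
    if verdict_ab == "second" and verdict_ba == "first":
        return "b"
    return "tie"  # the verdict flipped with the order, so it is not trusted

def position_biased_judge(prompt, first, second):
    # Toy judge that always prefers whichever answer is shown first.
    return "first"

def length_biased_judge(prompt, first, second):
    # Toy judge that always prefers the longer answer.
    return "first" if len(first) >= len(second) else "second"

short, long_ = "Paris.", "The capital of France is Paris, a city on the Seine."
result_position = debiased_preference(position_biased_judge, "Capital of France?", short, long_)
result_length = debiased_preference(length_biased_judge, "Capital of France?", short, long_)
print(result_position, result_length)  # tie b
```

Swapping neutralizes the purely position-biased judge, but the length-biased judge still picks the longer answer in both orders, so length bias needs a separate correction.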

3. Systems Optimization

Training LLMs is resource-intensive. Optimizations make it feasible.

Low Precision Training

Use 16-bit floating point instead of 32-bit
Benefits: faster arithmetic, lower memory → larger batches/models
Mixed precision: run most arithmetic in 16-bit while keeping weight updates in higher precision where the extra accuracy is needed
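A small numerical illustration of why the updates need higher precision. It uses the standard library's half-precision packing to emulate 16-bit rounding; the weight, update size, and step count are made-up values, and a Python float (64-bit) stands in for a 32-bit master weight.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

weight, update, steps = 1.0, 1e-4, 100

# Pure 16-bit: the weight is rounded back to fp16 after every update.
# Near 1.0, fp16 values are about 0.001 apart, so each tiny update is lost.
w16 = to_fp16(weight)
for _ in range(steps):
    w16 = to_fp16(w16 + to_fp16(update))

# Mixed precision: accumulate in a higher-precision master copy and
# cast to fp16 only when the weight is used for compute.
master = weight
for _ in range(steps):
    master += to_fp16(update)
w_mixed = to_fp16(master)

print(w16, w_mixed)  # the pure fp16 weight never moves from 1.0
```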

Operator Fusion

Problem: separate operations cause unnecessary data movement, underutilize GPUs
Solution: combine multiple sequential ops into a single fused kernel
Tools: PyTorch torch.compile, custom CUDA kernels
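The idea can be shown conceptually without a GPU. In this sketch each list comprehension stands in for one kernel launch that reads and writes a full array in memory; it is an analogy only and says nothing about how torch.compile or CUDA kernels are implemented. The function names are illustrative.

```python
import math

def unfused(xs, scale, bias):
    # Three separate "kernels": each reads a full array and writes a full array,
    # so two intermediate arrays travel to memory and back.
    scaled = [x * scale for x in xs]
    shifted = [s + bias for s in scaled]
    return [math.tanh(v) for v in shifted]

def fused(xs, scale, bias):
    # One "kernel": each element is read once and written once, and the
    # intermediate values never leave local variables.
    return [math.tanh(x * scale + bias) for x in xs]

xs = [0.0, 0.5, 1.0, 2.0]
out_unfused = unfused(xs, scale=2.0, bias=-1.0)
out_fused = fused(xs, scale=2.0, bias=-1.0)
print(out_unfused == out_fused)
```

Both versions compute the same result; the fused one simply moves less data, which is where the speedup comes from on memory-bound GPU workloads.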

Key Takeaways

Data quality > data quantity — filtering and curation matter more than raw scale
Scaling laws guide strategy — ~20 tokens per parameter is the Chinchilla compute-optimal ratio for training
Alignment transforms models — SFT + RLHF/DPO make LLMs usable and safe
Systems optimization is essential — without it, training is economically infeasible
Evaluation is still an open problem — especially for open-ended, aligned models
