Stanford CS229 — Building Large Language Models (LLMs)

  • Lecture by: Yann Dubois (PhD scholar, Stanford)
  • Course: Stanford CS229 — Machine Learning
  • Published: August 27, 2024
  • Views: 2.15M+
  • Video: YouTube
  • Full notes: GitHub — Mac-007

Overview

This lecture provides a concise, end-to-end overview of how Large Language Models (LLMs) like ChatGPT, Claude, Gemini, and Llama are built. It covers:

  1. Pretraining — teaching the model language from scratch
  2. Post-training (Alignment) — making models helpful and safe
  3. Data collection & processing — the unsung hero of LLM quality
  4. Evaluation — measuring what matters
  5. Systems optimization — making training feasible

Core thesis: architecture matters, but data quality, evaluation methods, and systems optimizations contribute more to practical LLM performance than architectural tweaks do.


1. Pretraining

Language Modeling Task

Autoregressive models predict the next token given previous tokens:

P(x_1, x_2, ..., x_L) = ∏_{i=1}^{L} P(x_i | x_{<i})

  • Input: “She likely prefers”
  • Model predicts: “dogs” (or other plausible continuations)
  • Loss: Cross-entropy — compare predicted probabilities against actual tokens
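
The factorization and the loss can be made concrete with a toy example. This is a minimal sketch: the `toy_model` table and its probabilities are made up for illustration and do not come from the lecture. It multiplies the conditional probabilities along one sequence and reports the average negative log-likelihood, which is the cross-entropy loss for that sequence.

```python
import math

# Hypothetical next-token distributions (made-up numbers, not from the lecture).
# Key: the context so far; value: P(next token | context).
toy_model = {
    ("She",): {"likely": 0.5, "is": 0.5},
    ("She", "likely"): {"prefers": 0.8, "wants": 0.2},
    ("She", "likely", "prefers"): {"dogs": 0.6, "cats": 0.4},
}

def sequence_log_prob(tokens):
    """Sum log P(x_i | x_<i) over the sequence; the first token is taken as given."""
    total = 0.0
    for i in range(1, len(tokens)):
        context = tuple(tokens[:i])
        total += math.log(toy_model[context][tokens[i]])
    return total

def cross_entropy(tokens):
    """Average negative log-likelihood per predicted token, in nats."""
    return -sequence_log_prob(tokens) / (len(tokens) - 1)

tokens = ["She", "likely", "prefers", "dogs"]
print(math.exp(sequence_log_prob(tokens)))  # product of 0.5, 0.8 and 0.6
print(cross_entropy(tokens))
```

Training lowers this loss by shifting probability toward the tokens that actually occur in the data.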

Tokenization (Byte Pair Encoding)

Why not just words or characters?

Approach     Problem
Words        Typos, rare words, no word boundaries in CJK
Characters   Very long sequences → expensive compute

BPE algorithm:

  1. Start with all characters as tokens
  2. Merge the most frequent adjacent pairs iteratively
  3. Build a vocabulary of subword tokens
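
The three steps above can be sketched in a few lines. This is a simplified, character-level illustration rather than a production tokenizer: the function name `train_bpe` and the tiny word list are assumptions made for the example, and real implementations typically work on bytes and pre-tokenized text.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy version, starts from characters)."""
    # Step 1: each word starts as a tuple of single-character tokens.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent token pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the most frequent pair with one merged token.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    # Step 3: the merge list plus the base characters define the vocabulary.
    return merges, corpus

merges, corpus = train_bpe(["low", "low", "lower", "lowest", "newest"], 2)
print(merges)
print(corpus)
```

After two merges the frequent word "low" has become a single token, while rarer words are still split into smaller pieces.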

Tokenization significantly impacts model performance and efficiency.

Evaluation During Pretraining

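Perplexity: the exponential of the average per-token cross-entropy loss, that is 2^loss when the loss is measured in bits, or e^loss when it is measured in nats. Lower is better. It can be read as the effective number of equally likely choices the model is deciding between at each step. Perplexity depends on the tokenizer and on the evaluation data, so values are only comparable between models that share both.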
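
A short numeric check of the perplexity definition, as a sketch. The helper name `perplexity` and the probabilities are illustrative assumptions. It computes the value once from a loss in nats and once from the same loss in bits, to show that the two conventions agree.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities the model assigned to the actual tokens."""
    # Cross-entropy in nats: average negative natural-log probability.
    ce_nats = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # The same loss expressed in bits.
    ce_bits = ce_nats / math.log(2)
    return math.exp(ce_nats), 2 ** ce_bits

# A model that is uniformly unsure among 8 tokens at every step.
ppl_from_nats, ppl_from_bits = perplexity([1 / 8] * 5)
print(ppl_from_nats, ppl_from_bits)
```

For the uniform case both values come out at 8 (up to floating-point rounding), matching the "effective number of choices" reading.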

Academic benchmarks: HELM, Hugging Face Open LLM Leaderboard, MMLU (Massive Multitask Language Understanding).

Key challenges: inconsistencies between evaluation methods, train-test contamination.

Data Collection & Processing Pipeline

  1. Source — Common Crawl (~250B pages)
  2. Text extraction — strip HTML
  3. Content filtering — remove NSFW, harmful, private info; use blacklists
  4. Deduplication — remove duplicates to avoid overrepresentation
  5. Heuristic filtering — rules-based (e.g., unusual token distributions)
  6. Model-based filtering — train classifiers to select high-quality data
  7. Domain classification & balancing — categorize into code, books, etc.; emphasize high-quality domains
  8. Final high-quality stage — finish pretraining on the highest-quality sources (e.g., Wikipedia) while the learning rate decays, deliberately overfitting slightly to that data

Data quality is paramount. Extensive cleaning is required.
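
Two of the pipeline stages, deduplication and heuristic filtering, can be illustrated with a toy sketch. The thresholds, function names and example documents below are assumptions chosen for the example. Production pipelines generally use fuzzy deduplication and many more rules than this exact-match version.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Exact deduplication on a hash of the normalized text."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def passes_heuristics(doc, min_words=5, max_repeat_ratio=0.5):
    """Reject very short documents and documents dominated by one repeated word."""
    words = doc.split()
    if len(words) < min_words:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) <= max_repeat_ratio

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "buy buy buy buy buy buy now",                    # unusual token distribution
    "Too short.",
]
clean = [d for d in deduplicate(docs) if passes_heuristics(d)]
print(clean)
```

Only the first document survives: the second is a near-verbatim copy, the third is mostly one repeated word, and the fourth is too short.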

Scaling Laws

  • Loss decreases as a power law in model size and training data (a straight line on a log-log plot)
  • Chinchilla optimal ratio: ~20 tokens per parameter
  • Guides resource allocation: model size ↔ data quantity ↔ compute
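
As a worked example of the allocation trade-off, the sketch below splits a compute budget between parameters and tokens. It uses the ~20 tokens-per-parameter ratio from these notes together with the common rule of thumb that training costs about 6 FLOPs per parameter per token. That second figure is an assumption brought in for the example, and the function name `compute_optimal` is hypothetical.

```python
import math

TOKENS_PER_PARAM = 20       # ratio quoted in the notes
FLOPS_PER_PARAM_TOKEN = 6   # assumption: C is roughly 6 * N * D

def compute_optimal(compute_flops):
    """Split a FLOP budget C into parameters N and tokens D.

    With C = 6 * N * D and D = 20 * N, it follows that N = sqrt(C / 120).
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

n_params, n_tokens = compute_optimal(1e24)
print(f"{n_params:.3g} parameters, {n_tokens:.3g} tokens")
```

Under these assumptions a budget of 1e24 FLOPs corresponds to roughly 91B parameters trained on about 1.8T tokens.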

2. Post-Training (Alignment)

Pretrained LLMs do not reliably follow instructions and may generate undesirable outputs. Alignment addresses both problems.

Supervised Fine-Tuning (SFT)

  • Human experts create instruction-response pairs
  • Train on this data using the same language modeling loss
  • Surprisingly small datasets (e.g., 2,000 examples) can significantly influence behavior
  • Limitation: quality depends on humans; doesn’t generalize beyond examples

Reinforcement Learning from Human Feedback (RLHF)

Why RLHF over SFT:

  • Humans are better judges than generators
  • Mitigates hallucinations
  • More cost-effective than writing perfect answers

Process:

  1. Humans compare multiple model outputs → preferences
  2. Train a reward model to predict human preferences
  3. Fine-tune the LLM using PPO (Proximal Policy Optimization) to maximize reward
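
Step 2 is commonly trained with a pairwise logistic (Bradley-Terry) loss. The notes do not specify the loss, so treat this as an assumed, typical choice rather than the lecture's exact recipe. The rewards are plain numbers here. In practice they would be outputs of a neural reward model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(reward_chosen, reward_rejected):
    """Negative log-probability that the human-preferred answer wins.

    Assumes P(chosen beats rejected) = sigmoid(reward_chosen - reward_rejected).
    """
    return -math.log(sigmoid(reward_chosen - reward_rejected))

print(preference_loss(2.0, 0.0))  # reward model agrees with the human: small loss
print(preference_loss(0.0, 2.0))  # reward model disagrees: large loss
```

Minimizing this loss pushes the reward of preferred answers above the reward of rejected ones, which is the signal PPO then maximizes.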

Direct Preference Optimization (DPO): Simpler alternative to PPO. Uses standard supervised learning to prefer better responses over worse ones. Comparable performance without complex RL.
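
The DPO objective for a single preference pair can be written as a short function. This sketch assumes the sequence log-probabilities under the policy and under a frozen reference model have already been computed. The argument names and the value of `beta` are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more the policy likes each response than the reference model does.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # Equal to -log(sigmoid(logits)).
    return math.log(1.0 + math.exp(-logits))

# Policy identical to the reference: loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response: loss is lower.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

No reward model or sampling loop is needed: the loss is an ordinary differentiable function of log-probabilities, which is what makes DPO simpler than PPO.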

Data Collection for RLHF

Source                      Pros                   Cons
Human feedback              Gold standard          Expensive, inconsistent
Synthetic (LLM-generated)   Scalable, consistent   Biases, over-optimization risk

Evaluating Aligned LLMs

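Perplexity is no longer useful after post-training: an aligned model is tuned toward preferred responses rather than toward matching the text distribution, and open-ended answers have no single reference to score against. Instead: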

  • Human evaluation — blind comparisons (expensive but gold standard)
  • LLM-as-judge — use models to evaluate models (scalable, but biased: judges tend to prefer longer outputs, favor a particular answer position, etc.)
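
One common way to reduce position bias is to query the judge twice with the answers swapped and only count a win when both orderings agree. The sketch below illustrates that idea. The `judge` callables are hypothetical stand-ins for a real LLM call, and the whole setup is an assumption for illustration rather than the lecture's protocol.

```python
def debiased_preference(judge, prompt, answer_a, answer_b):
    """Ask the judge in both orders; return 'a', 'b', or 'tie'.

    `judge(prompt, first, second)` must return 1 or 2 for the answer it prefers.
    """
    first_pass = judge(prompt, answer_a, answer_b)
    second_pass = judge(prompt, answer_b, answer_a)
    a_wins = (first_pass == 1) + (second_pass == 2)
    if a_wins == 2:
        return "a"
    if a_wins == 0:
        return "b"
    return "tie"

# Hypothetical judges used only to demonstrate the two biases.
def always_first(prompt, first, second):
    return 1  # pure position bias

def prefers_longer(prompt, first, second):
    return 1 if len(first) >= len(second) else 2  # pure length bias

print(debiased_preference(always_first, "q", "short", "a much longer answer"))
print(debiased_preference(prefers_longer, "q", "short", "a much longer answer"))
```

Swapping turns a purely position-biased judge into a tie, but the length-biased judge still picks the longer answer, so length bias needs a separate correction.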

3. Systems Optimization

Training LLMs is resource-intensive. Optimizations make it feasible.

Low Precision Training

  • Use 16-bit floating point instead of 32-bit
  • Benefits: faster arithmetic, lower memory → larger batches/models
  • Mixed precision: run most arithmetic in 16-bit, but keep a higher-precision (32-bit) copy of the weights and apply updates to it, so that small updates are not rounded away
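
The need for a higher-precision copy of the weights can be shown with the standard library alone. This is a sketch under stated assumptions: `struct`'s `"e"` format rounds a number to IEEE 754 half precision, and an ordinary Python float (64-bit) stands in for the 32-bit master weights used in real training. The weight, update size and step count are made up.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

weight, update, steps = 1.0, 1e-4, 1000

# Pure 16-bit training: the weight is rounded back to fp16 after every update.
w16 = to_fp16(weight)
for _ in range(steps):
    w16 = to_fp16(w16 + to_fp16(update))

# Mixed precision: the update is fp16, but it is accumulated in a
# higher-precision master copy of the weight.
master = weight
for _ in range(steps):
    master += to_fp16(update)

print(w16, master)
```

Near 1.0, adjacent half-precision values are about 0.001 apart, so each 0.0001 update is rounded away and the pure 16-bit weight never moves, while the master copy ends up near 1.1.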

Operator Fusion

  • Problem: separate operations cause unnecessary data movement, underutilize GPUs
  • Solution: combine multiple sequential ops into a single fused kernel
  • Tools: PyTorch torch.compile, custom CUDA kernels
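
The idea behind fusion can be illustrated on the CPU with plain Python. This is only an analogy to a fused GPU kernel, and the function names are made up for the example. The unfused version makes three passes and materializes two intermediate lists. The fused version does the same arithmetic in one pass.

```python
import math

def unfused(xs):
    # Three separate "kernels": each reads a full list and writes a full list.
    scaled = [2.0 * x for x in xs]
    shifted = [s + 1.0 for s in scaled]
    return [math.tanh(s) for s in shifted]

def fused(xs):
    # One "kernel": each element is read once and written once,
    # with no intermediate lists in memory.
    return [math.tanh(2.0 * x + 1.0) for x in xs]

xs = [0.0, 0.5, -1.25, 3.0]
print(unfused(xs) == fused(xs))
```

The results are identical because the arithmetic is unchanged. What fusion saves is the traffic to and from memory between steps, which is often the bottleneck on GPUs.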

Key Takeaways

  1. Data quality > data quantity — filtering and curation matter more than raw scale
  2. Scaling laws guide strategy — ~20 tokens per parameter is the Chinchilla compute-optimal ratio
  3. Alignment transforms models — SFT + RLHF/DPO make LLMs usable and safe
  4. Systems optimization is essential — without it, training is economically infeasible
  5. Evaluation is still an open problem — especially for open-ended, aligned models

Further Reading

  • CS224N — NLP with Deep Learning
  • CS324 — Large Language Models
  • CS336 — Language Modeling from Scratch