YouTube Talking Head Video: Minimal-Human AI Pipeline

A complete production workflow for talking head YouTube videos with illustrating animations. Works from raw footage. Target: ~20 minutes human effort per video.

Architecture Overview

raw.mp4
  └─ Stage 1: Whisper → timestamped transcript
       └─ Stage 2: Claude EDL → FFmpeg cuts → cut.mp4
            └─ Stage 3: Claude illustration brief → per-segment JSON
                 └─ Stage 4: HyperFrames → overlay animations
                      └─ Stage 5: Claude writes CapCut project JSON
                           └─ Stage 6: Polish (color, SFX, BGM, thumbnail)
                                └─ published.mp4

Human checkpoints: after Stage 2 (5 min), after Stage 4 (10 min), after Stage 6 (5 min).

Tool Stack

Layer	Tool	Why
Transcription	Whisper `large-v3`	Best word-level timestamps, local, free
Video processing	FFmpeg	Universal, scriptable, no GUI
Orchestration	Claude Code	Reads transcript, generates all downstream artifacts
Animation	HyperFrames (HeyGen)	HTML → MP4, agent-native, GSAP/Three.js/Lottie
Visual assets	Claude Design (`claude.ai/design`)	Motion graphics with green screen bg
Assembly	CapCut project file (JSON)	Claude writes it directly, no GUI clicks
Captions	Whisper SRT + FFmpeg	Auto-generated, burned or soft
Thumbnail	Claude Design + FFmpeg frame extract	3 variants from best frame

Optional Upgrades

Scenario	Swap
Professional color grading	DaVinci Resolve (Claude writes .drp nodes)
Complex 3D illustrations	Three.js inside HyperFrames
Data-driven charts	Remotion (React components)
Photorealistic stock visuals	Seedance / Kling via API

Stage 1 — Ingest & Transcribe

Human time: 0 min. Runs automatically.

Command

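whisper raw.mp4 \
  --model large-v3 \
  --output_format json \
  --word_timestamps True \
  --output_dir ./pipeline/01_transcript/

# Whisper names its output after the input file, so rename raw.json
mv ./pipeline/01_transcript/raw.json ./pipeline/01_transcript/transcript.json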

Output: `transcript.json`

{
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 4.2,
      "text": "Today I want to talk about HTTP caching.",
      "words": [
        {"word": "Today", "start": 0.0, "end": 0.3, "probability": 0.99},
        {"word": "I", "start": 0.32, "end": 0.38, "probability": 0.99},
        ...
      ]
    }
  ]
}

What Claude reads next

Claude Code ingests transcript.json and flags every word/segment for:

Silence gaps > 0.4s between words → cut candidate
Filler words (uh, um, like, you know, sort of, basically, literally, right?) → cut candidate
False starts (sentence repeated within 5s of itself) → keep second take, cut first
Topic shifts (semantic analysis) → illustration insertion point marker
Key terms (nouns, concepts mentioned for first time) → callout annotation candidate
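
The first two rules above are mechanical enough to sketch directly. This is a minimal illustration, assuming the word dicts have the shape shown in transcript.json; `flag_cut_candidates` and the `FILLERS` set are hypothetical names, and only single-token fillers are handled here (multi-word fillers, false starts, and topic shifts need more context than a word-by-word scan).

```python
FILLERS = {"uh", "um"}  # single-token subset of the filler list above (assumption)

def flag_cut_candidates(words, max_gap=0.4, fillers=FILLERS):
    """Return (start, end, reason) cut candidates from Whisper word dicts."""
    candidates = []
    prev_end = None
    for w in words:
        # Whisper word tokens usually carry a leading space and may carry punctuation
        token = w["word"].strip().lower().strip(".,!?")
        if prev_end is not None and w["start"] - prev_end > max_gap:
            candidates.append((prev_end, w["start"], "silence"))
        if token in fillers:
            candidates.append((w["start"], w["end"], f"filler: {token}"))
        prev_end = w["end"]
    return candidates

words = [
    {"word": " So", "start": 0.0, "end": 0.2},
    {"word": " um", "start": 0.25, "end": 0.5},
    {"word": " caching", "start": 1.2, "end": 1.7},
]
candidates = flag_cut_candidates(words)
print(candidates)
```

The result is a list of regions to remove; the kept segments are whatever lies between them.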

Stage 2 — Smart Cutting

Human time: 5 min spot-check.

Claude generates EDL

Claude Code produces a Python script that lists the segments to keep (everything outside the flagged regions) and builds an FFmpeg filter_complex from them:

# pipeline/02_cuts/edl.py
# Claude writes this based on transcript analysis
import subprocess

CUTS = [
    # (keep_start, keep_end, reason)
    (0.00,  4.20, "intro"),
    (5.80, 12.10, "skip filler 4.2-5.8: 'uh...um...'"),
    (12.10, 18.45, "keep"),
    (19.20, 31.00, "skip false start 18.45-19.2"),
    # ... (remaining segments elided)
]

def build_ffmpeg_cmd(cuts, input_file, output_file):
    """Build an ffmpeg command that keeps only the listed segments."""
    segments = []
    filter_parts = []
    for i, (start, end, _) in enumerate(cuts):
        # Trim video and audio to the kept range, then reset timestamps to zero
        filter_parts.append(
            f"[0:v]trim=start={start}:end={end},setpts=PTS-STARTPTS[v{i}];"
            f"[0:a]atrim=start={start}:end={end},asetpts=PTS-STARTPTS[a{i}];"
        )
        segments.append(f"[v{i}][a{i}]")

    # Concatenate the trimmed pairs in order: [v0][a0][v1][a1]...
    concat = "".join(segments) + f"concat=n={len(cuts)}:v=1:a=1[vout][aout]"
    filter_complex = "".join(filter_parts) + concat

    return [
        "ffmpeg", "-i", input_file,
        "-filter_complex", filter_complex,
        "-map", "[vout]", "-map", "[aout]",
        "-c:v", "libx264", "-crf", "18",
        "-c:a", "aac", "-b:a", "192k",
        output_file
    ]

if __name__ == "__main__":
    subprocess.run(build_ffmpeg_cmd(CUTS, "raw.mp4", "pipeline/02_cuts/cut.mp4"), check=True)

Output: `pipeline/02_cuts/cut.mp4`

Clean talking head, no dead air, no filler. Transcript updated with new timestamps (transcript_cut.json).
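
Re-timing the transcript amounts to subtracting the total removed duration that precedes each timestamp. A minimal sketch, assuming the (keep_start, keep_end, reason) tuples from CUTS; `remap_time` is a hypothetical helper name, not something the pipeline is known to ship:

```python
def remap_time(t, cuts):
    """Map a timestamp in raw.mp4 to cut.mp4, or None if it falls in a removed region."""
    offset = 0.0  # total kept duration before the current segment
    for start, end, _reason in cuts:
        if start <= t <= end:
            return round(offset + (t - start), 3)
        offset += end - start
    return None

cuts = [(0.00, 4.20, "intro"), (5.80, 12.10, "keep"), (12.10, 18.45, "keep")]
print(remap_time(6.8, cuts))   # inside the second kept segment
print(remap_time(5.0, cuts))   # inside the removed 4.2-5.8 region
```

Applying this to every segment and word boundary in transcript.json, and dropping entries that map to None, would produce the re-timed transcript.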

Human checkpoint

Watch at 2× speed. Reject any cut that broke a sentence. Re-run with adjusted thresholds.

Stage 3 — Illustration Planning

Human time: 0 min.

Claude produces illustration brief

Claude Code reads transcript_cut.json, performs semantic analysis, outputs:

// pipeline/03_briefs/illustrations.json
[
  {
    "id": "ill_001",
    "timestamp_in": "00:00:45.2",
    "timestamp_out": "00:00:53.0",
    "duration_s": 7.8,
    "mode": "picture_in_picture",
    "concept": "HTTP request-response cycle",
    "trigger_quote": "...the browser sends a request and waits for the server to respond...",
    "illustration_type": "animated_diagram",
    "elements": [
      "browser icon on left",
      "server icon on right",
      "arrow labeled GET /page flying left-to-right",
      "arrow labeled 200 OK flying right-to-left",
      "timestamps on arrows"
    ],
    "style": "dark background, monospace labels, blue accent color",
    "animation": "sequential reveal, arrows draw in sync with speech"
  },
  {
    "id": "ill_002",
    "timestamp_in": "00:02:10.0",
    "timestamp_out": "00:02:15.5",
    "duration_s": 5.5,
    "mode": "lower_third",
    "concept": "Cache-Control header definition",
    "trigger_quote": "...Cache-Control is the header that tells browsers how long to store...",
    "illustration_type": "callout",
    "elements": [
      "term: Cache-Control",
      "definition: HTTP header controlling caching behavior",
      "example: Cache-Control: max-age=3600"
    ],
    "style": "bottom 25% of frame, semi-transparent bg, monospace font"
  },
  {
    "id": "ill_003",
    "timestamp_in": "00:04:30.0",
    "timestamp_out": "00:04:45.0",
    "duration_s": 15.0,
    "mode": "fullscreen",
    "concept": "Cache hit vs cache miss flow",
    "trigger_quote": "...so on a cache hit, the browser never contacts the server at all...",
    "illustration_type": "flowchart",
    "elements": [
      "START: Browser needs resource",
      "DIAMOND: In cache? + not expired?",
      "YES path → Return cached → END",
      "NO path → Fetch from server → Store in cache → Return → END"
    ],
    "style": "dark bg, green for cache hit path, orange for miss path"
  }
]
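
Because the brief is model-generated, it is worth checking that each duration_s agrees with its timestamps before rendering. A minimal sketch, assuming the HH:MM:SS.s timestamp format shown above; `parse_timestamp` and `check_brief` are hypothetical helper names:

```python
def parse_timestamp(ts):
    """Convert 'HH:MM:SS.s' to seconds."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def check_brief(item, tolerance=0.05):
    """True if duration_s matches timestamp_out - timestamp_in within tolerance."""
    span = parse_timestamp(item["timestamp_out"]) - parse_timestamp(item["timestamp_in"])
    return abs(span - item["duration_s"]) <= tolerance

brief = {"timestamp_in": "00:00:45.2", "timestamp_out": "00:00:53.0", "duration_s": 7.8}
print(parse_timestamp(brief["timestamp_in"]), check_brief(brief))
```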

Insertion modes

Mode	When to use	Implementation
`fullscreen`	Complex diagram needs full attention, speaker pauses	Replace talking head for duration
`picture_in_picture`	Speaker continues talking while showing concept	Talking head shrinks to corner (20% size), diagram fills 80%
`lower_third`	Quick definition, stat, or code snippet	Bottom 25% overlay, transparent bg, speaker visible
`side_by_side`	Comparison, before/after	Frame splits 50/50

Stage 4 — Illustration Generation

Human time: 10 min review + prompt-fix.

HyperFrames pipeline per illustration

Claude Code reads each item in illustrations.json and generates a HyperFrames HTML composition:

# For each illustration brief:
claude --print "Generate a HyperFrames HTML composition for: $(cat ill_001.json)" \
  > pipeline/04_animations/ill_001.html
 
npx hyperframes render pipeline/04_animations/ill_001.html \
  --output pipeline/04_animations/ill_001.mp4 \
  --duration 7.8 \
  --fps 60 \
  --width 1920 --height 1080
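
To run the render step for every brief rather than one file at a time, the command can be built from each item in illustrations.json. This sketch reuses only the flags shown in the command above; `render_command` is a hypothetical helper, and the actual HyperFrames CLI options should be checked against its documentation.

```python
def render_command(item, out_dir="pipeline/04_animations"):
    """Build the render argv for one illustration brief (flags copied from the command above)."""
    base = f"{out_dir}/{item['id']}"
    return [
        "npx", "hyperframes", "render", f"{base}.html",
        "--output", f"{base}.mp4",
        "--duration", str(item["duration_s"]),
        "--fps", "60",
        "--width", "1920", "--height", "1080",
    ]

cmd = render_command({"id": "ill_001", "duration_s": 7.8})
print(" ".join(cmd))
```

Each argv can then be passed to subprocess.run once the matching HTML file exists.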

Example HyperFrames output for `ill_001` (animated diagram)

<!DOCTYPE html>
<html>
<head>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/gsap/3.12.0/gsap.min.js"></script>
  <style>
    body { margin: 0; background: #0f1117; font-family: 'JetBrains Mono', monospace; }
    .stage { width: 1920px; height: 1080px; position: relative; display: flex;
             align-items: center; justify-content: space-between; padding: 200px; }
    .node { display: flex; flex-direction: column; align-items: center; gap: 16px; opacity: 0; }
    .icon { width: 120px; height: 120px; }
    .label { color: #94a3b8; font-size: 24px; }
    .arrow-container { position: absolute; width: 600px; left: 50%; transform: translateX(-50%); }
    .arrow { opacity: 0; }
    .arrow-label { color: #60a5fa; font-size: 20px; }
  </style>
</head>
<body>
  <div class="stage">
    <div class="node" id="browser">
      <img class="icon" src="browser-icon.svg">
      <span class="label">Browser</span>
    </div>
    <div class="arrow-container">
      <div class="arrow" id="req-arrow">→ GET /page ───────────────</div>
      <div class="arrow" id="res-arrow">←────────────── 200 OK ←</div>
    </div>
    <div class="node" id="server">
      <img class="icon" src="server-icon.svg">
      <span class="label">Server</span>
    </div>
  </div>
  <script>
    // HyperFrames seek-driven: register on window.__hfAnime
    const tl = gsap.timeline({ paused: true });
    tl.to('#browser', { opacity: 1, duration: 0.4 })
      .to('#server',  { opacity: 1, duration: 0.4 }, '<')
      .to('#req-arrow', { opacity: 1, duration: 0.6 })
      .to('#res-arrow', { opacity: 1, duration: 0.6 }, '+=0.8');
    window.__hfAnime = (t) => { tl.seek(t); };
  </script>
</body>
</html>

Animation types Claude generates

Type	Technology	Use case
Flowchart / DAG	SVG + GSAP path draw	Decision trees, processes, flows
Code reveal	`<pre>` + GSAP stagger	Step-by-step code explanation
Data chart	Canvas / D3-lite	Stats, comparisons, growth
Network diagram	SVG circles + lines	Architecture, APIs, relationships
Lower third	CSS + GSAP slide-in	Term definitions, stats, quotes
3D model	Three.js	Conceptual objects, spatial relationships
Lottie animation	Lottie player	Icons, micro-animations
Text kinetic	GSAP SplitText	Key phrases, emphasis

For photorealistic assets

Claude Design prompt:
"Create a motion graphic showing [concept] on a pure green screen (#00ff00) background.
Animate in from left, hold 5s, animate out right. Export as single continuous video.
Style: [dark/light], [color palette]."

→ Download from Claude Design
→ FFmpeg chroma key: -vf "chromakey=0x00ff00:0.1:0.2"
→ Composite over talking head

Human checkpoint

Open each ill_*.mp4. For any that miss the concept, write a corrected brief and re-run Claude + HyperFrames. Typically 1–2 iterations on complex diagrams.

Stage 5 — Assembly

Human time: 0 min.

Claude writes CapCut project JSON

CapCut stores projects as a JSON file (draft_content.json) in ~/Movies/CapCut/User Data/Projects/com.lveditor.draft/<project-id>/.

Claude Code generates this file directly:

{
  "tracks": [
    {
      "type": "video",
      "segments": [
        {
          "material_id": "cut_mp4",
          "target_timerange": {"start": 0, "duration": 3920000},
          "source_timerange": {"start": 0, "duration": 3920000}
        }
      ]
    },
    {
      "type": "video",
      "segments": [
        {
          "material_id": "ill_001_mp4",
          "target_timerange": {"start": 45200000, "duration": 7800000},
          "extra_material_refs": ["pip_effect"],
          "clip": {
            "scale": {"x": 0.2, "y": 0.2},
            "position": {"x": 0.75, "y": 0.75}
          }
        },
        {
          "material_id": "ill_002_mp4",
          "target_timerange": {"start": 130000000, "duration": 5500000},
          "extra_material_refs": ["lower_third_effect"]
        }
      ]
    },
    {
      "type": "text",
      "segments": [] // captions injected from SRT below
    },
    {
      "type": "audio",
      "segments": [
        {
          "material_id": "bgm_mp3",
          "target_timerange": {"start": 0, "duration": 3920000},
          "volume": 0.12
        }
      ]
    }
  ],
  "materials": {
    "videos": [
      {"id": "cut_mp4", "path": "/pipeline/02_cuts/cut.mp4"},
      {"id": "ill_001_mp4", "path": "/pipeline/04_animations/ill_001.mp4"},
      {"id": "ill_002_mp4", "path": "/pipeline/04_animations/ill_002.mp4"}
    ],
    "audios": [
      {"id": "bgm_mp3", "path": "/pipeline/06_polish/bgm.mp3"}
    ]
  }
}

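CapCut timestamps use microseconds (1 s = 1,000,000 units), so an illustration that starts at 45.2 s and lasts 7.8 s gets "start": 45200000 and "duration": 7800000. Claude converts from seconds automatically.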
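
The conversion itself is a single multiplication, but rounding matters because floating-point seconds rarely land on an exact integer. A minimal sketch; `to_timerange` is a hypothetical helper name, and the "start"/"duration" keys follow the JSON above:

```python
def to_timerange(start_s, duration_s):
    """Convert seconds to a microsecond timerange dict."""
    return {
        "start": round(start_s * 1_000_000),
        "duration": round(duration_s * 1_000_000),
    }

timerange = to_timerange(45.2, 7.8)
print(timerange)
```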

Alternative: pure FFmpeg assembly (no CapCut)

For fully headless pipeline:

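# Claude generates this ffmpeg command
# Each illustration is shifted to its insertion time with setpts, otherwise it
# would start playing at t=0 and be finished before the overlay is enabled.
# ill_001 is scaled to 20% so the picture-in-picture fits inside the frame.
ffmpeg \
  -i pipeline/02_cuts/cut.mp4 \
  -i pipeline/04_animations/ill_001.mp4 \
  -i pipeline/04_animations/ill_003.mp4 \
  -i pipeline/06_polish/bgm.mp3 \
  -filter_complex "
    [1:v]scale=iw*0.2:ih*0.2,setpts=PTS-STARTPTS+45.2/TB[ill1];
    [2:v]setpts=PTS-STARTPTS+270.0/TB[ill3];
    [0:v][ill1]overlay=W*0.75:H*0.75:enable='between(t,45.2,53.0)'[v1];
    [v1][ill3]overlay=0:0:enable='between(t,270.0,285.0)'[v2];
    [0:a]volume=1[speech];
    [3:a]volume=0.12[bgm];
    [speech][bgm]amix=inputs=2:duration=first[aout]
  " \
  -map "[v2]" -map "[aout]" \
  -c:v libx264 -crf 16 -preset slow \
  -c:a aac -b:a 192k \
  pipeline/05_assembly/assembled.mp4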

Captions

# Use the re-timed transcript (transcript_cut.json) so captions line up with the cut video
# Convert to SRT:
python -c "
import json
data = json.load(open('pipeline/01_transcript/transcript_cut.json'))
for i, seg in enumerate(data['segments'], 1):
    start = seg['start']; end = seg['end']
    print(f'{i}')
    print(f'{int(start//3600):02}:{int((start%3600)//60):02}:{start%60:06.3f}'.replace('.',',') +
          ' --> ' +
          f'{int(end//3600):02}:{int((end%3600)//60):02}:{end%60:06.3f}'.replace('.',','))
    print(seg['text'].strip())
    print()
" > pipeline/05_assembly/captions.srt
 
# Burn into video:
ffmpeg -i assembled.mp4 -vf "subtitles=captions.srt:force_style='FontName=Inter,FontSize=28,Bold=1,Outline=2'" -c:a copy final.mp4

Stage 6 — Polish

Human time: 5 min final review.

Color grading

Claude Code generates a LUT or CapCut color node:

Claude prompt: "Generate a color grade LUT for a talking head video shot
in a home office. Warm skin tones, slight contrast boost, teal shadows.
Output as .cube format."

Or for DaVinci Resolve, Claude writes node parameters directly to .drp file.
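
For reference, a .cube file is plain text: a LUT_3D_SIZE header followed by one RGB triple per grid point, with the red index changing fastest. The sketch below writes only a simple contrast curve, as an assumption for illustration; it does not implement the warm skin tones or teal shadows from the prompt, and `contrast` and `make_cube` are hypothetical names.

```python
def contrast(x, amount=1.1):
    """Scale a 0..1 value around mid-gray and clamp to the valid range."""
    return min(1.0, max(0.0, (x - 0.5) * amount + 0.5))

def make_cube(size=17, amount=1.1):
    """Return the text of a 3D .cube LUT applying the contrast curve per channel."""
    lines = ['TITLE "talking_head_sketch"', f"LUT_3D_SIZE {size}"]
    for b in range(size):
        for g in range(size):
            for r in range(size):  # red varies fastest in the .cube layout
                rgb = [contrast(c / (size - 1), amount) for c in (r, g, b)]
                lines.append(" ".join(f"{v:.6f}" for v in rgb))
    return "\n".join(lines) + "\n"

cube_text = make_cube()
print(cube_text.splitlines()[1])
```

Writing cube_text to a .cube file gives something FFmpeg's lut3d filter or an editor's LUT import can load for a quick visual check.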

BGM placement

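# Claude selects BGM track from royalty-free library
# Ducks the BGM under speech using FFmpeg sidechain compression:
# the BGM is the signal being compressed, the speech is the sidechain trigger.
# Assumes assembled.mp4 carries speech only (no BGM mixed in yet).

ffmpeg -i assembled.mp4 -i bgm.mp3 -filter_complex "
  [0:a]asplit=2[speech][trigger];
  [1:a]volume=0.15[bgm];
  [bgm][trigger]sidechaincompress=threshold=0.02:ratio=4:attack=200:release=1000[ducked];
  [speech][ducked]amix=inputs=2:duration=first[aout]
" -map 0:v -map "[aout]" -c:v copy polished.mp4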

SFX placement

Claude reads transcript, identifies transition moments (topic shifts, cuts between sections), places SFX at exact timestamps:

SFX_MAP = [
    (0.0,   "pipeline/sfx/intro_whoosh.mp3"),
    (45.2,  "pipeline/sfx/transition_tone.mp3"),  # illustration in
    (270.0, "pipeline/sfx/section_shift.mp3"),    # new topic
]
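
One way to turn such a map into an FFmpeg filter graph is to delay each effect with adelay (milliseconds, one value per channel) and mix everything with amix. A minimal sketch, assuming the video is input 0 and the SFX files follow as inputs 1, 2, ... in map order; `build_sfx_filter` is a hypothetical helper name:

```python
def build_sfx_filter(sfx_map, first_input=1):
    """Build a filter_complex string that places each SFX at its timestamp."""
    parts, labels = [], []
    for i, (t, _path) in enumerate(sfx_map):
        ms = round(t * 1000)  # adelay takes milliseconds
        parts.append(f"[{first_input + i}:a]adelay={ms}|{ms}[sfx{i}]")
        labels.append(f"[sfx{i}]")
    # Mix the main audio with all delayed effects; output length follows the main audio
    mix = f"[0:a]{''.join(labels)}amix=inputs={len(labels) + 1}:duration=first[aout]"
    return ";".join(parts + [mix])

sfx_filter = build_sfx_filter([
    (0.0, "pipeline/sfx/intro_whoosh.mp3"),
    (45.2, "pipeline/sfx/transition_tone.mp3"),
])
print(sfx_filter)
```

The string goes to -filter_complex, with one -i per SFX file after the video input and -map "[aout]" for the audio output.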

Thumbnail

# Extract best frame (Claude picks timestamp from transcript - hook moment)
ffmpeg -i cut.mp4 -ss 00:00:08 -vframes 1 pipeline/06_polish/best_frame.png
 
# Claude Design generates 3 thumbnail variants using best_frame.png as reference
# Human picks one

Directory Structure

project/
├── raw.mp4                          # input
├── pipeline/
│   ├── 01_transcript/
│   │   ├── transcript.json          # Whisper output
│   │   └── transcript_cut.json      # updated after cuts
│   ├── 02_cuts/
│   │   ├── edl.py                   # Claude-generated cut script
│   │   └── cut.mp4                  # clean talking head
│   ├── 03_briefs/
│   │   └── illustrations.json       # Claude-generated brief
│   ├── 04_animations/
│   │   ├── ill_001.html             # HyperFrames source
│   │   ├── ill_001.mp4              # rendered animation
│   │   └── ...
│   ├── 05_assembly/
│   │   ├── captions.srt
│   │   ├── draft_content.json       # CapCut project file
│   │   └── assembled.mp4
│   └── 06_polish/
│       ├── bgm.mp3
│       ├── polished.mp4
│       └── thumbnails/
│           ├── variant_a.png
│           ├── variant_b.png
│           └── variant_c.png
└── published.mp4                    # final output

Claude Code Prompts (Copy-Paste)

Stage 2: Cut analysis

Read pipeline/01_transcript/transcript.json. Produce pipeline/02_cuts/edl.py.

Rules:
- Remove silence gaps > 0.4s between words
- Remove these filler words: uh, um, like, you know, sort of, basically, literally, right (when standalone)
- Detect false starts: same sentence root repeated within 5s → keep second occurrence
- Add 0.05s padding before each kept segment to avoid hard cuts
- Output Python list CUTS = [(start, end, reason), ...] then build_ffmpeg_cmd() function
- Print total removed duration at end

Stage 3: Illustration brief

Read pipeline/01_transcript/transcript_cut.json. Produce pipeline/03_briefs/illustrations.json.

For each conceptual explanation in the transcript:
- Identify the timestamp where the concept is introduced
- Determine insertion mode: fullscreen (complex diagram, speaker pauses), 
  picture_in_picture (speaker continues talking), lower_third (quick definition/stat)
- Write detailed element list for the animation
- Specify duration: match natural speech pause or concept explanation length
- Flag any moments where code appears: always use lower_third or fullscreen code reveal

Output strict JSON array. 6-10 illustrations for a 10-minute video is normal.

Stage 4: HyperFrames generation

Generate a HyperFrames HTML composition for this illustration brief:
[paste single illustration JSON object]

Requirements:
- GSAP timeline, paused, registered as window.__hfAnime = (t) => tl.seek(t)
- 1920x1080px canvas
- Dark background (#0f1117)
- Blue accent (#60a5fa), green positive path (#4ade80), orange negative (#fb923c)
- JetBrains Mono for code/labels, Inter for prose
- Sequential reveal synchronized to expected speech timing
- Duration: [X]s total
- Mode: [fullscreen|picture_in_picture|lower_third]
Output complete self-contained HTML file.

Human Effort Summary

Step	Human action	Time
Record	Film talking head	20–40 min
Stage 2 review	Watch cut.mp4 at 2×	5 min
Stage 4 review	Check each animation	10 min
Stage 6 final	Watch polished.mp4, pick thumbnail	5 min
Total editing		~20 min

Traditional workflow: 4–8 hours.

Build Order (First-Time Setup)

brew install ffmpeg openai-whisper: core tools (the Homebrew formula for Whisper is named openai-whisper)
npm install -g @hyperframes/cli: animation renderer
Build Stage 1+2 first: biggest time saving, works immediately
Build illustration brief prompt: no code, just Claude prompt engineering
Build HyperFrames template library: 6 reusable types (diagram, flowchart, code, chart, lower-third, callout), generate once, re-prompt per video
Build CapCut JSON writer: Claude Code reads project format, writes directly
Build thumbnail pipeline last: polish step, lowest leverage
