YouTube Talking Head Video: Minimal-Human AI Pipeline
A complete production workflow for talking head YouTube videos with illustrating animations. Works from raw footage. Target: ~20 minutes human effort per video.
Architecture Overview
raw.mp4
└─ Stage 1: Whisper → timestamped transcript
└─ Stage 2: Claude EDL → FFmpeg cuts → cut.mp4
└─ Stage 3: Claude illustration brief → per-segment JSON
└─ Stage 4: HyperFrames → overlay animations
└─ Stage 5: Claude writes CapCut project JSON
└─ Stage 6: Polish (color, SFX, BGM, thumbnail)
└─ published.mp4
Human checkpoints: after Stage 2 (5 min), after Stage 4 (10 min), after Stage 6 (5 min).
Tool Stack
| Layer | Tool | Why |
|---|---|---|
| Transcription | Whisper large-v3 | Best word-level timestamps, local, free |
| Video processing | FFmpeg | Universal, scriptable, no GUI |
| Orchestration | Claude Code | Reads transcript, generates all downstream artifacts |
| Animation | HyperFrames (HeyGen) | HTML → MP4, agent-native, GSAP/Three.js/Lottie |
| Visual assets | Claude Design (claude.ai/design) | Motion graphics with green screen bg |
| Assembly | CapCut project file (JSON) | Claude writes it directly, no GUI clicks |
| Captions | Whisper SRT + FFmpeg | Auto-generated, burned or soft |
| Thumbnail | Claude Design + FFmpeg frame extract | 3 variants from best frame |
Optional Upgrades
| Scenario | Swap |
|---|---|
| Professional color grading | DaVinci Resolve (Claude writes .drp nodes) |
| Complex 3D illustrations | Three.js inside HyperFrames |
| Data-driven charts | Remotion (React components) |
| Photorealistic stock visuals | Seedance / Kling via API |
Stage 1 — Ingest & Transcribe
Human time: 0 min. Runs automatically.
Command
whisper raw.mp4 \
--model large-v3 \
--output_format json \
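--word_timestamps True \
--output_dir ./pipeline/01_transcript/
# Whisper names its output after the input file (raw.json), so rename it:
mv ./pipeline/01_transcript/raw.json ./pipeline/01_transcript/transcript.json
Output: transcript.json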
{
"segments": [
{
"id": 0,
"start": 0.0,
"end": 4.2,
"text": "Today I want to talk about HTTP caching.",
"words": [
{"word": "Today", "start": 0.0, "end": 0.3, "probability": 0.99},
{"word": "I", "start": 0.32, "end": 0.38, "probability": 0.99},
...
]
}
]
}
What Claude reads next
Claude Code ingests transcript.json and flags every word/segment for:
- Silence gaps > 0.4s between words → cut candidate
- Filler words (uh, um, like, you know, sort of, basically, literally, right?) → cut candidate
- False starts (sentence repeated within 5s of itself) → keep second take, cut first
- Topic shifts (semantic analysis) → illustration insertion point marker
- Key terms (nouns, concepts mentioned for first time) → callout annotation candidate
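The two mechanical rules in this list (silence gaps and single-word fillers) can be sketched in a few lines of Python. This is an illustrative sketch, not the pipeline's actual implementation: `flag_cut_candidates`, `GAP_THRESHOLD_S`, and `FILLERS` are hypothetical names, and it assumes the Whisper JSON layout shown above. Context-free matching will over-flag words such as "like", which is why the judgment calls (false starts, topic shifts, multi-word fillers) are left to Claude.

```python
GAP_THRESHOLD_S = 0.4  # silence threshold from the rule above
FILLERS = {"uh", "um", "like", "basically", "literally"}  # single-word fillers only

def flag_cut_candidates(transcript: dict) -> list[dict]:
    """Return cut candidates for silence gaps and single-word fillers."""
    # Flatten word entries across all segments, in order.
    words = [w for seg in transcript["segments"] for w in seg.get("words", [])]
    flags = []
    # A gap is the time between one word's end and the next word's start.
    for prev, cur in zip(words, words[1:]):
        if cur["start"] - prev["end"] > GAP_THRESHOLD_S:
            flags.append({"start": prev["end"], "end": cur["start"], "reason": "silence"})
    # Whisper word tokens carry leading spaces and punctuation; normalize first.
    for w in words:
        token = w["word"].strip().strip(".,!?").lower()
        if token in FILLERS:
            flags.append({"start": w["start"], "end": w["end"], "reason": f"filler:{token}"})
    return sorted(flags, key=lambda f: f["start"])
```

Calling `flag_cut_candidates(json.load(open("pipeline/01_transcript/transcript.json")))` would yield a list of regions to review, each tagged with a reason.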
Stage 2 — Smart Cutting
Human time: 5 min spot-check.
Claude generates EDL
Claude Code produces a Python script that builds FFmpeg filter complex from flagged regions:
# pipeline/02_cuts/edl.py
# Claude writes this based on transcript analysis
CUTS = [
    # (keep_start, keep_end, reason)
    (0.00, 4.20, "intro"),
    (5.80, 12.10, "skip filler 4.2-5.8: 'uh...um...'"),
    (12.10, 18.45, "keep"),
    (19.20, 31.00, "skip false start 18.45-19.2"),
    # ...
]

def build_ffmpeg_cmd(cuts, input_file, output_file):
    """Build an ffmpeg command that trims each kept range and concatenates them."""
    segments = []
    filter_parts = []
    for i, (start, end, _) in enumerate(cuts):
        # Trim video and audio to the kept range and reset timestamps to zero.
        filter_parts.append(
            f"[0:v]trim=start={start}:end={end},setpts=PTS-STARTPTS[v{i}];"
            f"[0:a]atrim=start={start}:end={end},asetpts=PTS-STARTPTS[a{i}];"
        )
        segments.append(f"[v{i}][a{i}]")
    concat = "".join(segments) + f"concat=n={len(cuts)}:v=1:a=1[vout][aout]"
    filter_complex = "".join(filter_parts) + concat
    return [
        "ffmpeg", "-i", input_file,
        "-filter_complex", filter_complex,
        "-map", "[vout]", "-map", "[aout]",
        "-c:v", "libx264", "-crf", "18",
        "-c:a", "aac", "-b:a", "192k",
        output_file
    ]
Output: pipeline/02_cuts/cut.mp4
Clean talking head, no dead air, no filler. Transcript updated with new timestamps (transcript_cut.json).
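Updating the transcript means mapping every timestamp from the raw timeline onto the cut timeline. A minimal sketch of that remapping follows, assuming the `CUTS` tuple layout shown above. `remap_time` and `remap_segments` are hypothetical helper names, and for simplicity a segment is dropped if either boundary falls inside a removed region (word-level entries are not remapped here).

```python
def remap_time(t, cuts):
    """Map a timestamp in raw.mp4 to cut.mp4; return None if it was cut out."""
    offset = 0.0  # total kept duration before the current range
    for keep_start, keep_end, _reason in cuts:
        if keep_start <= t <= keep_end:
            return round(offset + (t - keep_start), 3)
        offset += keep_end - keep_start
    return None

def remap_segments(segments, cuts):
    """Return copies of the segments with start/end on the cut timeline."""
    out = []
    for seg in segments:
        start = remap_time(seg["start"], cuts)
        end = remap_time(seg["end"], cuts)
        if start is not None and end is not None:
            out.append({**seg, "start": start, "end": end})
    return out
```

With the first two ranges above, a word spoken at 6.0s in the raw footage lands 0.2s into the second kept range, after the 4.2s kept from the first.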
Human checkpoint
Watch at 2× speed. Reject any cut that broke a sentence. Re-run with adjusted thresholds.
Stage 3 — Illustration Planning
Human time: 0 min.
Claude produces illustration brief
Claude Code reads transcript_cut.json, performs semantic analysis, outputs:
// pipeline/03_briefs/illustrations.json
[
{
"id": "ill_001",
"timestamp_in": "00:00:45.2",
"timestamp_out": "00:00:53.0",
"duration_s": 7.8,
"mode": "picture_in_picture",
"concept": "HTTP request-response cycle",
"trigger_quote": "...the browser sends a request and waits for the server to respond...",
"illustration_type": "animated_diagram",
"elements": [
"browser icon on left",
"server icon on right",
"arrow labeled GET /page flying left-to-right",
"arrow labeled 200 OK flying right-to-left",
"timestamps on arrows"
],
"style": "dark background, monospace labels, blue accent color",
"animation": "sequential reveal, arrows draw in sync with speech"
},
{
"id": "ill_002",
"timestamp_in": "00:02:10.0",
"timestamp_out": "00:02:15.5",
"duration_s": 5.5,
"mode": "lower_third",
"concept": "Cache-Control header definition",
"trigger_quote": "...Cache-Control is the header that tells browsers how long to store...",
"illustration_type": "callout",
"elements": [
"term: Cache-Control",
"definition: HTTP header controlling caching behavior",
"example: Cache-Control: max-age=3600"
],
"style": "bottom 25% of frame, semi-transparent bg, monospace font"
},
{
"id": "ill_003",
"timestamp_in": "00:04:30.0",
"timestamp_out": "00:04:45.0",
"duration_s": 15.0,
"mode": "fullscreen",
"concept": "Cache hit vs cache miss flow",
"trigger_quote": "...so on a cache hit, the browser never contacts the server at all...",
"illustration_type": "flowchart",
"elements": [
"START: Browser needs resource",
"DIAMOND: In cache? + not expired?",
"YES path → Return cached → END",
"NO path → Fetch from server → Store in cache → Return → END"
],
"style": "dark bg, green for cache hit path, orange for miss path"
}
]
Insertion modes
| Mode | When to use | Implementation |
|---|---|---|
| fullscreen | Complex diagram needs full attention, speaker pauses | Replace talking head for duration |
| picture_in_picture | Speaker continues talking while showing concept | Talking head shrinks to corner (20% size), diagram fills 80% |
| lower_third | Quick definition, stat, or code snippet | Bottom 25% overlay, transparent bg, speaker visible |
| side_by_side | Comparison, before/after | Frame splits 50/50 |
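Because each brief carries both timestamps and a `duration_s`, and the mode must be one of the four listed here, a quick consistency check before rendering can catch malformed briefs early. The following is a sketch under those assumptions: `parse_ts`, `validate_brief`, and the 0.05s tolerance are illustrative choices, not part of the pipeline as described.

```python
MODES = {"fullscreen", "picture_in_picture", "lower_third", "side_by_side"}

def parse_ts(ts: str) -> float:
    """Convert an 'HH:MM:SS.s' timestamp to seconds."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def validate_brief(brief: dict, tolerance_s: float = 0.05) -> list[str]:
    """Return a list of human-readable problems; empty means the brief looks sane."""
    problems = []
    if brief["mode"] not in MODES:
        problems.append(f"{brief['id']}: unknown mode {brief['mode']!r}")
    span = parse_ts(brief["timestamp_out"]) - parse_ts(brief["timestamp_in"])
    if span <= 0:
        problems.append(f"{brief['id']}: timestamp_out is not after timestamp_in")
    elif abs(span - brief["duration_s"]) > tolerance_s:
        problems.append(f"{brief['id']}: duration_s does not match timestamps")
    return problems
```

Running this over every item in illustrations.json and refusing to render when any list is non-empty keeps a bad brief from costing a render cycle.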
Stage 4 — Illustration Generation
Human time: 10 min review + prompt-fix.
HyperFrames pipeline per illustration
Claude Code reads each item in illustrations.json and generates a HyperFrames HTML composition:
# For each illustration brief:
claude --print "Generate a HyperFrames HTML composition for: $(cat ill_001.json)" \
> pipeline/04_animations/ill_001.html
npx hyperframes render pipeline/04_animations/ill_001.html \
--output pipeline/04_animations/ill_001.mp4 \
--duration 7.8 \
--fps 60 \
--width 1920 --height 1080
Example HyperFrames output for ill_001 (animated diagram)
<!DOCTYPE html>
<html>
<head>
<script src="https://cdnjs.cloudflare.com/ajax/libs/gsap/3.12.0/gsap.min.js"></script>
<style>
body { margin: 0; background: #0f1117; font-family: 'JetBrains Mono', monospace; }
.stage { width: 1920px; height: 1080px; position: relative; display: flex;
align-items: center; justify-content: space-between; padding: 200px; }
.node { display: flex; flex-direction: column; align-items: center; gap: 16px; opacity: 0; }
.icon { width: 120px; height: 120px; }
.label { color: #94a3b8; font-size: 24px; }
.arrow-container { position: absolute; width: 600px; left: 50%; transform: translateX(-50%); }
.arrow { opacity: 0; }
.arrow-label { color: #60a5fa; font-size: 20px; }
</style>
</head>
<body>
<div class="stage">
<div class="node" id="browser">
<img class="icon" src="browser-icon.svg">
<span class="label">Browser</span>
</div>
<div class="arrow-container">
<div class="arrow" id="req-arrow">→ GET /page ───────────────</div>
<div class="arrow" id="res-arrow">←────────────── 200 OK ←</div>
</div>
<div class="node" id="server">
<img class="icon" src="server-icon.svg">
<span class="label">Server</span>
</div>
</div>
<script>
// HyperFrames seek-driven: register on window.__hfAnime
const tl = gsap.timeline({ paused: true });
tl.to('#browser', { opacity: 1, duration: 0.4 })
.to('#server', { opacity: 1, duration: 0.4 }, '<')
.to('#req-arrow', { opacity: 1, duration: 0.6 })
.to('#res-arrow', { opacity: 1, duration: 0.6 }, '+=0.8');
window.__hfAnime = (t) => { tl.seek(t); };
</script>
</body>
</html>
Animation types Claude generates
| Type | Technology | Use case |
|---|---|---|
| Flowchart / DAG | SVG + GSAP path draw | Decision trees, processes, flows |
| Code reveal | <pre> + GSAP stagger | Step-by-step code explanation |
| Data chart | Canvas / D3-lite | Stats, comparisons, growth |
| Network diagram | SVG circles + lines | Architecture, APIs, relationships |
| Lower third | CSS + GSAP slide-in | Term definitions, stats, quotes |
| 3D model | Three.js | Conceptual objects, spatial relationships |
| Lottie animation | Lottie player | Icons, micro-animations |
| Kinetic text | GSAP SplitText | Key phrases, emphasis |
For photorealistic assets
Claude Design prompt:
"Create a motion graphic showing [concept] on a pure green screen (#00ff00) background.
Animate in from left, hold 5s, animate out right. Export as single continuous video.
Style: [dark/light], [color palette]."
→ Download from Claude Design
→ FFmpeg chroma key: -vf "chromakey=0x00ff00:0.1:0.2"
→ Composite over talking head
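The chroma key and composite steps can be combined into one FFmpeg invocation. The sketch below only builds the argument list; `chroma_key_cmd` is a hypothetical helper name, and it assumes the green-screen clip should be overlaid at the top-left for its full length. It uses the `chromakey` parameters given above together with FFmpeg's `overlay` filter.

```python
def chroma_key_cmd(background, green_clip, output):
    """Build an ffmpeg command that keys out #00ff00 and overlays the clip."""
    filter_complex = (
        # Make pure green transparent (similarity 0.1, blend 0.2).
        "[1:v]chromakey=0x00ff00:0.1:0.2[fg];"
        # Composite over the talking head; pass the background through
        # unchanged once the overlay clip ends.
        "[0:v][fg]overlay=eof_action=pass[out]"
    )
    return [
        "ffmpeg", "-i", background, "-i", green_clip,
        "-filter_complex", filter_complex,
        "-map", "[out]", "-map", "0:a?",  # keep the background audio if present
        "-c:a", "copy",
        output,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`.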
Human checkpoint
Open each ill_*.mp4. For any that miss the concept, write a corrected brief and re-run Claude + HyperFrames. Typically 1–2 iterations on complex diagrams.
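Re-running only the illustrations that failed review is easy to script. The sketch below builds, but does not execute, the two commands per illustration. `rerender_jobs` is a hypothetical helper, and the `claude` and `hyperframes` arguments are copied from the commands earlier in this stage rather than independently verified.

```python
import json

def rerender_jobs(briefs, ids, out_dir="pipeline/04_animations"):
    """Build (not run) the generate and render commands for the given ids."""
    by_id = {b["id"]: b for b in briefs}
    jobs = []
    for ill_id in ids:
        brief = by_id[ill_id]
        html = f"{out_dir}/{ill_id}.html"
        mp4 = f"{out_dir}/{ill_id}.mp4"
        prompt = "Generate a HyperFrames HTML composition for: " + json.dumps(brief)
        jobs.append({
            "html": html,
            # stdout of this command is what gets saved to the html path
            "generate": ["claude", "--print", prompt],
            # flags copied from the render command in this guide
            "render": ["npx", "hyperframes", "render", html,
                       "--output", mp4,
                       "--duration", str(brief["duration_s"]),
                       "--fps", "60", "--width", "1920", "--height", "1080"],
        })
    return jobs
```

A driver would run each `generate` command with `subprocess.run(..., capture_output=True, text=True)`, write its stdout to `job["html"]`, then run `render`.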
Stage 5 — Assembly
Human time: 0 min.
Claude writes CapCut project JSON
CapCut stores projects as a JSON file (draft_content.json) in ~/Movies/CapCut/User Data/Projects/com.lveditor.draft/<project-id>/.
Claude Code generates this file directly:
{
"tracks": [
{
"type": "video",
"segments": [
{
"material_id": "cut_mp4",
"target_timerange": {"start": 0, "duration": 3920000},
"source_timerange": {"start": 0, "duration": 3920000}
}
]
},
{
"type": "video",
"segments": [
{
"material_id": "ill_001_mp4",
"target_timerange": {"start": 45200000, "duration": 7800000},
"extra_material_refs": ["pip_effect"],
"clip": {
"scale": {"x": 0.2, "y": 0.2},
"position": {"x": 0.75, "y": 0.75}
}
},
{
"material_id": "ill_002_mp4",
"target_timerange": {"start": 130000000, "duration": 5500000},
"extra_material_refs": ["lower_third_effect"]
}
]
},
{
"type": "text",
"segments": [] // captions injected from SRT below
},
{
"type": "audio",
"segments": [
{
"material_id": "bgm_mp3",
"target_timerange": {"start": 0, "duration": 3920000},
"volume": 0.12
}
]
}
],
"materials": {
"videos": [
{"id": "cut_mp4", "path": "/pipeline/02_cuts/cut.mp4"},
{"id": "ill_001_mp4", "path": "/pipeline/04_animations/ill_001.mp4"},
{"id": "ill_002_mp4", "path": "/pipeline/04_animations/ill_002.mp4"}
],
"audios": [
{"id": "bgm_mp3", "path": "/pipeline/06_polish/bgm.mp3"}
]
}
}
CapCut timestamps use microseconds (1s = 1,000,000 units). Claude converts from seconds automatically.
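The conversion itself is a single multiplication, but floating-point seconds should be rounded rather than truncated so that values like 45.2 do not come out one unit short. A minimal sketch, where `to_timerange` is a hypothetical helper name producing the `target_timerange` shape shown above:

```python
US_PER_S = 1_000_000  # CapCut time unit: microseconds

def to_timerange(start_s: float, duration_s: float) -> dict:
    """Convert a start and duration in seconds to a microsecond timerange."""
    return {
        "start": round(start_s * US_PER_S),
        "duration": round(duration_s * US_PER_S),
    }
```

For the ill_001 brief (in at 45.2s, lasting 7.8s), `to_timerange(45.2, 7.8)` gives a start of 45,200,000 and a duration of 7,800,000.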
Alternative: pure FFmpeg assembly (no CapCut)
For fully headless pipeline:
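# Claude generates this ffmpeg command.
# setpts shifts each illustration to its insertion time; without it the clip
# plays from t=0 and has already ended when the overlay is enabled.
# scale shrinks the picture-in-picture clip to 20% so it fits the corner.
ffmpeg \
-i pipeline/02_cuts/cut.mp4 \
-i pipeline/04_animations/ill_001.mp4 \
-i pipeline/04_animations/ill_003.mp4 \
-i pipeline/06_polish/bgm.mp3 \
-filter_complex "
[1:v]scale=iw*0.2:ih*0.2,setpts=PTS-STARTPTS+45.2/TB[ill1];
[2:v]setpts=PTS-STARTPTS+270.0/TB[ill3];
[0:v][ill1]overlay=W*0.75:H*0.75:enable='between(t,45.2,53.0)'[v1];
[v1][ill3]overlay=0:0:enable='between(t,270.0,285.0)'[v2];
[0:a]volume=1[speech];
[3:a]volume=0.12[bgm];
[speech][bgm]amix=inputs=2:duration=first[aout]
" \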
-map "[v2]" -map "[aout]" \
-c:v libx264 -crf 16 -preset slow \
-c:a aac -b:a 192k \
pipeline/05_assembly/assembled.mp4
Captions
# Whisper already produced the transcript; use the post-cut version so
# caption times match cut.mp4.
# Convert to SRT:
python -c "
import json
data = json.load(open('pipeline/01_transcript/transcript_cut.json'))
for i, seg in enumerate(data['segments'], 1):
    start = seg['start']; end = seg['end']
    print(f'{i}')
    print(f'{int(start//3600):02}:{int((start%3600)//60):02}:{start%60:06.3f}'.replace('.',',') +
          ' --> ' +
          f'{int(end//3600):02}:{int((end%3600)//60):02}:{end%60:06.3f}'.replace('.',','))
    print(seg['text'].strip())
    print()
" > pipeline/05_assembly/captions.srt
# Burn into video:
ffmpeg -i assembled.mp4 -vf "subtitles=captions.srt:force_style='FontName=Inter,FontSize=28,Bold=1,Outline=2'" -c:a copy final.mp4
Stage 6 — Polish
Human time: 5 min final review.
Color grading
Claude Code generates a LUT or CapCut color node:
Claude prompt: "Generate a color grade LUT for a talking head video shot
in a home office. Warm skin tones, slight contrast boost, teal shadows.
Output as .cube format."
Or for DaVinci Resolve, Claude writes node parameters directly to .drp file.
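For reference, a .cube file is plain text: a `LUT_3D_SIZE N` header followed by N³ lines of RGB floats, with red varying fastest. The sketch below generates a LUT with only a mild contrast S-curve. It does not attempt the warm skin tones or teal shadows from the prompt, and `contrast_curve`, `cube_lines`, the title, and the strength value are all illustrative assumptions.

```python
def contrast_curve(x: float, strength: float = 0.15) -> float:
    """Mild S-curve: blend the identity with a smoothstep."""
    smooth = x * x * (3 - 2 * x)
    return (1 - strength) * x + strength * smooth

def cube_lines(size: int = 17) -> list[str]:
    """Return the lines of a 3D .cube LUT applying contrast_curve per channel."""
    lines = ['TITLE "talking_head_grade"', f"LUT_3D_SIZE {size}"]
    for b in range(size):
        for g in range(size):
            for r in range(size):  # red varies fastest in .cube files
                rgb = [contrast_curve(c / (size - 1)) for c in (r, g, b)]
                lines.append("{:.6f} {:.6f} {:.6f}".format(*rgb))
    return lines
```

Writing `"\n".join(cube_lines())` to `grade.cube` produces a file that FFmpeg can apply with `-vf lut3d=file=grade.cube`.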
BGM placement
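# Claude selects BGM track from royalty-free library
# Ducks the BGM under speech using FFmpeg sidechain compression:
# the BGM is the signal being compressed, the speech is the sidechain key
ffmpeg -i assembled.mp4 -i bgm.mp3 -filter_complex "
[0:a]asplit=2[speech][key];
[1:a]volume=0.15[bgm];
[bgm][key]sidechaincompress=threshold=0.02:ratio=4:attack=200:release=1000[ducked];
[speech][ducked]amix=inputs=2:duration=first[aout]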
" -map 0:v -map "[aout]" -c:v copy polished.mp4
SFX placement
Claude reads transcript, identifies transition moments (topic shifts, cuts between sections), places SFX at exact timestamps:
SFX_MAP = [
(0.0, "pipeline/sfx/intro_whoosh.mp3"),
(45.2, "pipeline/sfx/transition_tone.mp3"), # illustration in
(270.0, "pipeline/sfx/section_shift.mp3"), # new topic
]
Thumbnail
# Extract best frame (Claude picks timestamp from transcript - hook moment)
ffmpeg -i cut.mp4 -ss 00:00:08 -vframes 1 pipeline/06_polish/best_frame.png
# Claude Design generates 3 thumbnail variants using best_frame.png as reference
# Human picks one
Directory Structure
project/
├── raw.mp4 # input
├── pipeline/
│ ├── 01_transcript/
│ │ ├── transcript.json # Whisper output
│ │ └── transcript_cut.json # updated after cuts
│ ├── 02_cuts/
│ │ ├── edl.py # Claude-generated cut script
│ │ └── cut.mp4 # clean talking head
│ ├── 03_briefs/
│ │ └── illustrations.json # Claude-generated brief
│ ├── 04_animations/
│ │ ├── ill_001.html # HyperFrames source
│ │ ├── ill_001.mp4 # rendered animation
│ │ └── ...
│ ├── 05_assembly/
│ │ ├── captions.srt
│ │ ├── draft_content.json # CapCut project file
│ │ └── assembled.mp4
│ └── 06_polish/
│ ├── bgm.mp3
│ ├── polished.mp4
│ └── thumbnails/
│ ├── variant_a.png
│ ├── variant_b.png
│ └── variant_c.png
└── published.mp4 # final output
Claude Code Prompts (Copy-Paste)
Stage 2: Cut analysis
Read pipeline/01_transcript/transcript.json. Produce pipeline/02_cuts/edl.py.
Rules:
- Remove silence gaps > 0.4s between words
- Remove these filler words: uh, um, like, you know, sort of, basically, literally, right (when standalone)
- Detect false starts: same sentence root repeated within 5s → keep second occurrence
- Add 0.05s padding before each kept segment to avoid hard cuts
- Output Python list CUTS = [(start, end, reason), ...] then build_ffmpeg_cmd() function
- Print total removed duration at end
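To make the padding and "total removed" rules concrete, here is a sketch of how removal regions might be inverted into keep intervals. `keep_intervals` is a hypothetical helper; it assumes removals arrive as `(start, end)` pairs in seconds and that the padding is taken from the tail of the preceding removed region, never overlapping the previous kept range.

```python
def keep_intervals(removals, total_s, pad_s=0.05):
    """Invert removal regions into keep intervals; return (keeps, removed_seconds)."""
    # Merge overlapping or touching removal regions first.
    merged = []
    for start, end in sorted(removals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    keeps = []
    floor, cursor = 0.0, 0.0  # previous keep end, previous removal end
    # A zero-length sentinel at the end closes the final keep interval.
    for start, end in merged + [[total_s, total_s]]:
        if start > cursor:
            # Start slightly early, but never before the previous keep ended.
            keep_start = max(floor, cursor - pad_s)
            keeps.append((round(keep_start, 3), round(start, 3)))
        floor, cursor = start, end
    removed = total_s - sum(end - start for start, end in keeps)
    return keeps, removed
```

The returned intervals map directly onto the `(start, end, reason)` tuples in `CUTS`, and `removed` is the figure the prompt asks Claude to print.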
Stage 3: Illustration brief
Read pipeline/01_transcript/transcript_cut.json. Produce pipeline/03_briefs/illustrations.json.
For each conceptual explanation in the transcript:
- Identify the timestamp where the concept is introduced
- Determine insertion mode: fullscreen (complex diagram, speaker pauses),
picture_in_picture (speaker continues talking), lower_third (quick definition/stat)
- Write detailed element list for the animation
- Specify duration: match natural speech pause or concept explanation length
- Flag any moments where code appears: always use lower_third or fullscreen code reveal
Output strict JSON array. 6-10 illustrations for a 10-minute video is normal.
Stage 4: HyperFrames generation
Generate a HyperFrames HTML composition for this illustration brief:
[paste single illustration JSON object]
Requirements:
- GSAP timeline, paused, registered as window.__hfAnime = (t) => tl.seek(t)
- 1920x1080px canvas
- Dark background (#0f1117)
- Blue accent (#60a5fa), green positive path (#4ade80), orange negative (#fb923c)
- JetBrains Mono for code/labels, Inter for prose
- Sequential reveal synchronized to expected speech timing
- Duration: [X]s total
- Mode: [fullscreen|picture_in_picture|lower_third]
Output complete self-contained HTML file.
Human Effort Summary
| Step | Human action | Time |
|---|---|---|
| Record | Film talking head | 20–40 min |
| Stage 2 review | Watch cut.mp4 at 2× | 5 min |
| Stage 4 review | Check each animation | 10 min |
| Stage 6 final | Watch polished.mp4, pick thumbnail | 5 min |
| Total editing | | ~20 min |
Traditional workflow: 4–8 hours.
Build Order (First-Time Setup)
- brew install ffmpeg openai-whisper: core tools
- npm install -g @hyperframes/cli: animation renderer
- Build Stage 1+2 first: biggest time saving, works immediately
- Build illustration brief prompt: no code, just Claude prompt engineering
- Build HyperFrames template library: 6 reusable types (diagram, flowchart, code, chart, lower-third, callout), generate once, re-prompt per video
- Build CapCut JSON writer: Claude Code reads project format, writes directly
- Build thumbnail pipeline last: polish step, lowest leverage
Tags
#VideoEditing #YouTube #TalkingHead #Whisper #FFmpeg #HyperFrames #CapCut #ClaudeCode #Automation #ContentCreation