Extraction that’s not naive
HTML → normalized text, with fallbacks when a site fights back.
Cheerio + sanitization + optional Firecrawl
YouTube transcripts, without drama
Pull captions via multiple paths so “missing transcript” becomes rare.
Website vs YouTube modes
Built for pipelines
--extract-only, --json, and --metrics make it scriptable.
Compose it with your own tools
Provider-agnostic LLMs
OpenAI, xAI, Gemini — use what you’ve got, pin what you need.
Control via
--model + env vars