Transcript Timestamps Plan
Short scope
- Add --timestamps flag to request timed transcripts.
- Preserve existing plain transcript text; add structured segments + timed text.
- Chat mode: include timed transcript + prompt for [mm:ss] references.
- Sidepanel: click timestamp → seek media (video or audio), keep play state.
- Coverage: YouTube, podcasts, embedded captions, generic media; whisper.cpp = no segments unless we add verbose output later.
#1) API / data model
- New option: FetchLinkContentOptions.transcriptTimestamps?: boolean.
- Thread through provider options (ProviderFetchOptions).
- New types:
  - TranscriptSegment: { startMs: number; endMs?: number | null; text: string }.
  - TranscriptResolution.segments?: TranscriptSegment[] | null.
  - ExtractedLinkContent.transcriptSegments?: TranscriptSegment[] | null.
  - ExtractedLinkContent.transcriptTimedText?: string | null (helper).
- Keep TranscriptResolution.text unchanged (plain transcript).
Notes
- --timestamps should only alter output when requested; default output remains stable.
- For JSON output, include both transcriptSegments and transcriptTimedText when requested.
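
To make the data model concrete, here is a minimal sketch of the segment type and one possible way to derive transcriptTimedText from segments. The helper names formatTimestamp and buildTimedText are hypothetical, and the "[mm:ss] text" per-line format is only the recommendation from the open notes, not a settled decision.

```typescript
// Segment shape as listed in the plan.
interface TranscriptSegment {
  startMs: number;
  endMs?: number | null;
  text: string;
}

// Hypothetical helper: render a millisecond offset as [mm:ss],
// switching to [hh:mm:ss] once the offset reaches one hour.
function formatTimestamp(ms: number): string {
  const totalSeconds = Math.max(0, Math.floor(ms / 1000));
  const hours = Math.floor(totalSeconds / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  const seconds = totalSeconds % 60;
  const pad = (n: number): string => (n < 10 ? "0" : "") + String(n);
  return hours > 0
    ? `[${pad(hours)}:${pad(minutes)}:${pad(seconds)}]`
    : `[${pad(minutes)}:${pad(seconds)}]`;
}

// Hypothetical helper: one "[mm:ss] text" line per non-empty segment.
// Returns null when there is nothing to render, matching the nullable field.
function buildTimedText(
  segments: TranscriptSegment[] | null | undefined
): string | null {
  if (!segments || segments.length === 0) return null;
  const lines = segments
    .filter((segment) => segment.text.trim().length > 0)
    .map((segment) => `${formatTimestamp(segment.startMs)} ${segment.text.trim()}`);
  return lines.length > 0 ? lines.join("\n") : null;
}
```

Because the timed text is derived from the segments, the plain TranscriptResolution.text field can stay untouched and the default output does not change.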
#2) Provider updates
YouTube (youtubei)
- Parse startMs (and duration if present) from transcriptSegmentRenderer.
- Build segments array; text still plain (join of text).
YouTube (captionTracks json3 / xml)
- json3 provides events[].tStartMs and dDurationMs; parse segments from events.segs[].utf8.
- XML captions include start + dur; parse segments when present.
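
A rough sketch of the json3 path, assuming the payload has already been JSON-parsed and using only the fields named above (events[].tStartMs, dDurationMs, segs[].utf8). The function name parseJson3Segments is hypothetical, and whitespace collapsing is an assumption about how cue text should be normalized.

```typescript
interface TranscriptSegment {
  startMs: number;
  endMs?: number | null;
  text: string;
}

// Only the json3 fields the plan mentions; anything else in the payload is ignored.
interface Json3Event {
  tStartMs?: number;
  dDurationMs?: number;
  segs?: { utf8?: string }[];
}

// Hypothetical parser: one segment per event that has a start time and visible text.
function parseJson3Segments(
  payload: { events?: Json3Event[] }
): TranscriptSegment[] | null {
  const segments: TranscriptSegment[] = [];
  for (const event of payload.events ?? []) {
    if (typeof event.tStartMs !== "number" || !event.segs) continue;
    const text = event.segs
      .map((seg) => seg.utf8 ?? "")
      .join("")
      .replace(/\s+/g, " ")
      .trim();
    if (!text) continue; // skip events that only carry newlines or spacing
    const endMs =
      typeof event.dDurationMs === "number"
        ? event.tStartMs + event.dDurationMs
        : null;
    segments.push({ startMs: event.tStartMs, endMs, text });
  }
  return segments.length > 0 ? segments : null;
}
```

The plain transcript can still be produced by joining the segment texts, so both outputs come from a single parse.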
Podcast RSS transcripts
- VTT parser should output segments (start/end + cue text).
- JSON transcript: support segments with start/startMs + end/endMs + text.
- Plain text transcripts: segments = null.
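
For the VTT case, a minimal sketch of a cue parser that yields segments. It is deliberately simplified: it treats blank lines as cue separators, drops cue settings and inline tags, and ignores NOTE/STYLE blocks. The names parseVttTime and parseVttSegments are hypothetical, not existing functions.

```typescript
interface TranscriptSegment {
  startMs: number;
  endMs?: number | null;
  text: string;
}

// WebVTT timestamps are "mm:ss.ttt" or "hh:mm:ss.ttt".
function parseVttTime(value: string): number | null {
  const m = /^(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})$/.exec(value.trim());
  if (!m) return null;
  const hours = m[1] ? Number(m[1]) : 0;
  return ((hours * 60 + Number(m[2])) * 60 + Number(m[3])) * 1000 + Number(m[4]);
}

function parseVttSegments(vtt: string): TranscriptSegment[] | null {
  const segments: TranscriptSegment[] = [];
  // Normalize line endings, then split into blocks on blank lines.
  const blocks = vtt.replace(/\r\n?/g, "\n").split(/\n{2,}/);
  for (const block of blocks) {
    const lines = block.split("\n");
    const timingIndex = lines.findIndex((line) => line.indexOf("-->") !== -1);
    if (timingIndex === -1) continue; // header, NOTE, STYLE, ...
    const [startRaw, rest = ""] = lines[timingIndex].split("-->");
    // Anything after the end time is cue settings (e.g. "align:start").
    const endRaw = rest.trim().split(/\s+/)[0] ?? "";
    const startMs = parseVttTime(startRaw);
    if (startMs === null) continue;
    const text = lines
      .slice(timingIndex + 1)
      .join(" ")
      .replace(/<[^>]+>/g, "") // strip inline tags such as <v Speaker> or <c>
      .replace(/\s+/g, " ")
      .trim();
    if (!text) continue;
    segments.push({ startMs, endMs: parseVttTime(endRaw), text });
  }
  return segments.length > 0 ? segments : null;
}
```

Returning null rather than an empty array keeps the "no segments available" case identical to the plain-text transcript path.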
Generic embedded captions
- When track is VTT/JSON, parse into segments; otherwise null.
yt-dlp / whisper / whisper.cpp
- Keep segments = null (plain text only).
- Optional future: request verbose or SRT output from OpenAI/FAL when supported.
#3) Cache behavior
- Store segments in transcript metadata (or dedicated cache field).
- If --timestamps and cached transcript lacks segments, treat as miss and refetch.
- Keep cache keys stable; only bypass when timestamps requested.
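
The cache rule above could be expressed as a small predicate. This is only a sketch: the CachedTranscript shape and the name isUsableCacheHit are hypothetical, and the real cache record may store segments differently.

```typescript
interface TranscriptSegment {
  startMs: number;
  endMs?: number | null;
  text: string;
}

// Hypothetical cache entry shape, assumed for illustration only.
interface CachedTranscript {
  text: string;
  segments?: TranscriptSegment[] | null;
}

// Decide whether a cached entry satisfies the current request.
// The cache key is not consulted here, so keys stay stable.
function isUsableCacheHit(
  cached: CachedTranscript | null | undefined,
  transcriptTimestamps: boolean
): boolean {
  if (!cached) return false;
  if (!transcriptTimestamps) return true; // default path is unchanged
  // Timestamps requested: only a cached entry with segments counts as a hit.
  return Array.isArray(cached.segments) && cached.segments.length > 0;
}
```

One consequence of this rule as written: sources that can never produce segments (plain text transcripts, whisper.cpp) would be refetched on every --timestamps request.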
#4) CLI / daemon
- Add --timestamps to CLI help + config.
- Map to FetchLinkContentOptions.transcriptTimestamps.
- --extract --json: include transcriptSegments + transcriptTimedText.
- Non-JSON extract: keep plain transcript unless --timestamps, then output timed text block.
#5) Chat prompt + content
- buildChatPageContent: when timestamps requested, include Timed transcript: block using [mm:ss].
- buildChatSystemPrompt: add instruction:
  - “When referencing moments, include [mm:ss] timestamps from the transcript.”
#6) Chrome extension UI
Render
- Linkify [mm:ss] and [hh:mm:ss] in assistant messages.
- Convert to timestamp:<seconds> hrefs (or data attribute).
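
A sketch of the linkify step, assuming assistant messages are Markdown text that is rendered afterwards, so timestamps are rewritten as Markdown links with timestamp:<seconds> hrefs. The names linkifyTimestamps and parseTimestampHref are hypothetical; a renderer that works on DOM nodes would use the same pattern but set a data attribute instead.

```typescript
// Matches [mm:ss] and [hh:mm:ss]. The negative lookahead skips timestamps
// that are already the label of a Markdown link, so the rewrite is idempotent.
const TIMESTAMP_PATTERN = /\[(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\](?!\()/g;

function linkifyTimestamps(message: string): string {
  return message.replace(
    TIMESTAMP_PATTERN,
    (match: string, h: string | undefined, m: string, s: string): string => {
      const minutes = Number(m);
      const seconds = Number(s);
      // Leave implausible values such as [12:75] untouched.
      if (seconds >= 60 || (h !== undefined && minutes >= 60)) return match;
      const total = (h ? Number(h) : 0) * 3600 + minutes * 60 + seconds;
      return `${match}(timestamp:${total})`;
    }
  );
}

// Counterpart used by the click handler: "timestamp:65" -> 65.
function parseTimestampHref(href: string): number | null {
  const m = /^timestamp:(\d+(?:\.\d+)?)$/.exec(href);
  return m ? Number(m[1]) : null;
}
```

Keeping the seconds value in the href means the click handler does not need to re-parse the visible label.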
Seek handler
- On click: prevent default, parse seconds, send panel:seek → background → content script.
- Content script:
  - Find <video> or <audio>.
  - Record wasPaused = media.paused.
  - media.currentTime = seconds.
  - If !wasPaused, call media.play(); else do nothing.
- YouTube fallback when no media element:
  - If window.ytplayer / YT IFrame API available, player.seekTo(seconds, true).
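
The content-script steps above can be sketched as a small function. It is written against a structural subset of HTMLMediaElement (the SeekableMedia interface is an assumption made here so the logic can be exercised without a DOM), and seekPreservingPlayState is a hypothetical name. The YouTube fallback is not covered.

```typescript
// Structural subset of HTMLMediaElement; a real <video> or <audio> element satisfies it.
interface SeekableMedia {
  paused: boolean;
  currentTime: number;
  play(): Promise<void> | void;
}

// Seek to `seconds` and keep the play state: playing stays playing, paused stays paused.
function seekPreservingPlayState(media: SeekableMedia, seconds: number): void {
  if (!Number.isFinite(seconds) || seconds < 0) return;
  const wasPaused = media.paused;
  media.currentTime = seconds;
  if (!wasPaused) {
    // play() can reject (for example under autoplay policies);
    // swallow the rejection so the message handler never throws.
    void Promise.resolve(media.play()).catch(() => undefined);
  }
}

// In a content script the element would typically come from something like:
//   const media = document.querySelector<HTMLMediaElement>("video, audio");
//   if (media) seekPreservingPlayState(media, seconds);
```

Reading media.paused before assigning currentTime matters, since the state should reflect what the user had before the click.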
#7) Tests
Core
- youtubei transcript parsing yields segments + plain text.
- captionTracks json3 + xml yield segments.
- VTT parser yields segments.
- Cache: timestamps requested + cached without segments → refetch.
Daemon / CLI
- --timestamps propagates into fetch options.
- JSON extract includes transcriptSegments + transcriptTimedText.
Chrome extension
- Chat content includes timed transcript when requested.
- Sidepanel: timestamp link emits panel:seek.
- Content script seek: playing stays playing, paused stays paused; audio + video.
#8) Changelog
- Entry: --timestamps flag, timed transcripts in chat, clickable timestamps in extension, podcast support.
#9) Notes / open
- “VisPoR” = whisper.cpp: no timestamps unless we add verbose output path.
- Decide exact format of transcriptTimedText (recommend [mm:ss] text per line).