@toubatbrian
Contributor

@toubatbrian toubatbrian commented Jan 22, 2026

Summary

Implements the TTS-aligned transcripts feature for Node.js Agents. This feature enables word-level timestamp synchronization between TTS audio and displayed transcripts.

Changes

Core Implementation

  • Added TimedString interface for text with timing information (startTime, endTime)
  • Implemented createTimedString factory function
  • Added TTSCapabilities.alignedTranscript flag to indicate TTS provider support
  • Implemented performTTSInference utility that extracts timed texts from TTS audio frames
  • Updated TranscriptionSynchronizer to use SpeakingRateData for interpolating word timing from annotations
  • Added USERDATA_TIMED_TRANSCRIPT constant for attaching timed transcripts to audio frame userdata
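
A minimal sketch of how the TimedString pieces listed above could fit together. The review comments on this PR describe a symbol marker, a factory, and a type guard; the exact field layout, the symbol description, and the assumed unit (seconds) below are illustrative assumptions, not the code in agents/src/voice/io.ts.

```typescript
// Hypothetical marker symbol; the real one lives in agents/src/voice/io.ts and may differ.
const TIMED_STRING_SYMBOL = Symbol('timedString');

interface TimedString {
  readonly [TIMED_STRING_SYMBOL]: true;
  text: string;
  startTime?: number; // assumed to be seconds from the start of the audio
  endTime?: number;
}

// Factory: always stamps the marker so the guard below can recognize the object.
function createTimedString(opts: {
  text: string;
  startTime?: number;
  endTime?: number;
}): TimedString {
  return {
    [TIMED_STRING_SYMBOL]: true,
    text: opts.text,
    startTime: opts.startTime,
    endTime: opts.endTime,
  };
}

// Type guard: distinguishes a TimedString from a plain string or an unmarked object.
function isTimedString(value: unknown): value is TimedString {
  return typeof value === 'object' && value !== null && TIMED_STRING_SYMBOL in value;
}
```

With a marker like this, an object literal such as `{ text: 'hi' }` is not mistaken for a TimedString, which is presumably why the STT plugins in this PR switch from inline objects to the factory.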

ElevenLabs Plugin

  • Added syncAlignment option (defaults to true)
  • Implemented toTimedWords to parse ElevenLabs alignment data into TimedString objects
  • Fixed timestamp normalization by subtracting firstWordOffsetMs to handle absolute timestamps
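
The normalization step can be sketched as follows, assuming (as the review notes on plugins/elevenlabs/src/tts.ts describe) that the offset is captured once from the first non-zero start time and that results are clamped at zero. `WordTiming` and `OffsetNormalizer` are hypothetical names for illustration; the plugin's actual `toTimedWords` signature is not shown in this PR description.

```typescript
// Hypothetical input shape: absolute word timings in milliseconds.
interface WordTiming {
  text: string;
  startMs: number;
  endMs: number;
}

class OffsetNormalizer {
  // Captured once, from the first word whose start time is non-zero.
  private firstWordOffsetMs: number | null = null;

  normalize(words: WordTiming[]): WordTiming[] {
    return words.map((word) => {
      if (this.firstWordOffsetMs === null && word.startMs > 0) {
        this.firstWordOffsetMs = word.startMs;
      }
      const offset = this.firstWordOffsetMs ?? 0;
      return {
        text: word.text,
        // Clamp at zero so a word can never start or end before the audio does.
        startMs: Math.max(0, word.startMs - offset),
        endMs: Math.max(0, word.endMs - offset),
      };
    });
  }
}
```

Keeping the offset on the instance means later chunks of the same utterance are shifted by the same amount as the first one.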

Cartesia Plugin

  • Created types.ts with Zod schemas for Cartesia WebSocket API messages (chunk, timestamps, done, flush_done, error)
  • Refactored WebSocket handling to use createStreamChannel pattern, preventing message loss during listener re-registration
  • Implemented word timestamp extraction from word_timestamps messages
  • Added wordTimestamps option (defaults to true)
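
A rough sketch of the word timestamp extraction, assuming the `word_timestamps` payload carries parallel `words`, `start`, and `end` arrays as the review diff for plugins/cartesia/src/tts.ts suggests. `extractTimedWords` and `TimedWord` are hypothetical names, and the length guard follows the reviewer's suggestion rather than confirmed plugin behavior.

```typescript
// Assumed shape of the word_timestamps field in a Cartesia message.
interface WordTimestamps {
  words: string[];
  start: number[];
  end: number[];
}

interface TimedWord {
  text: string;
  startTime: number;
  endTime: number;
}

function extractTimedWords(timestamps: WordTimestamps): TimedWord[] {
  // Only iterate over indices present in all three arrays, so a malformed
  // message cannot produce undefined start or end times.
  const count = Math.min(
    timestamps.words.length,
    timestamps.start.length,
    timestamps.end.length,
  );
  const result: TimedWord[] = [];
  for (let i = 0; i < count; i++) {
    result.push({
      text: timestamps.words[i]!,
      startTime: timestamps.start[i]!,
      endTime: timestamps.end[i]!,
    });
  }
  return result;
}
```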

Voice Options

  • Moved useTtsAlignedTranscript from session-level parameter into VoiceOptions interface
  • Changed default value to true
  • Agent-level setting takes precedence over session-level setting
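
The precedence rule above can be expressed in one line. This is a sketch under the assumption that an unset option is represented as `undefined`; `resolveUseTtsAlignedTranscript` is a hypothetical helper, not a function from this PR.

```typescript
// Agent-level value wins when set; otherwise the session-level value; otherwise the default (true).
function resolveUseTtsAlignedTranscript(
  agentLevel: boolean | undefined,
  sessionLevel: boolean | undefined,
): boolean {
  return agentLevel ?? sessionLevel ?? true;
}
```

Using `??` rather than `||` matters here: an explicit `false` at the agent level must override a session-level `true`.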

Stream Adapter

  • Updated StreamAdapter to create TimedString with cumulative duration for non-streaming TTS providers
  • Set alignedTranscript: true capability for adapted streams
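
A minimal sketch of cumulative duration tracking for a non-streaming provider, where each sentence is synthesized separately and its timing is derived from the audio produced so far. The names and the choice to set `endTime` to the running total after the sentence are assumptions for illustration; the PR description only states that a cumulative duration is used.

```typescript
// Hypothetical per-sentence synthesis result.
interface SentenceAudio {
  text: string;
  samplesPerChannel: number;
  sampleRate: number;
}

interface TimedSentence {
  text: string;
  startTime: number; // seconds
  endTime: number; // seconds
}

function assignCumulativeTimes(chunks: SentenceAudio[]): TimedSentence[] {
  let cumulativeDuration = 0;
  return chunks.map((chunk) => {
    const duration = chunk.samplesPerChannel / chunk.sampleRate;
    const timed: TimedSentence = {
      text: chunk.text,
      startTime: cumulativeDuration,
      endTime: cumulativeDuration + duration,
    };
    // The next sentence starts where this one ends.
    cumulativeDuration += duration;
    return timed;
  });
}
```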

Agent Activity

  • Updated transcription input source selection logic to use timed texts stream when useTtsAlignedTranscript is enabled and TTS supports it

Test plan

  • Verified ElevenLabs TTS word timestamps sync with audio
  • Verified Cartesia TTS word timestamps sync with audio
  • Verified non-streaming TTS with StreamAdapter syncs with audio
  • Confirmed no transcript lag during extended agent turns

Summary by CodeRabbit

  • New Features

    • Timed (word-level) transcripts added and exposed across voice/real-time flows.
    • Optional "TTS-aligned transcripts" enabled by default for voice sessions to align text with audio.
    • TTS providers and realtime integrations now emit per-word timing alongside audio frames for precise sync.
  • Tests

    • New unit tests covering timing and synchronization logic to validate aligned transcript behavior.

@changeset-bot
changeset-bot bot commented Jan 22, 2026

🦋 Changeset detected

Latest commit: 6f37f02

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 18 packages
Name Type
@livekit/agents-plugin-elevenlabs Patch
@livekit/agents-plugin-cartesia Patch
@livekit/agents-plugin-deepgram Patch
@livekit/agents-plugin-openai Patch
@livekit/agents Patch
@livekit/agents-plugin-anam Patch
@livekit/agents-plugin-google Patch
@livekit/agents-plugin-inworld Patch
@livekit/agents-plugin-neuphonic Patch
@livekit/agents-plugin-resemble Patch
@livekit/agents-plugin-rime Patch
@livekit/agents-plugin-xai Patch
@livekit/agents-plugin-baseten Patch
@livekit/agents-plugin-bey Patch
@livekit/agents-plugin-hedra Patch
@livekit/agents-plugin-livekit Patch
@livekit/agents-plugin-silero Patch
@livekit/agents-plugins-test Patch

@coderabbitai
coderabbitai bot commented Jan 22, 2026

📝 Walkthrough

This PR adds a TimedString primitive and pipes TTS-aligned, word-level timing data throughout agents, updating TTS/STT plugins, voice session/agent APIs, generation and synchronization logic, and streaming paths to accept and forward ReadableStream<string | TimedString>.

Changes

Cohort / File(s) Summary
Core TimedString API
agents/src/voice/io.ts, agents/src/types.ts, agents/src/index.ts
Introduces TimedString marker, createTimedString, isTimedString, and USERDATA_TIMED_TRANSCRIPT; exports TimedString utilities in public index.
Agent & Session Config
agents/src/voice/agent_session.ts, agents/src/voice/agent.ts, agents/src/voice/index.ts
Adds useTtsAlignedTranscript option (default true), surfaces VoiceOptions, propagates option through Agent/AgentSession, updates transcription node signatures to accept `ReadableStream<string
TTS core & stream adapter
agents/src/tts/tts.ts, agents/src/tts/stream_adapter.ts
Adds timedTranscripts?: TimedString[] and alignedTranscript?: boolean; StreamAdapter initialized with alignedTranscript: true and attaches timed transcripts to audio frames using cumulative duration tracking.
Generation & TTS inference
agents/src/voice/generation.ts
Adds _TTSGenerationData (audioStream + timedTextsFut), changes performTTSInference to accept/return timed streams, integrates createTimedString/isTimedString, and forwards timed transcripts via timedTextsFut.
Synchronization & timing logic
agents/src/voice/transcription/synchronizer.ts, agents/src/voice/transcription/synchronizer.test.ts
Adds SpeakingRateData, extends AudioData, makes pushText/captureText accept `string
Realtime / LLM text stream
agents/src/llm/realtime.ts, plugins/openai/src/realtime/realtime_model.ts, plugins/openai/src/realtime/realtime_model_beta.ts, plugins/openai/src/realtime/api_proto.ts
Changes message/text channels to stream `string
STT integrations
agents/src/inference/stt.ts, plugins/deepgram/src/stt.ts, plugins/deepgram/src/stt_v2.ts
Use createTimedString(...) to produce per-word timed entries instead of inline objects; import createTimedString where needed.
TTS plugins — Cartesia & ElevenLabs
plugins/cartesia/src/tts.ts, plugins/cartesia/src/types.ts, plugins/elevenlabs/src/tts.ts
Cartesia: adds wordTimestamps option, Zod schemas, buffered event channel, emits timed transcripts with audio frames. ElevenLabs: introduces timestamp normalization (firstWordOffsetMs) and attaches timed transcripts to frames.
Example & deps
examples/src/basic_agent.ts, plugins/cartesia/package.json
Example enables useTtsAlignedTranscript; Cartesia plugin adds zod peer dep.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Agent
    participant TTS
    participant Sync as TextSynchronizer
    participant Transcriber

    Client->>Agent: audio input / TTS request
    Agent->>TTS: forward text stream (string|TimedString) with useTtsAlignedTranscript
    TTS->>Agent: audio frames (AudioFrame) + timedTranscripts (TimedString[])
    Agent->>Sync: attach timedTranscripts via USERDATA_TIMED_TRANSCRIPT
    Sync->>Sync: compute SpeakingRateData / annotated rates
    Sync->>Transcriber: emit synchronized text (string|TimedString)
    Transcriber->>Client: synchronized transcript output
sequenceDiagram
    participant PluginWS as TTS Plugin WS
    participant Parser
    participant Accumulator
    participant FrameOut as Audio Frame Emitter
    participant Consumer as Agent/Client

    PluginWS->>Parser: receive websocket messages
    Parser->>Accumulator: validate & extract word timestamps (hasWordTimestamps)
    Accumulator->>FrameOut: batch into TimedString objects, attach to frames
    FrameOut->>Consumer: emit AudioFrame + timedTranscripts
    Consumer->>Sync: consume timedTranscripts for alignment

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • lukasIO
  • chenghao-mou
  • davidzhao
  • theomonnom

Poem

🐰 I nibble bytes and count each beat,

words march in time with every tweet.
Frames and phrases now hold hands tight,
sync'd and snug from morning to night.
Hooray — timed transcripts take flight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: implement TTS aligned transcripts' accurately and concisely summarizes the main feature added in this PR, which is clearly the implementation of TTS-aligned transcripts enabling word-level timestamp synchronization.
Description check ✅ Passed The PR description provides a comprehensive summary of changes, well-organized implementation details across multiple files, plugin-specific changes, and a clear test plan with verification steps.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
agents/src/voice/transcription/synchronizer.ts (1)

655-670: flush() changed to async but parent class defines it as synchronous - interface violation.

The flush() method in SyncedTextOutput is now async but the parent class TextOutput defines abstract flush(): void; (synchronous). This breaks the Liskov Substitution Principle and creates interface incompatibility.

Additionally, multiple callers do not await this method:

  • agents/src/voice/generation.ts lines 693, 792
  • agents/src/voice/room_io/room_io.ts line 316
  • agents/src/voice/transcription/synchronizer.ts lines 318, 660 (even within the new async method)

Either make the parent flush() async or keep SyncedTextOutput.flush() synchronous and handle synchronization differently.

examples/src/basic_agent.ts (1)

4-15: Initialize logger before LLM usage.

Examples must call initializeLogger({ pretty: true }) before any LLM functionality.

🛠️ Suggested fix
 import {
   type JobContext,
   type JobProcess,
   WorkerOptions,
   cli,
   defineAgent,
+  initializeLogger,
   llm,
   metrics,
   voice,
 } from '@livekit/agents';
@@
 import { fileURLToPath } from 'node:url';
 import { z } from 'zod';
 
+initializeLogger({ pretty: true });
+
 export default defineAgent({

As per coding guidelines, please initialize the logger in examples before using LLMs.

🤖 Fix all issues with AI agents
In `@agents/src/tts/stream_adapter.ts`:
- Around line 106-124: The audio frame write assumes audio.frame.userdata is
always defined and replaces the timed-transcript array, which can throw if
userdata is undefined and clobber existing entries; before assigning, ensure
audio.frame.userdata exists (e.g., audio.frame.userdata = audio.frame.userdata
?? {}) and safely append the timedString instead of replacing: read existing
array at audio.frame.userdata[USERDATA_TIMED_TRANSCRIPT] (or default to []),
push timedString, then reassign that array back to
audio.frame.userdata[USERDATA_TIMED_TRANSCRIPT]; apply the same defensive
initialization and append logic to the identical pattern in the other occurrence
that writes USERDATA_TIMED_TRANSCRIPT (the block using createTimedString /
isFirstFrame and audio.frame.userdata).

In `@agents/src/voice/agent_activity.ts`:
- Around line 1419-1434: The current await on ttsGenData.timedTextsFut.await can
hang if the TTS task fails before resolving; update the block that checks
useTtsAlignedTranscript / tts?.capabilities.alignedTranscript / ttsGenData to
race the timedTextsFut.await with a failure/timeout signal from the TTS task (or
ensure the TTS path always resolves/rejects the future on error). Specifically,
when handling ttsGenData and calling timedTextsFut.await, use a Promise.race (or
equivalent) between timedTextsFut.await and a fallback that rejects or times out
if the TTS generation task fails (or if ttsGenData exposes a task/error
promise), then log and fall back to llmOutput if the race returns an
error/timeout so the transcriptionInput assignment never deadlocks.

In `@agents/src/voice/generation.ts`:
- Around line 579-614: The variable initialPushedDuration is unused because
pushedDuration is initialized to 0 per performTTSInference call; remove the
unused offset logic or mark it as intentional scaffolding: either delete
initialPushedDuration and the + initialPushedDuration adjustments in the
createTimedString call (and its comment), or replace the comment with a TODO
stating this is reserved for multi-inference offsets and keep
initialPushedDuration so future callers can pass a non-zero pushedDuration;
update references in performTTSInference, the createTimedString call that
adjusts startTime/endTime, and any related comment near timedTextsWriter
accordingly.

In `@examples/src/timed_transcript_agent.ts`:
- Around line 23-32: Add a call to initializeLogger({ pretty: true }) at the top
of the module before any LLM-related imports/usage (i.e., before references to
llm, defineAgent, stream, voice, etc.); specifically, place the initialization
right after module imports and before any code that calls or constructs llm or
defineAgent so the logger is configured prior to LLM tooling being used.

In `@plugins/cartesia/src/tts.ts`:
- Around line 405-411: The debug log message in the TTS chunk timeout handler
incorrectly says "STT chunk stream"; update the string passed to
this.#logger.debug inside the timeout callback (the block that sets timeout =
setTimeout(...)) to say "TTS chunk stream timeout after
${this.#opts.chunkTimeout}ms" so the log accurately reflects TTS, leaving the
rest of the timeout closure (including ws.close()) unchanged.

In `@plugins/cartesia/src/types.ts`:
- Around line 1-3: Update the SPDX copyright header at the top of the file by
changing the year in the SPDX-FileCopyrightText line from 2024 to 2025; locate
the SPDX-FileCopyrightText entry (the comment starting with
"SPDX-FileCopyrightText: 2024 LiveKit, Inc.") and replace 2024 with 2025 so the
header reads "SPDX-FileCopyrightText: 2025 LiveKit, Inc." while leaving the
SPDX-License-Identifier line unchanged.
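
The race suggested for agents/src/voice/agent_activity.ts above can be sketched as follows. `awaitWithFallback` is a hypothetical helper; the timeout value and the decision to resolve to the fallback (rather than reject) on failure are assumptions, and the real code would also log before falling back.

```typescript
// Waits for `primary`, but resolves to `fallback` if it rejects or takes longer than `timeoutMs`.
async function awaitWithFallback<T>(
  primary: Promise<T>,
  fallback: T,
  timeoutMs: number,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    // The executor runs synchronously, so `timer` is set before the race starts.
    timer = setTimeout(() => resolve(fallback), timeoutMs);
  });
  try {
    return await Promise.race([primary, timeout]);
  } catch {
    // The primary promise rejected (for example, the TTS task failed before
    // resolving the timed-text future): use the fallback instead of hanging.
    return fallback;
  } finally {
    clearTimeout(timer);
  }
}
```

In the context of this review, `primary` would stand in for `ttsGenData.timedTextsFut.await` and `fallback` for the LLM text output, so the transcription input assignment always completes.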
🧹 Nitpick comments (6)
agents/src/voice/transcription/synchronizer.ts (2)

61-85: Potential negative dt in addByAnnotation when timestamps arrive out of order.

If startTime is less than pushedDuration (e.g., due to timing drift or reordered messages), dt becomes negative on line 68, resulting in a potentially negative or incorrect rate calculation on line 72. Consider guarding against this edge case.

🔧 Suggested defensive check
   addByAnnotation(text: string, startTime: number | undefined, endTime: number | undefined): void {
     if (startTime !== undefined) {
       // Calculate the integral of the speaking rate up to the start time
       const integral = this.speakIntegrals.length > 0 
         ? this.speakIntegrals[this.speakIntegrals.length - 1]! 
         : 0;

       const dt = startTime - this.pushedDuration;
+      // Guard against negative dt (out-of-order timestamps)
+      if (dt < 0) {
+        this.textBuffer.push(text);
+        if (endTime !== undefined) {
+          this.addByAnnotation('', endTime, undefined);
+        }
+        return;
+      }
       // Use the length of the text directly instead of hyphens
       const textLen = this.textBuffer.reduce((sum, t) => sum + t.length, 0);

90-124: Linear search mislabeled as binary search.

The comment mentions "Binary search" but the implementation is a linear scan (O(n)). For typical use cases with small arrays this is acceptable, but the comment is misleading.

📝 Fix comment or implement actual binary search
-    // Binary search for the right position (equivalent to np.searchsorted with side="right")
+    // Linear search for the right position (equivalent to np.searchsorted with side="right")
+    // Note: For small arrays this is efficient enough; consider binary search for large datasets
     let idx = 0;
     for (let i = 0; i < this.timestamps.length; i++) {
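
If an actual binary search is preferred over relabeling the comment, an upper-bound search matching `np.searchsorted(..., side="right")` semantics could look like this. `searchSortedRight` is a hypothetical standalone helper, not code from synchronizer.ts, and it assumes the input array is sorted in ascending order.

```typescript
// Returns the first index whose element is strictly greater than `value`,
// i.e. the insertion point that keeps equal elements to the left.
function searchSortedRight(sorted: readonly number[], value: number): number {
  let lo = 0;
  let hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sorted[mid]! <= value) {
      lo = mid + 1; // everything up to and including mid is <= value
    } else {
      hi = mid; // mid is a candidate; keep searching to the left
    }
  }
  return lo;
}
```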
plugins/cartesia/src/types.ts (1)

83-89: Type naming may cause confusion - CartesiaServerMessage includes error messages.

CartesiaServerMessage is inferred from cartesiaMessageSchema (which is the union including error messages), not from cartesiaServerMessageSchema. This means the type includes CartesiaErrorMessage, which may be intentional but the naming suggests otherwise.

Consider either:

  1. Renaming to CartesiaMessage to reflect that it includes errors, or
  2. Creating a separate type for the full union
📝 Clarify type naming
-export type CartesiaServerMessage = z.infer<typeof cartesiaMessageSchema>;
+/** Union of all Cartesia messages including errors */
+export type CartesiaMessage = z.infer<typeof cartesiaMessageSchema>;
+/** Server messages excluding error messages */
+export type CartesiaServerMessage = z.infer<typeof cartesiaServerMessageSchema>;
plugins/cartesia/src/tts.ts (1)

372-386: Array bounds not validated before indexed access.

The word timestamps arrays (words, start, end) are assumed to have the same length, but this isn't validated. If arrays have mismatched lengths, undefined values could be accessed.

🔧 Add length validation
           if (this.#opts.wordTimestamps !== false && hasWordTimestamps(serverMsg)) {
             const wordTimestamps = serverMsg.word_timestamps;
+            const minLength = Math.min(
+              wordTimestamps.words.length,
+              wordTimestamps.start.length,
+              wordTimestamps.end.length,
+            );
-            for (let i = 0; i < wordTimestamps.words.length; i++) {
+            for (let i = 0; i < minLength; i++) {
               const word = wordTimestamps.words[i];
               const startTime = wordTimestamps.start[i];
               const endTime = wordTimestamps.end[i];
plugins/elevenlabs/src/tts.ts (1)

1037-1059: Consider event-based signaling instead of polling.

The 10ms polling loop works but consumes CPU cycles. Consider using a condition variable or Promise-based signaling when new data arrives. However, for the current use case, this is acceptable.
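
As one possible shape for the Promise-based signaling mentioned above, a small queue can hand items to a waiting consumer as soon as they arrive, without a timer. `AsyncQueue` is a hypothetical helper written for illustration and is not part of the plugin.

```typescript
class AsyncQueue<T> {
  private items: T[] = [];
  private waiters: Array<(item: T) => void> = [];

  // Producer side: wake a waiting consumer if there is one, otherwise buffer the item.
  push(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) {
      waiter(item);
    } else {
      this.items.push(item);
    }
  }

  // Consumer side: resolves immediately if an item is buffered,
  // otherwise stays pending until the next push.
  next(): Promise<T> {
    if (this.items.length > 0) {
      return Promise.resolve(this.items.shift() as T);
    }
    return new Promise<T>((resolve) => {
      this.waiters.push(resolve);
    });
  }
}
```

A consumer loop would then `await queue.next()` instead of sleeping for 10ms between checks.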

agents/src/voice/agent_activity.ts (1)

1219-1258: say() path doesn’t use aligned transcripts even when enabled.
transcriptionNode receives raw text only, so useTtsAlignedTranscript has no effect for AgentActivity.say(). Consider wiring ttsGenData.timedTextsFut into the transcription input (similar to the pipeline path) when the TTS supports aligned transcripts.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 25df43a and 540243e.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (24)
  • agents/src/index.ts
  • agents/src/inference/stt.ts
  • agents/src/llm/realtime.ts
  • agents/src/tts/stream_adapter.ts
  • agents/src/tts/tts.ts
  • agents/src/types.ts
  • agents/src/voice/agent.ts
  • agents/src/voice/agent_activity.ts
  • agents/src/voice/agent_session.ts
  • agents/src/voice/generation.ts
  • agents/src/voice/index.ts
  • agents/src/voice/io.ts
  • agents/src/voice/transcription/synchronizer.ts
  • examples/src/basic_agent.ts
  • examples/src/timed_transcript_agent.ts
  • plugins/cartesia/package.json
  • plugins/cartesia/src/tts.ts
  • plugins/cartesia/src/types.ts
  • plugins/deepgram/src/stt.ts
  • plugins/deepgram/src/stt_v2.ts
  • plugins/elevenlabs/src/tts.ts
  • plugins/openai/src/realtime/api_proto.ts
  • plugins/openai/src/realtime/realtime_model.ts
  • plugins/openai/src/realtime/realtime_model_beta.ts
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)

Add SPDX-FileCopyrightText and SPDX-License-Identifier headers to all newly added files with '// SPDX-FileCopyrightText: 2025 LiveKit, Inc.' and '// SPDX-License-Identifier: Apache-2.0'

Files:

  • agents/src/tts/tts.ts
  • plugins/openai/src/realtime/api_proto.ts
  • agents/src/voice/agent_session.ts
  • agents/src/voice/io.ts
  • plugins/cartesia/src/types.ts
  • plugins/cartesia/src/tts.ts
  • agents/src/tts/stream_adapter.ts
  • plugins/deepgram/src/stt.ts
  • examples/src/timed_transcript_agent.ts
  • plugins/openai/src/realtime/realtime_model_beta.ts
  • agents/src/voice/generation.ts
  • agents/src/index.ts
  • agents/src/types.ts
  • agents/src/voice/agent.ts
  • agents/src/voice/agent_activity.ts
  • examples/src/basic_agent.ts
  • agents/src/inference/stt.ts
  • agents/src/voice/index.ts
  • plugins/deepgram/src/stt_v2.ts
  • agents/src/llm/realtime.ts
  • agents/src/voice/transcription/synchronizer.ts
  • plugins/elevenlabs/src/tts.ts
  • plugins/openai/src/realtime/realtime_model.ts
**/*.{ts,tsx}?(test|example|spec)

📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)

When testing inference LLM, always use full model names from agents/src/inference/models.ts (e.g., 'openai/gpt-4o-mini' instead of 'gpt-4o-mini')

Files:

  • agents/src/tts/tts.ts
  • plugins/openai/src/realtime/api_proto.ts
  • agents/src/voice/agent_session.ts
  • agents/src/voice/io.ts
  • plugins/cartesia/src/types.ts
  • plugins/cartesia/src/tts.ts
  • agents/src/tts/stream_adapter.ts
  • plugins/deepgram/src/stt.ts
  • examples/src/timed_transcript_agent.ts
  • plugins/openai/src/realtime/realtime_model_beta.ts
  • agents/src/voice/generation.ts
  • agents/src/index.ts
  • agents/src/types.ts
  • agents/src/voice/agent.ts
  • agents/src/voice/agent_activity.ts
  • examples/src/basic_agent.ts
  • agents/src/inference/stt.ts
  • agents/src/voice/index.ts
  • plugins/deepgram/src/stt_v2.ts
  • agents/src/llm/realtime.ts
  • agents/src/voice/transcription/synchronizer.ts
  • plugins/elevenlabs/src/tts.ts
  • plugins/openai/src/realtime/realtime_model.ts
**/*.{ts,tsx}?(test|example)

📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)

Initialize logger before using any LLM functionality with initializeLogger({ pretty: true }) from '@livekit/agents'

Files:

  • agents/src/tts/tts.ts
  • plugins/openai/src/realtime/api_proto.ts
  • agents/src/voice/agent_session.ts
  • agents/src/voice/io.ts
  • plugins/cartesia/src/types.ts
  • plugins/cartesia/src/tts.ts
  • agents/src/tts/stream_adapter.ts
  • plugins/deepgram/src/stt.ts
  • examples/src/timed_transcript_agent.ts
  • plugins/openai/src/realtime/realtime_model_beta.ts
  • agents/src/voice/generation.ts
  • agents/src/index.ts
  • agents/src/types.ts
  • agents/src/voice/agent.ts
  • agents/src/voice/agent_activity.ts
  • examples/src/basic_agent.ts
  • agents/src/inference/stt.ts
  • agents/src/voice/index.ts
  • plugins/deepgram/src/stt_v2.ts
  • agents/src/llm/realtime.ts
  • agents/src/voice/transcription/synchronizer.ts
  • plugins/elevenlabs/src/tts.ts
  • plugins/openai/src/realtime/realtime_model.ts
🧠 Learnings (4)
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to examples/src/test_*.ts : For plugin component debugging (STT, TTS, LLM), create test example files prefixed with `test_` under the examples directory and run with `pnpm build && node ./examples/src/test_my_plugin.ts`

Applied to files:

  • plugins/deepgram/src/stt.ts
  • examples/src/timed_transcript_agent.ts
  • agents/src/voice/generation.ts
  • agents/src/index.ts
  • examples/src/basic_agent.ts
  • agents/src/voice/index.ts
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Use `pnpm build && pnpm dlx tsx ./examples/src/my_agent.ts dev|download-files --log-level=debug|info(default)` to run example agents from the examples directory

Applied to files:

  • examples/src/timed_transcript_agent.ts
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to **/*.{ts,tsx}?(test|example) : Initialize logger before using any LLM functionality with `initializeLogger({ pretty: true })` from '@livekit/agents'

Applied to files:

  • agents/src/voice/generation.ts
  • examples/src/basic_agent.ts
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to **/*.{ts,tsx}?(test|example|spec) : When testing inference LLM, always use full model names from `agents/src/inference/models.ts` (e.g., 'openai/gpt-4o-mini' instead of 'gpt-4o-mini')

Applied to files:

  • plugins/openai/src/realtime/realtime_model.ts
🧬 Code graph analysis (14)
agents/src/tts/tts.ts (1)
agents/src/voice/io.ts (1)
  • TimedString (48-55)
agents/src/voice/io.ts (3)
agents/src/voice/agent.ts (1)
  • ModelSettings (58-61)
agents/src/voice/index.ts (2)
  • ModelSettings (4-4)
  • TimedString (9-9)
agents/src/index.ts (3)
  • TimedString (37-37)
  • createTimedString (37-37)
  • isTimedString (37-37)
plugins/cartesia/src/tts.ts (2)
agents/src/voice/index.ts (1)
  • TimedString (9-9)
plugins/cartesia/src/types.ts (6)
  • CartesiaServerMessage (89-89)
  • cartesiaMessageSchema (74-77)
  • isErrorMessage (111-113)
  • hasWordTimestamps (115-119)
  • isChunkMessage (95-97)
  • isDoneMessage (103-105)
agents/src/tts/stream_adapter.ts (3)
agents/src/index.ts (1)
  • createTimedString (37-37)
agents/src/voice/io.ts (1)
  • createTimedString (60-75)
agents/src/types.ts (1)
  • USERDATA_TIMED_TRANSCRIPT (9-9)
plugins/deepgram/src/stt.ts (2)
agents/src/index.ts (1)
  • createTimedString (37-37)
agents/src/voice/io.ts (1)
  • createTimedString (60-75)
examples/src/timed_transcript_agent.ts (4)
agents/src/index.ts (6)
  • TimedString (37-37)
  • voice (40-40)
  • stream (40-40)
  • llm (40-40)
  • tts (40-40)
  • cli (40-40)
agents/src/voice/io.ts (2)
  • TimedString (48-55)
  • stream (98-100)
plugins/cartesia/src/tts.ts (1)
  • stream (130-132)
agents/src/voice/agent_activity.ts (2)
  • llm (337-339)
  • tts (341-343)
plugins/openai/src/realtime/realtime_model_beta.ts (1)
agents/src/voice/io.ts (2)
  • TimedString (48-55)
  • createTimedString (60-75)
agents/src/voice/generation.ts (3)
agents/src/utils.ts (4)
  • Future (123-160)
  • Task (420-532)
  • done (141-143)
  • done (525-527)
agents/src/voice/io.ts (3)
  • TimedString (48-55)
  • createTimedString (60-75)
  • isTimedString (80-87)
agents/src/types.ts (1)
  • USERDATA_TIMED_TRANSCRIPT (9-9)
agents/src/voice/agent.ts (4)
agents/src/voice/agent_activity.ts (1)
  • useTtsAlignedTranscript (367-370)
agents/src/voice/agent_session.ts (1)
  • useTtsAlignedTranscript (279-281)
agents/src/voice/io.ts (1)
  • TimedString (48-55)
agents/src/types.ts (1)
  • USERDATA_TIMED_TRANSCRIPT (9-9)
agents/src/voice/agent_activity.ts (3)
agents/src/voice/generation.ts (2)
  • performTTSInference (520-647)
  • _TTSGenerationData (54-63)
agents/src/voice/index.ts (1)
  • TimedString (9-9)
agents/src/voice/io.ts (1)
  • TimedString (48-55)
plugins/deepgram/src/stt_v2.ts (2)
agents/src/index.ts (1)
  • createTimedString (37-37)
agents/src/voice/io.ts (1)
  • createTimedString (60-75)
agents/src/llm/realtime.ts (2)
agents/src/index.ts (1)
  • TimedString (37-37)
agents/src/voice/io.ts (1)
  • TimedString (48-55)
plugins/elevenlabs/src/tts.ts (2)
agents/src/index.ts (2)
  • createTimedString (37-37)
  • TimedString (37-37)
agents/src/voice/index.ts (1)
  • TimedString (9-9)
plugins/openai/src/realtime/realtime_model.ts (6)
agents/src/tts/stream_adapter.ts (1)
  • stream (40-42)
agents/src/index.ts (3)
  • stream (40-40)
  • TimedString (37-37)
  • createTimedString (37-37)
agents/src/stt/stream_adapter.ts (1)
  • stream (36-38)
agents/src/stream/stream_channel.ts (1)
  • StreamChannel (7-12)
agents/src/stream/index.ts (1)
  • StreamChannel (7-7)
agents/src/voice/index.ts (1)
  • TimedString (9-9)
🔇 Additional comments (35)
plugins/cartesia/package.json (1)

52-52: The Zod version range is safe for this codebase. Both versions 3.25.76 and 4.1.8 exist on npm, and while Zod 4 has breaking changes (e.g., z.record() signature, error APIs, string format helpers), the schemas in types.ts only use compatible APIs (z.object(), z.array(), z.string(), z.number(), z.literal(), z.boolean(), z.union(), z.discriminatedUnion(), z.infer<>). None of these are affected by the breaking changes, so supporting both major versions is valid.

Likely an incorrect or invalid review comment.

plugins/deepgram/src/stt_v2.ts (1)

4-12: Consistent TimedString construction.
Aligns word entries with the TimedString factory for downstream alignment support.

Also applies to: 489-495

plugins/deepgram/src/stt.ts (1)

4-15: TimedString wrapping looks good.
This keeps word payloads consistent with the new timing-aware APIs.

Also applies to: 445-451

plugins/openai/src/realtime/api_proto.ts (1)

597-608: Optional start_time is a clean, compatible extension.
No issues with adding this field as optional.

agents/src/voice/io.ts (3)

27-35: Doc note on userdata support is helpful.
Clarifies future extraction path without affecting runtime behavior.


40-87: TimedString core utilities look solid.
The symbol marker + factory + type guard provide a clean, consistent API.


257-263: captureText widening is appropriate.
Allows timed segments to flow through the text pipeline as intended.

agents/src/index.ts (1)

37-37: Public re-export is appropriate.
Makes TimedString utilities discoverable from the package root.

agents/src/types.ts (1)

5-9: USERDATA_TIMED_TRANSCRIPT constant looks good.
Centralizing the key reduces drift across modules.

agents/src/voice/transcription/synchronizer.ts (1)

577-589: LGTM - Proper handling of timed transcripts without audio passthrough.

The flush logic correctly handles the case where timed transcripts are used: if text is pending but no audio was pushed to the synchronizer, it ends audio input to allow text processing rather than rotating the segment. This aligns with the PR objective of TTS-aligned transcripts where audio goes directly to the room.

plugins/cartesia/src/types.ts (1)

95-119: LGTM - Type guards provide clear message discrimination.

The type guards are well-implemented for runtime discrimination. The hasWordTimestamps helper provides good semantic clarity for the TTS alignment use case.

plugins/cartesia/src/tts.ts (2)

283-342: LGTM - Robust event channel pattern prevents message loss.

The refactored WebSocket handling with a buffered event channel and single listener registration is a solid improvement. This pattern correctly addresses the issue of message loss during listener re-registration that can occur with repeated once() calls.
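
To illustrate why a single persistent listener avoids message loss, here is a sketch using Node's `EventEmitter` in place of a WebSocket. It is not the plugin's `createStreamChannel` implementation; `bufferMessages` and its return shape are hypothetical.

```typescript
import { EventEmitter } from 'node:events';

// Registers ONE listener for the lifetime of the connection and buffers every message.
// With repeated once() calls, a message emitted between two registrations would be dropped.
function bufferMessages<T>(
  source: EventEmitter,
  event: string,
): { drain: () => T[]; stop: () => void } {
  const buffer: T[] = [];
  const onMessage = (msg: T): void => {
    buffer.push(msg);
  };
  source.on(event, onMessage);
  return {
    // Returns and clears everything received since the last drain.
    drain: () => buffer.splice(0, buffer.length),
    stop: () => {
      source.off(event, onMessage);
    },
  };
}
```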


412-432: LGTM - Proper coordination for stream termination.

The sentenceStreamClosed flag correctly coordinates the WebSocket lifecycle, ensuring the connection only closes when both: (1) Cartesia returns a done message, AND (2) all sentences have been sent. This prevents premature termination.

plugins/elevenlabs/src/tts.ts (2)

173-236: LGTM - Timestamp normalization correctly removes leading silence.

The toTimedWords function properly normalizes timestamps by subtracting firstWordOffsetMs, with Math.max(0, ...) guards preventing negative values. The documentation clearly explains why this is needed (ElevenLabs returns absolute timestamps that may include leading silence).


511-515: LGTM - First word offset capture logic is correct.

The offset is captured only once (when firstWordOffsetMs === null) and only from non-zero start times, correctly identifying the first actual word timing for normalization.

agents/src/voice/generation.ts (2)

536-555: LGTM - Clean text extraction stream implementation.

The IIFE pattern correctly transforms the mixed string | TimedString input stream into a text-only stream for the TTS node, with proper error handling and resource cleanup.


662-697: LGTM - Proper handling of TimedString in text forwarding.

The implementation correctly extracts the text for accumulation while passing the original TimedString (with timing metadata) to textOutput.captureText() for synchronization. This enables the synchronizer to use word-level timing information.

agents/src/inference/stt.ts (1)

491-499: LGTM - Correct use of createTimedString factory.

The change from object literals to using createTimedString ensures consistent TimedString objects with the proper TIMED_STRING_SYMBOL marker, aligning with the broader API surface updates across the codebase.

agents/src/llm/realtime.ts (1)

9-26: TimedString support in textStream looks good.

The widened type and the updated doc comment clarify downstream expectations.

examples/src/basic_agent.ts (2)

55-55: Switching to a concrete Cartesia TTS instance is solid.

Makes the example align with the new plugin usage.


62-66: Aligned transcript flag wiring looks good.

Explicitly enabling the option keeps the example clear.

agents/src/tts/tts.ts (2)

16-38: Timed transcripts on SynthesizedAudio are well integrated.

The optional field keeps compatibility while enabling timestamps.


42-55: Capability flag for aligned transcripts is clear and useful.

Nice, minimal API extension.

agents/src/voice/index.ts (1)

5-5: Exporting VoiceOptions is the right public surface update.

Keeps config types accessible to users.

agents/src/tts/stream_adapter.ts (1)

6-9: Imports for timed transcript support are appropriate.

No concerns here.

agents/src/voice/agent_session.ts (3)

76-82: VoiceOptions addition is well documented.

Matches the new aligned transcript flow.


95-95: Defaulting to aligned transcripts makes sense.

Please confirm providers that don’t support aligned transcripts still fall back cleanly to plain text.


275-281: Getter for useTtsAlignedTranscript is a clean addition.

Straightforward and consistent with other session getters.

plugins/openai/src/realtime/realtime_model_beta.ts (2)

58-61: TimedString propagation wiring looks good.
The widened channel typing and creation align with timed transcript streaming.

Also applies to: 1103-1107


1274-1286: Aligned transcript delta wrapping looks good.

plugins/openai/src/realtime/realtime_model.ts (2)

57-60: TimedString channel typing update looks consistent.

Also applies to: 1194-1199


1374-1385: No action required. The OpenAI Realtime API returns start_time in seconds, and TimedString.startTime is documented to accept seconds. The code correctly passes the value directly without conversion.

agents/src/voice/agent_activity.ts (1)

362-370: Agent-level override for aligned transcripts is clear.

agents/src/voice/agent.ts (2)

63-82: Agent-level useTtsAlignedTranscript option and transcription typing look good.

Also applies to: 165-192, 228-232


399-442: Timed transcript userdata attachment is solid.
This cleanly propagates aligned transcript data downstream.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugins/elevenlabs/src/tts.ts (1)

510-519: Avoid shifting timestamps when the first spoken character starts at 0.

With the current “first non‑zero start” logic, a true 0ms start can get shifted by the next character’s start time. Consider keying the offset off the first non‑whitespace character instead.

🛠️ Suggested fix
-                if (ctx.firstWordOffsetMs === null && start > 0) {
-                  ctx.firstWordOffsetMs = start;
-                }
+                if (ctx.firstWordOffsetMs === null && char.trim().length > 0) {
+                  ctx.firstWordOffsetMs = start;
+                }
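
The difference between the two conditions in the suggested fix can be seen in a small standalone sketch. The `CharTiming` shape and both helper functions are hypothetical and written only for this comparison; they are not the plugin's actual types.

```typescript
// Hypothetical shape for one aligned character; not the plugin's real type.
interface CharTiming {
  char: string;
  startMs: number;
}

// Current rule: the first non-zero start time becomes the offset.
function offsetByFirstNonZero(chars: CharTiming[]): number | null {
  for (const c of chars) {
    if (c.startMs > 0) return c.startMs;
  }
  return null;
}

// Suggested rule: the first non-whitespace character defines the offset.
function offsetByFirstSpoken(chars: CharTiming[]): number | null {
  for (const c of chars) {
    if (c.char.trim().length > 0) return c.startMs;
  }
  return null;
}

// Speech that genuinely starts at 0 ms.
const startsAtZero: CharTiming[] = [
  { char: "H", startMs: 0 },
  { char: "i", startMs: 80 },
];

// Speech preceded by a whitespace character at 0 ms.
const leadingSpace: CharTiming[] = [
  { char: " ", startMs: 0 },
  { char: "H", startMs: 120 },
];
```

For `startsAtZero`, the non-zero rule picks 80 (the second character), which would shift every later timestamp by 80 ms, while the non-whitespace rule keeps the offset at 0. For `leadingSpace`, both rules agree on 120.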
🤖 Fix all issues with AI agents
In `@agents/src/voice/transcription/synchronizer.ts`:
- Around line 199-202: The hasPendingText getter currently checks only whether
any text was ever pushed (this.textData.pushedText.length > 0); change it to
return true only when there is unforwarded text by comparing pushed vs forwarded
counts, i.e. return this.textData.pushedText.length >
this.textData.forwardedText.length; update the hasPendingText accessor in
synchronizer.ts to use this comparison so segment rotation logic (which relies
on hasPendingText) behaves correctly.
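
A minimal sketch of the pushed-versus-forwarded comparison described in the prompt above. The `TextData` and `SegmentState` classes are simplified, hypothetical stand-ins for the synchronizer's internal state, assuming both fields are plain strings.

```typescript
// Hypothetical stand-in for the synchronizer's text bookkeeping.
class TextData {
  pushedText = "";
  forwardedText = "";
}

class SegmentState {
  readonly textData = new TextData();

  // True only while some pushed text has not been forwarded yet.
  get hasPendingText(): boolean {
    return this.textData.pushedText.length > this.textData.forwardedText.length;
  }
}
```

With the original `pushedText.length > 0` check, the getter would stay true after every character had been forwarded; the comparison lets it drop back to false.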
♻️ Duplicate comments (5)
agents/src/tts/stream_adapter.ts (1)

112-119: Guard audio.frame.userdata before assignment.

If AudioFrame.userdata can be undefined in @livekit/rtc-node, this write will throw and can also overwrite existing metadata. Please confirm initialization guarantees; otherwise initialize and append safely.

🛠️ Safer assignment pattern
-          audio.frame.userdata[USERDATA_TIMED_TRANSCRIPT] = [timedString];
+          const userdata = (audio.frame.userdata ??= {});
+          const existing = userdata[USERDATA_TIMED_TRANSCRIPT];
+          userdata[USERDATA_TIMED_TRANSCRIPT] = Array.isArray(existing)
+            ? [...existing, timedString]
+            : [timedString];
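
The same pattern as a self-contained sketch, for readers who want to try it outside the adapter. `FrameLike`, `USERDATA_KEY`, and `appendTimedTranscript` are hypothetical names used only here; whether the real `AudioFrame.userdata` can be undefined is exactly the open question raised above.

```typescript
// Hypothetical key; the real constant is USERDATA_TIMED_TRANSCRIPT.
const USERDATA_KEY = "timed_transcript_example";

// Hypothetical minimal frame shape with optional userdata.
interface FrameLike {
  userdata?: Record<string, unknown>;
}

// Initializes userdata if missing and appends instead of overwriting.
function appendTimedTranscript(frame: FrameLike, item: unknown): void {
  const userdata = (frame.userdata ??= {});
  const existing = userdata[USERDATA_KEY];
  userdata[USERDATA_KEY] = Array.isArray(existing) ? [...existing, item] : [item];
}
```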
agents/src/voice/generation.ts (1)

586-624: The initialPushedDuration offset has no effect within a single inference.

The initialPushedDuration is always 0 because pushedDuration is initialized to 0 at line 567 and initialPushedDuration is captured immediately before the loop. The offset adjustment serves no functional purpose currently.

If this is scaffolding for future multi-inference duration accumulation, add a TODO comment explaining the intent. Otherwise, consider removing the unused offset logic.

♻️ Option 1: Add TODO explaining the scaffolding
       // pushed_duration stays CONSTANT within one inference. It represents
       // the cumulative duration from PREVIOUS TTS inferences. We capture it here before
       // the loop to match Python's behavior.
+      // TODO: Currently always 0 since pushedDuration is local. If multi-inference
+      // duration accumulation is needed, pass the initial offset from the caller.
       const initialPushedDuration = pushedDuration;
♻️ Option 2: Remove unused offset logic
-      // pushed_duration stays CONSTANT within one inference. It represents
-      // the cumulative duration from PREVIOUS TTS inferences. We capture it here before
-      // the loop to match Python's behavior.
-      const initialPushedDuration = pushedDuration;
-
       while (true) {
         // ...
             const adjustedTimedText = createTimedString({
               text: timedText.text,
-              startTime:
-                timedText.startTime !== undefined
-                  ? timedText.startTime + initialPushedDuration
-                  : undefined,
-              endTime:
-                timedText.endTime !== undefined
-                  ? timedText.endTime + initialPushedDuration
-                  : undefined,
+              startTime: timedText.startTime,
+              endTime: timedText.endTime,
               confidence: timedText.confidence,
               startTimeOffset: timedText.startTimeOffset,
             });
agents/src/voice/agent_activity.ts (1)

1417-1430: Guard against deadlock if TTS initialization fails before timed texts resolve.

ttsGenData.timedTextsFut.await can hang indefinitely if the TTS task throws before resolving the future. Consider racing it with the TTS task result or adding a timeout to avoid a stuck pipeline.

🛠️ Suggested fix
     // Check if we should use TTS aligned transcripts
     // Conditions: useTtsAlignedTranscript enabled, TTS has alignedTranscript capability, and we have ttsGenData
     if (this.useTtsAlignedTranscript && this.tts?.capabilities.alignedTranscript && ttsGenData) {
-      // Wait for the timed texts stream to be resolved
-      const timedTextsStream = await ttsGenData.timedTextsFut.await;
+      // Avoid hanging if TTS fails before timedTextsFut resolves
+      const timedTextsStream = await Promise.race([
+        ttsGenData.timedTextsFut.await,
+        ttsTask ? ttsTask.result.then(() => null).catch(() => null) : Promise.resolve(null),
+      ]);
       if (timedTextsStream) {
         this.logger.debug('Using TTS aligned transcripts for transcription node input');
         transcriptionInput = timedTextsStream;
       }
     }
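
The racing idea can be sketched in isolation as follows. `raceWithTask` is a hypothetical helper, not part of the codebase; it assumes the timed-texts future and the TTS task are both exposed as plain promises.

```typescript
// Resolves with the awaited value, or with null as soon as the producing
// task settles (success or failure) without the value having arrived.
async function raceWithTask<T>(
  value: Promise<T>,
  task: Promise<unknown>,
): Promise<T | null> {
  // Map both outcomes of the task to null so a rejection cannot escape.
  const settled: Promise<null> = task.then(
    () => null,
    () => null,
  );
  return await Promise.race([value, settled]);
}
```

If the task fails before the future resolves, the caller gets `null` and can fall back to the plain text stream instead of waiting forever. A timeout would be an alternative design, at the cost of choosing a duration.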
plugins/cartesia/src/types.ts (1)

1-3: Copyright year should be 2025 per coding guidelines.

Based on coding guidelines, SPDX-FileCopyrightText should use 2025 for newly added files.

📝 Fix copyright year
-// SPDX-FileCopyrightText: 2024 LiveKit, Inc.
+// SPDX-FileCopyrightText: 2025 LiveKit, Inc.
 //
 // SPDX-License-Identifier: Apache-2.0
plugins/cartesia/src/tts.ts (1)

405-411: Typo in log message: "STT" should be "TTS".

The timeout debug log incorrectly refers to "STT chunk stream" but this is TTS code.

📝 Fix typo
             timeout = setTimeout(() => {
               // cartesia chunk timeout quite often, so we make it a debug log
               this.#logger.debug(
-                `Cartesia WebSocket STT chunk stream timeout after ${this.#opts.chunkTimeout}ms`,
+                `Cartesia WebSocket TTS chunk stream timeout after ${this.#opts.chunkTimeout}ms`,
               );
               ws.close();
             }, this.#opts.chunkTimeout);
🧹 Nitpick comments (2)
plugins/cartesia/src/types.ts (1)

64-76: Consider using .passthrough() for error schema to handle unknown fields.

The error message schema uses z.string() for the type field, which prevents it from being included in the discriminated union and makes it act as a catch-all: if Cartesia adds new message types in the future, they would be parsed as errors. Consider whether this fallback behavior is intentional.

agents/src/voice/transcription/synchronizer.ts (1)

99-107: Linear scan instead of binary search.

The comment mentions "binary search" but the implementation is a linear scan. For small arrays this is fine, but consider using actual binary search for better performance with longer transcripts.

♻️ Binary search implementation
     // Binary search for the right position (equivalent to np.searchsorted with side="right")
-    let idx = 0;
-    for (let i = 0; i < this.timestamps.length; i++) {
-      if (this.timestamps[i]! <= timestamp) {
-        idx = i + 1;
-      } else {
-        break;
-      }
+    let lo = 0;
+    let hi = this.timestamps.length;
+    while (lo < hi) {
+      const mid = Math.floor((lo + hi) / 2);
+      if (this.timestamps[mid]! <= timestamp) {
+        lo = mid + 1;
+      } else {
+        hi = mid;
+      }
     }
+    const idx = lo;
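
To check that the binary search is a drop-in replacement for the linear scan, both can be written as standalone functions and compared. The function names are hypothetical; the semantics assumed here are those stated in the code comment, namely `np.searchsorted` with `side="right"` on an ascending array.

```typescript
// Linear scan, mirroring the loop currently in the synchronizer.
function searchSortedRightLinear(sorted: readonly number[], value: number): number {
  let idx = 0;
  for (let i = 0; i < sorted.length; i++) {
    if (sorted[i]! <= value) {
      idx = i + 1;
    } else {
      break;
    }
  }
  return idx;
}

// Binary search: returns the number of elements <= value.
function searchSortedRight(sorted: readonly number[], value: number): number {
  let lo = 0;
  let hi = sorted.length;
  while (lo < hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (sorted[mid]! <= value) {
      lo = mid + 1; // everything up to mid is <= value
    } else {
      hi = mid; // mid is strictly greater, insertion point is at or before it
    }
  }
  return lo;
}
```

Both return the insertion index that keeps the array sorted with equal elements placed to the left, so ties resolve the same way in either version.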
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 540243e and 069d513.

📒 Files selected for processing (20)
  • agents/src/inference/stt.ts
  • agents/src/llm/realtime.ts
  • agents/src/tts/stream_adapter.ts
  • agents/src/tts/tts.ts
  • agents/src/types.ts
  • agents/src/voice/agent.ts
  • agents/src/voice/agent_activity.ts
  • agents/src/voice/agent_session.ts
  • agents/src/voice/generation.ts
  • agents/src/voice/index.ts
  • agents/src/voice/io.ts
  • agents/src/voice/transcription/synchronizer.ts
  • plugins/cartesia/src/tts.ts
  • plugins/cartesia/src/types.ts
  • plugins/deepgram/src/stt.ts
  • plugins/deepgram/src/stt_v2.ts
  • plugins/elevenlabs/src/tts.ts
  • plugins/openai/src/realtime/api_proto.ts
  • plugins/openai/src/realtime/realtime_model.ts
  • plugins/openai/src/realtime/realtime_model_beta.ts
🚧 Files skipped from review as they are similar to previous changes (8)
  • plugins/openai/src/realtime/api_proto.ts
  • agents/src/llm/realtime.ts
  • plugins/deepgram/src/stt.ts
  • agents/src/types.ts
  • agents/src/voice/index.ts
  • agents/src/tts/tts.ts
  • plugins/deepgram/src/stt_v2.ts
  • plugins/openai/src/realtime/realtime_model.ts
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)

Add SPDX-FileCopyrightText and SPDX-License-Identifier headers to all newly added files with '// SPDX-FileCopyrightText: 2025 LiveKit, Inc.' and '// SPDX-License-Identifier: Apache-2.0'

Files:

  • agents/src/voice/agent_session.ts
  • agents/src/inference/stt.ts
  • plugins/cartesia/src/tts.ts
  • agents/src/voice/transcription/synchronizer.ts
  • plugins/cartesia/src/types.ts
  • agents/src/voice/agent_activity.ts
  • agents/src/tts/stream_adapter.ts
  • plugins/elevenlabs/src/tts.ts
  • agents/src/voice/io.ts
  • agents/src/voice/agent.ts
  • plugins/openai/src/realtime/realtime_model_beta.ts
  • agents/src/voice/generation.ts
**/*.{ts,tsx}?(test|example|spec)

📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)

When testing inference LLM, always use full model names from agents/src/inference/models.ts (e.g., 'openai/gpt-4o-mini' instead of 'gpt-4o-mini')

Files:

  • agents/src/voice/agent_session.ts
  • agents/src/inference/stt.ts
  • plugins/cartesia/src/tts.ts
  • agents/src/voice/transcription/synchronizer.ts
  • plugins/cartesia/src/types.ts
  • agents/src/voice/agent_activity.ts
  • agents/src/tts/stream_adapter.ts
  • plugins/elevenlabs/src/tts.ts
  • agents/src/voice/io.ts
  • agents/src/voice/agent.ts
  • plugins/openai/src/realtime/realtime_model_beta.ts
  • agents/src/voice/generation.ts
**/*.{ts,tsx}?(test|example)

📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)

Initialize logger before using any LLM functionality with initializeLogger({ pretty: true }) from '@livekit/agents'

Files:

  • agents/src/voice/agent_session.ts
  • agents/src/inference/stt.ts
  • plugins/cartesia/src/tts.ts
  • agents/src/voice/transcription/synchronizer.ts
  • plugins/cartesia/src/types.ts
  • agents/src/voice/agent_activity.ts
  • agents/src/tts/stream_adapter.ts
  • plugins/elevenlabs/src/tts.ts
  • agents/src/voice/io.ts
  • agents/src/voice/agent.ts
  • plugins/openai/src/realtime/realtime_model_beta.ts
  • agents/src/voice/generation.ts
🧠 Learnings (3)
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to examples/src/test_*.ts : For plugin component debugging (STT, TTS, LLM), create test example files prefixed with `test_` under the examples directory and run with `pnpm build && node ./examples/src/test_my_plugin.ts`

Applied to files:

  • plugins/cartesia/src/tts.ts
  • agents/src/voice/generation.ts
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to **/*.{ts,tsx,js,jsx} : Add SPDX-FileCopyrightText and SPDX-License-Identifier headers to all newly added files with '// SPDX-FileCopyrightText: 2025 LiveKit, Inc.' and '// SPDX-License-Identifier: Apache-2.0'

Applied to files:

  • plugins/cartesia/src/types.ts
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to **/*.{ts,tsx}?(test|example) : Initialize logger before using any LLM functionality with `initializeLogger({ pretty: true })` from '@livekit/agents'

Applied to files:

  • plugins/openai/src/realtime/realtime_model_beta.ts
  • agents/src/voice/generation.ts
🧬 Code graph analysis (8)
agents/src/inference/stt.ts (1)
agents/src/voice/io.ts (2)
  • TimedString (40-47)
  • createTimedString (52-67)
plugins/cartesia/src/tts.ts (2)
agents/src/voice/index.ts (1)
  • TimedString (9-9)
plugins/cartesia/src/types.ts (6)
  • CartesiaServerMessage (88-88)
  • cartesiaMessageSchema (73-76)
  • isErrorMessage (110-112)
  • hasWordTimestamps (114-116)
  • isChunkMessage (94-96)
  • isDoneMessage (102-104)
agents/src/voice/transcription/synchronizer.ts (1)
agents/src/voice/io.ts (2)
  • TimedString (40-47)
  • isTimedString (72-79)
agents/src/voice/agent_activity.ts (4)
agents/src/voice/generation.ts (3)
  • performTTSInference (528-661)
  • _TTSGenerationData (62-71)
  • performAudioForwarding (810-828)
agents/src/voice/index.ts (1)
  • TimedString (9-9)
agents/src/voice/io.ts (1)
  • TimedString (40-47)
agents/src/index.ts (1)
  • TimedString (37-37)
agents/src/tts/stream_adapter.ts (2)
agents/src/voice/io.ts (1)
  • createTimedString (52-67)
agents/src/types.ts (1)
  • USERDATA_TIMED_TRANSCRIPT (8-8)
plugins/elevenlabs/src/tts.ts (1)
agents/src/voice/index.ts (1)
  • TimedString (9-9)
plugins/openai/src/realtime/realtime_model_beta.ts (1)
agents/src/voice/io.ts (2)
  • TimedString (40-47)
  • createTimedString (52-67)
agents/src/voice/generation.ts (2)
agents/src/voice/io.ts (4)
  • TimedString (40-47)
  • TTSNode (27-30)
  • createTimedString (52-67)
  • isTimedString (72-79)
agents/src/types.ts (1)
  • USERDATA_TIMED_TRANSCRIPT (8-8)
🔇 Additional comments (30)
plugins/elevenlabs/src/tts.ts (3)

11-12: Timed transcript normalization and TimedString construction look solid.

Using createTimedString and subtracting the first-word offset keeps timestamps in seconds and consistent with the new TimedString contract.

Also applies to: 120-121, 173-239, 299-311, 531-537, 557-563


1021-1077: Queued timed transcripts are properly attached to frames.

Draining the timed transcript queue and attaching to the next emitted frame (including final flush) should prevent alignment drops.


643-651: syncAlignment properly gates alignedTranscript capability.

Mapping the option into the base TTS capability makes feature gating explicit and consistent.

agents/src/voice/agent_session.ts (1)

76-81: Aligned transcript option is well surfaced at the session level.

Defaulting to true and exposing a getter keeps behavior centralized and discoverable.

Also applies to: 94-95, 274-279

agents/src/tts/stream_adapter.ts (1)

6-9: StreamAdapter timing accumulation is clear.

cumulativeDuration plus createTimedString gives deterministic token start times while keeping the adapter aligned‑transcript aware.

Also applies to: 18-18, 57-58, 106-110
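
A rough sketch of the accumulation idea, assuming each synthesized chunk reports its audio duration in seconds. `SynthesizedChunk`, `TimedChunk`, and `assignStartTimes` are hypothetical names for illustration; the adapter's real implementation may differ in detail.

```typescript
// Hypothetical input: one synthesized sentence and its audio duration.
interface SynthesizedChunk {
  text: string;
  durationSec: number;
}

// Hypothetical output: the same text tagged with its start time.
interface TimedChunk {
  text: string;
  startTime: number;
}

// Each chunk starts where the audio of all previous chunks ends.
function assignStartTimes(chunks: SynthesizedChunk[]): TimedChunk[] {
  let cumulativeDuration = 0;
  return chunks.map(({ text, durationSec }) => {
    const startTime = cumulativeDuration;
    cumulativeDuration += durationSec;
    return { text, startTime };
  });
}
```

Because the start time depends only on the durations already emitted, the result is deterministic for a given sequence of chunks, though it is sentence-level rather than word-level timing.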

agents/src/inference/stt.ts (1)

19-19: TimedString factory use keeps markers consistent.

Centralizing construction through createTimedString ensures the symbol marker is always set.

Also applies to: 492-499

plugins/openai/src/realtime/realtime_model_beta.ts (2)

13-20: TimedString-capable text channels are wired correctly.

Typing the text channel as string | TimedString while keeping audioTranscript concatenation on raw deltas preserves downstream expectations.

Also applies to: 66-66, 1112-1112, 1291-1291


1280-1284: No action needed. The OpenAI Realtime API returns start_time in seconds, not milliseconds, so the current code is correct. The event.start_time value can be passed directly to createTimedString without conversion.

Likely an incorrect or invalid review comment.

agents/src/voice/io.ts (1)

32-79: TimedString utilities and TextOutput signature update look good.

Symbol marker + factory + guard centralize aligned transcript handling, and captureText now supports timed entries.

Also applies to: 249-249
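
For readers unfamiliar with the symbol-marker pattern, here is a minimal sketch of how a marker, a factory, and a type guard fit together. It reuses the names mentioned in this review (`TIMED_STRING_SYMBOL`, `createTimedString`, `isTimedString`), but the bodies are assumptions written for illustration and may not match agents/src/voice/io.ts.

```typescript
// Marker symbol; the description string is an assumption.
const TIMED_STRING_SYMBOL: unique symbol = Symbol("TimedString");

interface TimedString {
  readonly [TIMED_STRING_SYMBOL]: true;
  text: string;
  startTime?: number; // seconds
  endTime?: number; // seconds
}

// Factory: the only place that sets the marker.
function createTimedString(opts: {
  text: string;
  startTime?: number;
  endTime?: number;
}): TimedString {
  return { [TIMED_STRING_SYMBOL]: true, ...opts };
}

// Guard: distinguishes factory-made objects from look-alike literals.
function isTimedString(value: unknown): value is TimedString {
  return (
    typeof value === "object" &&
    value !== null &&
    (value as Record<symbol, unknown>)[TIMED_STRING_SYMBOL] === true
  );
}

// Helper for mixed streams: plain text regardless of the chunk type.
function textOf(chunk: string | TimedString): string {
  return isTimedString(chunk) ? chunk.text : chunk;
}
```

An object literal such as `{ text: "hi" }` fails the guard because it lacks the marker, which is why routing construction through the factory matters.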

agents/src/voice/agent_activity.ts (4)

362-369: LGTM! Clean precedence logic for the new setting.

The getter correctly implements the agent-level override pattern, allowing per-agent configuration to take precedence over session-level defaults.


1247-1261: LGTM! Proper integration of TTS generation data.

The refactoring correctly uses ttsGenData.audioStream for audio forwarding while enabling the timed transcripts pipeline.


1453-1465: LGTM! Consistent use of ttsGenData for audio forwarding.

The null check with a clear error message ensures developers understand when the invariant is violated.


1855-1856: Type signature correctly updated to support TimedString.

The realtime generation task properly handles the mixed string | TimedString stream type for transcription input.

plugins/cartesia/src/types.ts (1)

94-115: Type guards are redundant after Zod parsing but useful for type narrowing.

Since cartesiaMessageSchema.parse() already validates the message structure, these guards primarily serve TypeScript type narrowing. The implementation is correct.

agents/src/voice/transcription/synchronizer.ts (3)

36-133: LGTM! SpeakingRateData provides timing interpolation for TTS alignment.

The class correctly tracks word timing annotations and computes accumulated speaking units for synchronization. The approach of storing timestamps, rates, and integrals allows for efficient interpolation.


579-591: LGTM! Timed texts bypass audio synchronizer correctly.

When using TTS-aligned transcripts, audio goes directly to the room while text timing comes from TTS annotations. The logic correctly handles the case where text is pending but no audio was pushed through the synchronizer.


653-660: Good addition of barrier await before accessing _impl.

This prevents race conditions where flush could access an outdated _impl during segment rotation.

agents/src/voice/agent.ts (3)

75-82: LGTM! Clean implementation of the useTtsAlignedTranscript option.

The option is well-documented, properly stored, and exposed via a getter. The comment referencing the Python implementation (agent.py, lines 50 and 80) is helpful for cross-language parity.

Also applies to: 91-94, 118-119, 164-165, 185-190


430-433: LGTM! Timed transcripts correctly attached to audio frames.

The USERDATA_TIMED_TRANSCRIPT key is used consistently to propagate word-level timing from TTS through the audio pipeline.


447-452: LGTM! Default transcriptionNode passes through TimedString.

The passthrough behavior correctly preserves timing information when no custom transcription processing is needed.

plugins/cartesia/src/tts.ts (6)

56-74: LGTM! Well-documented wordTimestamps option with sensible default.

The option enables word-level timing data by default, aligning with the broader TTS-aligned transcript feature.


81-93: LGTM! Capabilities correctly reflect wordTimestamps setting.

The alignedTranscript capability is derived from the wordTimestamps option, ensuring downstream code can check TTS capabilities accurately.


280-340: Excellent refactoring to event channel pattern.

The switch from repeatedly attaching/detaching WebSocket listeners to a single event channel prevents message loss during processing. The pattern correctly buffers incoming messages.


370-386: LGTM! Word timestamps correctly converted to TimedString.

The code properly iterates through the word timestamps array and creates TimedString objects with timing data. Adding a space after each word ensures consistent tokenization downstream.
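
The conversion can be sketched as below. The `WordTimestamps` shape (parallel `words`, `start`, and `end` arrays, with times in seconds) is an assumption about the message payload, and `TimedWord` and `wordsToTimed` are hypothetical names; the plugin itself builds `TimedString` objects through `createTimedString`.

```typescript
// Assumed payload shape: three parallel arrays, times in seconds.
interface WordTimestamps {
  words: string[];
  start: number[];
  end: number[];
}

// Hypothetical simplified result type.
interface TimedWord {
  text: string;
  startTime: number;
  endTime: number;
}

function wordsToTimed(ts: WordTimestamps): TimedWord[] {
  // Guard against arrays of unequal length by using the shortest one.
  const n = Math.min(ts.words.length, ts.start.length, ts.end.length);
  const out: TimedWord[] = [];
  for (let i = 0; i < n; i++) {
    out.push({
      text: ts.words[i]! + " ", // trailing space keeps tokens separable
      startTime: ts.start[i]!,
      endTime: ts.end[i]!,
    });
  }
  return out;
}
```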


412-432: Good coordination between sentence stream and done messages.

The logic correctly waits for sentenceStreamClosed before processing the final done message, ensuring all sentences are sent before closing the WebSocket.


476-522: LGTM! toCartesiaOptions correctly handles streaming parameter.

The function conditionally adds add_timestamps: true only for streaming mode when word timestamps are enabled, matching Cartesia's WebSocket API requirements.

agents/src/voice/generation.ts (4)

58-71: LGTM! Clean interface for TTS generation data.

The _TTSGenerationData interface properly encapsulates the audio stream and timed texts future, enabling the agent activity to coordinate transcription with TTS output.


542-562: LGTM! Text extraction handles mixed string/TimedString input.

The async IIFE correctly extracts plain text for the TTS node while allowing the original TimedString objects to flow elsewhere if needed.


578-582: Critical: Future resolved before loop enables concurrent reading.

This design allows agent_activity to start consuming the timed texts stream while performTTSInference is still writing, enabling real-time transcript synchronization.


676-711: LGTM! forwardText correctly handles mixed string/TimedString input.

The function extracts text for accumulation while passing the original value (including TimedString with timing) to textOutput.captureText for synchronization.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@agents/src/voice/transcription/synchronizer.test.ts`:
- Around line 1-3: Update the file header copyright year from 2024 to 2025 in
the SPDX comment block at the top of the file (the two-line block containing
"SPDX-FileCopyrightText" and "SPDX-License-Identifier"); replace "2024 LiveKit,
Inc." with "2025 LiveKit, Inc." so the header complies with the new file year
guideline.
♻️ Duplicate comments (3)
agents/src/voice/agent_activity.ts (1)

1414-1425: Guard against potential hang if TTS initialization fails before timedTextsFut resolves.

The await ttsGenData.timedTextsFut.await on line 1420 can hang indefinitely if TTS throws before resolving the future. Consider racing it with a failure signal from the TTS task.

🛠️ Suggested fix
     // Check if we should use TTS aligned transcripts
     if (this.useTtsAlignedTranscript && this.tts?.capabilities.alignedTranscript && ttsGenData) {
-      // Wait for the timed texts stream to be resolved
-      const timedTextsStream = await ttsGenData.timedTextsFut.await;
+      // Avoid hanging if TTS fails before timedTextsFut resolves
+      const timedTextsStream = await Promise.race([
+        ttsGenData.timedTextsFut.await,
+        ttsTask ? ttsTask.result.then(() => null).catch(() => null) : Promise.resolve(null),
+      ]);
       if (timedTextsStream) {
         this.logger.debug('Using TTS aligned transcripts for transcription node input');
         transcriptionInput = timedTextsStream;
       }
     }
agents/src/voice/transcription/synchronizer.ts (1)

199-201: Verify hasPendingText logic for both use cases.

The current implementation returns true if any text was ever pushed (pushedText.length > 0), rather than checking for unforwarded text (pushedText.length > forwardedText.length).

This may be intentional for the new TTS-aligned transcript flow where text is pushed without audio going through the synchronizer. However, for cases where text is being processed normally, this could return true even when all text has been forwarded.

Consider whether the logic should be:

return this.textData.pushedText.length > this.textData.forwardedText.length;
agents/src/voice/generation.ts (1)

569-619: LGTM with note about scaffolding.

The timed transcript extraction logic is correct:

  1. timedTextsFut is resolved before the loop (critical for streaming)
  2. Timed transcripts are extracted from frame.userdata and forwarded
  3. The initialPushedDuration offset is scaffolding for future multi-inference support (currently always 0)
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 069d513 and 6b9a5d9.

📒 Files selected for processing (9)
  • .changeset/odd-moose-check.md
  • agents/src/tts/tts.ts
  • agents/src/voice/agent.ts
  • agents/src/voice/agent_activity.ts
  • agents/src/voice/agent_session.ts
  • agents/src/voice/generation.ts
  • agents/src/voice/transcription/synchronizer.test.ts
  • agents/src/voice/transcription/synchronizer.ts
  • examples/src/basic_agent.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • examples/src/basic_agent.ts
  • agents/src/voice/agent_session.ts
  • agents/src/tts/tts.ts
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)

Add SPDX-FileCopyrightText and SPDX-License-Identifier headers to all newly added files with '// SPDX-FileCopyrightText: 2025 LiveKit, Inc.' and '// SPDX-License-Identifier: Apache-2.0'

Files:

  • agents/src/voice/transcription/synchronizer.test.ts
  • agents/src/voice/agent_activity.ts
  • agents/src/voice/transcription/synchronizer.ts
  • agents/src/voice/generation.ts
  • agents/src/voice/agent.ts
**/*.{ts,tsx}?(test|example|spec)

📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)

When testing inference LLM, always use full model names from agents/src/inference/models.ts (e.g., 'openai/gpt-4o-mini' instead of 'gpt-4o-mini')

Files:

  • agents/src/voice/transcription/synchronizer.test.ts
  • agents/src/voice/agent_activity.ts
  • agents/src/voice/transcription/synchronizer.ts
  • agents/src/voice/generation.ts
  • agents/src/voice/agent.ts
**/*.{ts,tsx}?(test|example)

📄 CodeRabbit inference engine (.cursor/rules/agent-core.mdc)

Initialize logger before using any LLM functionality with initializeLogger({ pretty: true }) from '@livekit/agents'

Files:

  • agents/src/voice/transcription/synchronizer.test.ts
  • agents/src/voice/agent_activity.ts
  • agents/src/voice/transcription/synchronizer.ts
  • agents/src/voice/generation.ts
  • agents/src/voice/agent.ts
🧠 Learnings (3)
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to **/{examples,test}/**/*.test.ts : Include both basic streaming and tool calling tests to verify full LLM functionality

Applied to files:

  • agents/src/voice/transcription/synchronizer.test.ts
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to examples/src/test_*.ts : For plugin component debugging (STT, TTS, LLM), create test example files prefixed with `test_` under the examples directory and run with `pnpm build && node ./examples/src/test_my_plugin.ts`

Applied to files:

  • agents/src/voice/transcription/synchronizer.test.ts
  • agents/src/voice/generation.ts
📚 Learning: 2026-01-16T14:33:39.551Z
Learnt from: CR
Repo: livekit/agents-js PR: 0
File: .cursor/rules/agent-core.mdc:0-0
Timestamp: 2026-01-16T14:33:39.551Z
Learning: Applies to **/*.{ts,tsx}?(test|example) : Initialize logger before using any LLM functionality with `initializeLogger({ pretty: true })` from '@livekit/agents'

Applied to files:

  • agents/src/voice/generation.ts
🧬 Code graph analysis (2)
agents/src/voice/transcription/synchronizer.test.ts (1)
agents/src/voice/transcription/synchronizer.ts (1)
  • SpeakingRateData (40-134)
agents/src/voice/transcription/synchronizer.ts (1)
agents/src/voice/io.ts (2)
  • TimedString (40-47)
  • isTimedString (72-79)
🔇 Additional comments (23)
.changeset/odd-moose-check.md (1)

1-9: LGTM!

The changeset is properly formatted with appropriate patch version bumps for the TTS-aligned transcripts feature across the affected packages.

agents/src/voice/agent.ts (5)

29-33: LGTM!

The imports for USERDATA_TIMED_TRANSCRIPT and TimedString are correctly added to support the TTS-aligned transcripts feature.


85-86: LGTM!

The private field declaration follows the existing pattern for optional agent configuration.


176-179: LGTM!

The getter correctly returns boolean | undefined, allowing the agent-level setting to be undefined when not explicitly set, enabling proper precedence logic in AgentActivity.


409-413: LGTM!

The timed transcripts are correctly attached to frame.userdata using the USERDATA_TIMED_TRANSCRIPT constant, with proper null/empty checks before assignment.
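For illustration, the guarded attachment described above might look like the following minimal sketch. The `FrameLike` shape, the `attachTimedTranscript` helper, and the string value given to `USERDATA_TIMED_TRANSCRIPT` are assumptions made for this example; the actual constant value and frame type come from the PR and are not reproduced here.

```typescript
// Placeholder value: the real constant is defined in the PR and may differ.
const USERDATA_TIMED_TRANSCRIPT = 'timed_transcript_placeholder';

// Simplified stand-ins for the real types (assumptions for this sketch).
interface TimedString {
  text: string;
  startTime?: number;
  endTime?: number;
}

interface FrameLike {
  userdata: Record<string, unknown>;
}

// Attach timed transcripts only when there is something to attach,
// so frames without alignment data keep their userdata untouched.
function attachTimedTranscript(frame: FrameLike, timed: TimedString[] | undefined): void {
  if (timed && timed.length > 0) {
    frame.userdata[USERDATA_TIMED_TRANSCRIPT] = timed;
  }
}
```

Skipping the assignment for `undefined` or empty arrays means downstream consumers can treat "key absent" as "no alignment available" without a second emptiness check.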


204-208: LGTM!

The transcriptionNode signature is correctly updated to accept and return ReadableStream<string | TimedString>, enabling the flow of timing information through the transcription pipeline.

agents/src/voice/transcription/synchronizer.test.ts (1)

7-206: LGTM!

Comprehensive test coverage for SpeakingRateData including:

  • Constructor initialization
  • addByRate with single/multiple entries and zero rate
  • addByAnnotation with buffering, flushing, and recursive endTime handling
  • accumulateTo with interpolation, extrapolation, and capping logic
  • Integration scenarios for realistic TTS word-timing workflows

The mathematical assertions align correctly with the implementation.

agents/src/voice/agent_activity.ts (5)

63-75: LGTM!

The imports for _TTSGenerationData, ToolExecutionOutput, and TimedString are correctly added to support the TTS-aligned transcripts feature integration.


362-366: LGTM!

The getter correctly implements precedence logic where the agent-level setting takes priority over the session-level setting using nullish coalescing.
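A minimal sketch of that precedence rule, assuming simplified `AgentLike` and `SessionOptionsLike` shapes and a hypothetical `resolveUseTtsAlignedTranscript` helper (none of these names are taken from the PR):

```typescript
// Simplified stand-ins; the real classes carry many more fields.
interface AgentLike {
  useTtsAlignedTranscript?: boolean; // undefined when not explicitly set
}

interface SessionOptionsLike {
  useTtsAlignedTranscript: boolean;
}

function resolveUseTtsAlignedTranscript(
  agent: AgentLike,
  session: SessionOptionsLike,
): boolean {
  // `??` falls through only on null/undefined, so an explicit `false`
  // on the agent still overrides a `true` session default.
  return agent.useTtsAlignedTranscript ?? session.useTtsAlignedTranscript;
}
```

Using `||` here instead would silently ignore an agent-level `false`, which is why the getter returning `boolean | undefined` matters.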


1243-1257: LGTM!

The code correctly uses the new _TTSGenerationData structure, extracting audioStream for audio forwarding.


1850-1864: LGTM!

The realtime generation path correctly types ttsTextInput and trTextInput as ReadableStream<string | TimedString>, maintaining consistency with the pipeline reply task.


1883-1891: LGTM!

The realtime path correctly uses the new _TTSGenerationData structure, extracting audioStream from ttsGenData.

agents/src/voice/transcription/synchronizer.ts (6)

11-17: LGTM!

The imports are correctly updated to include TimedString and isTimedString for handling timing information in the synchronizer.


36-134: LGTM!

The SpeakingRateData class is well-implemented with clear documentation. The timing annotation accumulation logic correctly handles:

  • Rate-based additions with integral calculation
  • Annotation-based additions with text buffering and flushing
  • Interpolation and extrapolation in accumulateTo

The class is appropriately exported for testing purposes.
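To make the interpolation idea concrete, here is a generic piecewise-linear sketch. It is not the `SpeakingRateData` implementation: the `Point` shape and `charsSpokenAt` function are hypothetical, and this version simply holds the last known value past the final annotation rather than extrapolating.

```typescript
// Cumulative amount of text spoken by a given timestamp (seconds).
interface Point {
  time: number;
  chars: number;
}

// Estimate how much text has been spoken at time `t`, assuming `points`
// is sorted by time and speech starts at (0, 0).
function charsSpokenAt(points: Point[], t: number): number {
  const clamped = Math.max(0, t);
  let prev: Point = { time: 0, chars: 0 };
  for (const p of points) {
    if (clamped <= p.time) {
      const span = p.time - prev.time;
      if (span <= 0) return p.chars;
      // Linear interpolation between the two surrounding annotations.
      return prev.chars + ((clamped - prev.time) / span) * (p.chars - prev.chars);
    }
    prev = p;
  }
  // Past the last annotation: hold at the last known value.
  return prev.chars;
}
```

With annotations at `(1s, 10)` and `(3s, 30)`, a query at `2s` lands halfway between them and yields 20.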


232-262: LGTM!

The pushText method is correctly updated to handle both string and TimedString inputs. The lazy initialization of annotatedRate and proper extraction of timing information enables accurate synchronization with TTS timing annotations.


352-388: LGTM!

The timing synchronization logic correctly prioritizes TTS timing annotations when available (annotatedRate) and falls back to the estimated hyphens-per-second calculation otherwise. The comparison of target vs forwarded text lengths ensures accurate synchronization with actual TTS playback timing.
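The annotation-first fallback can be sketched roughly as below. The `targetTextLength` function and `LengthAtTime` type are hypothetical names for this example, and the unit of "length" is left abstract; the real synchronizer's bookkeeping is more involved.

```typescript
// Maps elapsed playback time (seconds) to how much text should be shown.
type LengthAtTime = (elapsedSec: number) => number;

function targetTextLength(
  elapsedSec: number,
  estimatedRate: number, // estimated units of text per second
  annotated?: LengthAtTime, // present only when TTS timing annotations exist
): number {
  // Prefer measured timing from the TTS provider when it is available.
  if (annotated) return annotated(elapsedSec);
  // Otherwise fall back to a constant-rate estimate.
  return estimatedRate * elapsedSec;
}
```

The caller would then forward text only while the amount already forwarded is below this target.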


572-584: LGTM!

The updated flush logic correctly handles the TTS-aligned transcript flow where audio goes directly to the room while text still needs to be processed through the synchronizer. The distinction between "pending text" and "empty segment" ensures proper behavior for both flows.


623-661: LGTM!

The SyncedTextOutput changes correctly:

  1. Extract plain text from TimedString when the synchronizer is disabled (pass-through mode)
  2. Pass the full TimedString to pushText when enabled, preserving timing information
  3. Await the barrier in flush to ensure safe access to _impl after potential segment rotation

agents/src/voice/generation.ts (5)

27-39: LGTM!

The imports for USERDATA_TIMED_TRANSCRIPT and TimedString utilities are correctly added to support TTS-aligned transcript generation.


58-72: LGTM!

The _TTSGenerationData interface is well-designed, encapsulating the audio stream and timed transcripts future. Using a Future for timedTextsFut enables async resolution before the TTS loop completes.
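As a rough illustration of why a future helps here, the sketch below shows a minimal externally resolvable promise wrapper. This `Future` class and the `TtsGenerationDataLike` shape are assumptions for the example; the project's own `Future` utility and `_TTSGenerationData` fields may differ.

```typescript
// Minimal future: a promise whose resolution is triggered from outside.
class Future<T> {
  private _done = false;
  private _resolve!: (value: T) => void;
  readonly promise: Promise<T>;

  constructor() {
    // The executor runs synchronously, so `_resolve` is set before use.
    this.promise = new Promise<T>((resolve) => {
      this._resolve = resolve;
    });
  }

  get done(): boolean {
    return this._done;
  }

  resolve(value: T): void {
    if (this._done) return; // ignore repeated resolution
    this._done = true;
    this._resolve(value);
  }
}

// Hypothetical shape: the producer resolves the future as soon as timing
// data is known, while consumers await `timedTextsFut.promise`.
interface TtsGenerationDataLike {
  timedTextsFut: Future<string[]>;
}
```

The producer can hand the object to consumers immediately and resolve the future later, so neither side has to wait for the whole TTS loop to finish.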


519-553: LGTM!

The performTTSInference function correctly:

  1. Accepts ReadableStream<string | TimedString> for flexible input
  2. Creates a text-only stream for TTS consumption by extracting text from TimedString objects
  3. Handles both string and TimedString inputs uniformly

643-651: LGTM!

The return structure correctly provides the _TTSGenerationData object containing both the audio stream and timed texts future, enabling consumers to coordinate audio playback with synchronized transcription.


659-711: LGTM!

The text forwarding functions correctly handle ReadableStream<string | TimedString>:

  1. Extract text for accumulation in out.text
  2. Pass the original value (preserving TimedString) to textOutput.captureText for synchronized transcription
  3. Maintain consistent function signatures
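
The per-chunk handling in the list above can be sketched as follows. The `forwardChunk` helper, the `TextOut` shape, and the `captureText` callback signature are simplified assumptions for this example rather than the PR's actual signatures.

```typescript
interface TimedString {
  text: string;
  startTime?: number;
  endTime?: number;
}

interface TextOut {
  text: string; // accumulated plain text
}

function forwardChunk(
  value: string | TimedString,
  out: TextOut,
  captureText: (v: string | TimedString) => void,
): void {
  // Accumulate plain text regardless of the input variant.
  out.text += typeof value === 'string' ? value : value.text;
  // Pass the original value through so timing information is preserved.
  captureText(value);
}
```

Keeping the original object intact for `captureText` is what lets the synchronizer see `startTime` and `endTime` while `out.text` stays a plain string.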

