
Long time interval when synthesizing Chinese text-to-speech #930

Closed
zhanghx0905 opened this issue Oct 16, 2024 · 8 comments

zhanghx0905 commented Oct 16, 2024

I have encountered an issue with the voice assistant when synthesizing Chinese text. The delay between the LLM output and the synthesized speech is noticeably longer when the reply is in Chinese than when it is in English, where synthesis proceeds without any delay.

I noticed that TTS synthesis almost always starts only after the LLM output has fully completed. I suspect something is wrong with the sentence tokenizer.

2024-10-16 21:01:46,321 - DEBUG livekit.agents.pipeline - synthesizing agent reply {"speech_id": "ed7b151599e7", "elapsed": 1.509}
2024-10-16 21:01:47,311 - DEBUG livekit.agents.pipeline - received first LLM token {"speech_id": "ed7b151599e7", "elapsed": 0.988}
2024-10-16 21:02:03,955 - DEBUG livekit.agents.pipeline - received first TTS frame {"speech_id": "ed7b151599e7", "elapsed": 16.644, "streamed": true}
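
For example, here is a quick check, assuming the nltk plugin relies on NLTK's punkt sentence splitter (requires the punkt data, e.g. nltk.download("punkt")). English punctuation is recognized, but the Chinese full-width marks apparently are not, so the whole Chinese reply stays buffered as one long "sentence":

from nltk import sent_tokenize

english = "Hello! I am a voice assistant. How can I help you?"
chinese = "你好!我是语音助手。有什么可以帮你的吗?"

print(sent_tokenize(english))  # splits into three sentences
print(sent_tokenize(chinese))  # likely a single element: 。 ! ? are not split on,
                               # so TTS only gets text once the LLM has finished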

Here is my code:

def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()

async def entrypoint(ctx: JobContext):
    # System prompt (in Chinese): "You are a voice assistant. You will interact
    # with the user via voice. Keep your answers short and clear, and use
    # correct punctuation to break sentences."
    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            """你是一个语音助手。你与用户的交互将通过语音进行。
你应该使用简短明了的回答,注意使用正确的标点符号断句。"""
        ),
    )

    logger.info(f"connecting to room {ctx.room.name}")
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    # wait for the first participant to connect
    participant = await ctx.wait_for_participant()
    logger.info(f"starting voice assistant for participant {participant.identity}")

    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=openai.STT(base_url=OPENAI_BASEURL, language="auto"),
        llm=openai.LLM(base_url=OPENAI_BASEURL, model=MODEL_NAME),
        tts=openai.TTS(base_url=OPENAI_BASEURL),
        transcription=AgentTranscriptionOptions(sentence_tokenizer=nltk.SentenceTokenizer(min_sentence_len=5)),
        chat_ctx=initial_ctx,
    )

    agent.start(ctx.room, participant)
    chat = rtc.ChatManager(ctx.room)

    async def answer_from_text(txt: str):
        chat_ctx = agent.chat_ctx.copy()
        chat_ctx.append(role="user", text=txt)
        stream = agent.llm.chat(chat_ctx=chat_ctx)
        await agent.say(stream)

    @chat.on("message_received")
    def on_chat_received(msg: rtc.ChatMessage):
        logger.info(msg)
        if msg.message:
            asyncio.create_task(answer_from_text(msg.message))

    await agent.say("你好,需要我的帮助吗?", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))

theomonnom commented Oct 16, 2024

Hey, yes, _basic_sent would need to be edited. Ideally, _basic_sent would also work for Chinese.
Is 。 the main character to look for when splitting sentences?

davidzhao commented Oct 16, 2024

@zhanghx0905 are you interested in helping to make this better for Chinese? I think we'd need the Chinese period (。).

@zhanghx0905 (Author)

@zhanghx0905 are you interested in helping to make this better for Chinese? I think we'd need the Chinese period (。).

I will make some attempts and see what I can do for this issue.


zhanghx0905 commented Oct 26, 2024

TEN-Agent was founded by a Chinese team. I checked their implementation, and it is not complicated: TTS is called whenever a sentence-ending symbol is matched.

    self.sentence_expr = re.compile(r".+?[,,.。!!??::]", re.DOTALL)

https://github.com/TEN-framework/TEN-Agent/blob/41d1a263f910916930b43cecb5278d26883c6a71/agents/ten_packages/extension/qwen_llm_python/qwen_llm_extension.py#L39C8-L39C71

I implemented similar logic for the livekit agent:

import functools
import re
from dataclasses import dataclass
from typing import List, Tuple

from livekit.agents.tokenize import token_stream, tokenizer

# Split after any Chinese or Western sentence/clause-ending punctuation mark.
_sentence_pattern = re.compile(r".+?[,,.。!!??::]", re.DOTALL)


@dataclass
class _TokenizerOptions:
    language: str
    min_sentence_len: int
    stream_context_len: int


class ChineseSentenceTokenizer(tokenizer.SentenceTokenizer):
    def __init__(
        self,
        *,
        language: str = "chinese",
        min_sentence_len: int = 10,
        stream_context_len: int = 10,
    ) -> None:
        self._config = _TokenizerOptions(
            language=language,
            min_sentence_len=min_sentence_len,
            stream_context_len=stream_context_len,
        )

    def tokenize(self, text: str, *, language: str | None = None) -> List[str]:
        sentences = self.chinese_sentence_segmentation(text)
        return [sentence[0] for sentence in sentences]

    def stream(self, *, language: str | None = None) -> tokenizer.SentenceStream:
        return token_stream.BufferedSentenceStream(
            tokenizer=functools.partial(self.chinese_sentence_segmentation),
            min_token_len=self._config.min_sentence_len,
            min_ctx_len=self._config.stream_context_len,
        )

    def chinese_sentence_segmentation(self, text: str) -> List[Tuple[str, int, int]]:
        # Returns (sentence, start, end) tuples; trailing text without a
        # terminator is emitted as a final chunk so nothing is dropped.
        result = []
        start_pos = 0

        for match in _sentence_pattern.finditer(text):
            sentence = match.group(0)
            end_pos = match.end()
            sentence = sentence.strip()
            if sentence:
                result.append((sentence, start_pos, end_pos))
            start_pos = end_pos

        if start_pos < len(text):
            sentence = text[start_pos:].strip()
            if sentence:
                result.append((sentence, start_pos, len(text)))

        return result
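
A quick sanity check of the segmentation (the sample text is just an example):

tok = ChineseSentenceTokenizer()
print(tok.tokenize("你好,今天天气不错。我们出去走走吧"))
# -> ['你好,', '今天天气不错。', '我们出去走走吧']
# (trailing text without a terminator is still returned as the final chunk)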

You can use this class as follows:

    agent = VoicePipelineAgent(
        # ...
        transcription=AgentTranscriptionOptions(
            sentence_tokenizer=ChineseSentenceTokenizer(),
        ),
    )


zhanghx0905 commented Oct 26, 2024

I have more questions.
How does WordTokenizer work? Should I implement a Chinese version?
What does preemptive_synthesis=True mean? @davidzhao

@davidzhao (Member)

WordTokenizer is used to sync realtime transcriptions (so we emit them word by word). For Chinese, it really just splits the text into individual Unicode characters.
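
In other words, something roughly like this (just an illustration, not the library's built-in WordTokenizer):

def split_chinese_chars(text: str) -> list[str]:
    # Treat every non-whitespace character as its own "word".
    return [ch for ch in text if not ch.isspace()]

print(split_chinese_chars("你好 世界"))  # -> ['你', '好', '世', '界']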


zhanghx0905 commented Oct 28, 2024

I think I've finally figured out what most of the parameters really mean, and here is my final solution:

from livekit.agents import tts as _tts

tts = _tts.StreamAdapter(
    tts=openai.TTS(base_url=OPENAI_BASEURL),
    sentence_tokenizer=ChineseSentenceTokenizer(min_sentence_len=10),
)
agent = VoicePipelineAgent(
    # ...
    tts=tts,
)
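
If I understand it correctly, the StreamAdapter buffers the LLM token stream with the given sentence tokenizer and synthesizes each sentence as soon as it is complete, so the first TTS frame no longer waits for the full reply.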
