
Long time interval when synthesizing Chinese text-to-speech #930

Closed
zhanghx0905 opened this issue Oct 16, 2024 · 8 comments

zhanghx0905 commented Oct 16, 2024

I have encountered an issue with the voice assistant when synthesizing Chinese text. The delay between the LLM output and the synthesized speech is noticeably longer when the reply is in Chinese than when it is in English, where synthesis proceeds without any delay.

I noticed that TTS synthesis almost always starts only after the LLM output has fully completed. I suspect something is wrong with the sentence tokenizer.

2024-10-16 21:01:46,321 - DEBUG livekit.agents.pipeline - synthesizing agent reply {"speech_id": "ed7b151599e7", "elapsed": 1.509}
2024-10-16 21:01:47,311 - DEBUG livekit.agents.pipeline - received first LLM token {"speech_id": "ed7b151599e7", "elapsed": 0.988}
2024-10-16 21:02:03,955 - DEBUG livekit.agents.pipeline - received first TTS frame {"speech_id": "ed7b151599e7", "elapsed": 16.644, "streamed": true}
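
For example, here is a quick check, assuming the nltk plugin relies on NLTK's punkt sentence splitter (requires the punkt data, e.g. nltk.download("punkt")). English punctuation is recognized, but the Chinese full-width marks apparently are not, so the whole Chinese reply stays buffered as one long "sentence":

from nltk import sent_tokenize

english = "Hello! I am a voice assistant. How can I help you?"
chinese = "你好!我是语音助手。有什么可以帮你的吗?"

print(sent_tokenize(english))  # splits into three sentences
print(sent_tokenize(chinese))  # likely a single element: 。 ! ? are not split on,
                               # so TTS only gets text once the LLM has finished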

Here is my code:

def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()

async def entrypoint(ctx: JobContext):
    # System prompt (in Chinese): "You are a voice assistant. You will interact
    # with the user via voice. Keep your answers short and clear, and use
    # correct punctuation to break sentences."
    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            """你是一个语音助手。你与用户的交互将通过语音进行。
你应该使用简短明了的回答,注意使用正确的标点符号断句。"""
        ),
    )

    logger.info(f"connecting to room {ctx.room.name}")
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    # wait for the first participant to connect
    participant = await ctx.wait_for_participant()
    logger.info(f"starting voice assistant for participant {participant.identity}")

    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=openai.STT(base_url=OPENAI_BASEURL, language="auto"),
        llm=openai.LLM(base_url=OPENAI_BASEURL, model=MODEL_NAME),
        tts=openai.TTS(base_url=OPENAI_BASEURL),
        transcription=AgentTranscriptionOptions(sentence_tokenizer=nltk.SentenceTokenizer(min_sentence_len=5)),
        chat_ctx=initial_ctx,
    )

    agent.start(ctx.room, participant)
    chat = rtc.ChatManager(ctx.room)

    async def answer_from_text(txt: str):
        chat_ctx = agent.chat_ctx.copy()
        chat_ctx.append(role="user", text=txt)
        stream = agent.llm.chat(chat_ctx=chat_ctx)
        await agent.say(stream)

    @chat.on("message_received")
    def on_chat_received(msg: rtc.ChatMessage):
        logger.info(msg)
        if msg.message:
            asyncio.create_task(answer_from_text(msg.message))

    await agent.say("你好,需要我的帮助吗?", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))

theomonnom commented Oct 16, 2024

Hey, yes, _basic_sent would need to be edited. Ideally, _basic_sent would also work for Chinese.
Is 。 the main character to look for when splitting sentences?

davidzhao commented Oct 16, 2024

@zhanghx0905 are you interested in helping to make this better for Chinese? I think we'd need the Chinese period (。).

@zhanghx0905 (Author)

@zhanghx0905 are you interested in helping to make this better for Chinese? I think we'd need the Chinese period (。).

I will make some attempts and see what I can do for this issue.


zhanghx0905 commented Oct 26, 2024

TEN-Agent was founded by a Chinese team. I checked their implementation, and it is not complicated: TTS is called whenever a sentence-ending symbol is matched.

    self.sentence_expr = re.compile(r".+?[,,.。!!??::]", re.DOTALL)

https://github.com/TEN-framework/TEN-Agent/blob/41d1a263f910916930b43cecb5278d26883c6a71/agents/ten_packages/extension/qwen_llm_python/qwen_llm_extension.py#L39C8-L39C71

I implemented similar logic for the livekit agent:

import functools
import re
from dataclasses import dataclass
from typing import List, Tuple

from livekit.agents.tokenize import token_stream, tokenizer

# Split after any Chinese or Western sentence/clause-ending punctuation mark.
_sentence_pattern = re.compile(r".+?[,,.。!!??::]", re.DOTALL)


@dataclass
class _TokenizerOptions:
    language: str
    min_sentence_len: int
    stream_context_len: int


class ChineseSentenceTokenizer(tokenizer.SentenceTokenizer):
    def __init__(
        self,
        *,
        language: str = "chinese",
        min_sentence_len: int = 10,
        stream_context_len: int = 10,
    ) -> None:
        self._config = _TokenizerOptions(
            language=language,
            min_sentence_len=min_sentence_len,
            stream_context_len=stream_context_len,
        )

    def tokenize(self, text: str, *, language: str | None = None) -> List[str]:
        sentences = self.chinese_sentence_segmentation(text)
        return [sentence[0] for sentence in sentences]

    def stream(self, *, language: str | None = None) -> tokenizer.SentenceStream:
        return token_stream.BufferedSentenceStream(
            tokenizer=functools.partial(self.chinese_sentence_segmentation),
            min_token_len=self._config.min_sentence_len,
            min_ctx_len=self._config.stream_context_len,
        )

    def chinese_sentence_segmentation(self, text: str) -> List[Tuple[str, int, int]]:
        # Returns (sentence, start, end) tuples; trailing text without a
        # terminator is emitted as a final chunk so nothing is dropped.
        result = []
        start_pos = 0

        for match in _sentence_pattern.finditer(text):
            sentence = match.group(0)
            end_pos = match.end()
            sentence = sentence.strip()
            if sentence:
                result.append((sentence, start_pos, end_pos))
            start_pos = end_pos

        if start_pos < len(text):
            sentence = text[start_pos:].strip()
            if sentence:
                result.append((sentence, start_pos, len(text)))

        return result
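
A quick sanity check of the segmentation (the sample text is just an example):

tok = ChineseSentenceTokenizer()
print(tok.tokenize("你好,今天天气不错。我们出去走走吧"))
# -> ['你好,', '今天天气不错。', '我们出去走走吧']
# (trailing text without a terminator is still returned as the final chunk)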

You can use this class as follows:

    agent = VoicePipelineAgent(
        # ...
        transcription=AgentTranscriptionOptions(
            sentence_tokenizer=ChineseSentenceTokenizer(),
        ),
    )


zhanghx0905 commented Oct 26, 2024

I have more questions.
How does WordTokenizer work? Should I implement a Chinese version?
What does preemptive_synthesis=True mean? @davidzhao

@davidzhao (Member)

WordTokenizer is used to sync realtime transcriptions (so we emit them word by word). For Chinese, it really just splits the text into individual Unicode characters.
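
In other words, something roughly like this (just an illustration, not the library's built-in WordTokenizer):

def split_chinese_chars(text: str) -> list[str]:
    # Treat every non-whitespace character as its own "word".
    return [ch for ch in text if not ch.isspace()]

print(split_chinese_chars("你好 世界"))  # -> ['你', '好', '世', '界']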


zhanghx0905 commented Oct 28, 2024

I think I've finally figured out what most of the parameters really mean, and here is my final solution:

from livekit.agents import tts as _tts

tts = _tts.StreamAdapter(
    tts=openai.TTS(base_url=OPENAI_BASEURL),
    sentence_tokenizer=ChineseSentenceTokenizer(min_sentence_len=10),
)
agent = VoicePipelineAgent(
    # ...
    tts=tts,
)
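
If I understand it correctly, the StreamAdapter buffers the LLM token stream with the given sentence tokenizer and synthesizes each sentence as soon as it is complete, so the first TTS frame no longer waits for the full reply.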
