Stars
12
stars
written in Python
Clear filter
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
Kyutai's Speech-To-Text and Text-To-Speech models based on the Delayed Streams Modeling framework.
Convolutional neural network model for video classification trained on the Kinetics dataset.
Googles NotebookLM but local
Simple high-throughput inference library
Fine-tuning Moshi/J-Moshi on your own spoken dialogue data
A text embedding extension for the Polars Dataframe library.