UESTC PhD, TJU Master's
Starred repositories
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
LPIPS metric. pip install lpips
3D ResNets for Action Recognition (CVPR 2018)
State-of-the-art deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio.
🚀 Efficient implementations of state-of-the-art linear attention models
OpenMMLab Pre-training Toolbox and Benchmark
Scenic: A Jax Library for Computer Vision Research and Beyond
Vector (and Scalar) Quantization, in Pytorch
Whisper realtime streaming for long speech-to-text transcription and translation
Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.
An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.
[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
RetinaFace reaches 80.99% on the WIDER FACE hard validation set using MobileNet-0.25.
[ICLR 2023] Official implementation of the paper "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection"
GeneFace: Generalized and High-Fidelity 3D Talking Face Synthesis; ICLR 2023; Official code
A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization
✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
This library provides common speech features for ASR including MFCCs and filterbank energies.
label-smooth, amsoftmax, partial-fc, focal-loss, triplet-loss, lovasz-softmax. Maybe useful
Official PyTorch code for "BAM: Bottleneck Attention Module (BMVC2018)" and "CBAM: Convolutional Block Attention Module (ECCV2018)"
PyTorch implementation of VALL-E (zero-shot text-to-speech); reproduced demo: https://lifeiteng.github.io/valle/index.html
Speech Recognition using DeepSpeech2.
An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
This project builds on SadTalker to implement Wav2Lip video lip-sync synthesis. Lip shapes are generated by driving a video file with speech, and a configurable enhancement of the face region sharpens the synthesized lip (face) area. The DAIN frame-interpolation deep-learning algorithm then adds intermediate frames to the generated video, smoothing the lip motion between frames so the synthesized lips look more fluent, realistic, and natural.
🔓 Lip Reading - Cross Audio-Visual Recognition using 3D Architectures
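One entry above collects common loss tricks (label smoothing, focal loss, etc.). As a minimal pure-Python sketch (not that repo's code), label-smoothed cross-entropy replaces the one-hot target with a mixture of the one-hot vector and a uniform distribution; `eps` and the function name here are illustrative assumptions:

```python
import math

def label_smooth_ce(logits, target, eps=0.1):
    """Cross-entropy with label smoothing.

    The one-hot target is replaced by (1 - eps) * one_hot + eps / K,
    where K is the number of classes. `logits` is a list of raw
    scores and `target` is the index of the true class.
    """
    k = len(logits)
    # Log-softmax, computed stably by subtracting the max logit.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    # Smoothed target distribution.
    q = [(1 - eps) * (1.0 if i == target else 0.0) + eps / k
         for i in range(k)]
    # Cross-entropy between smoothed targets and predicted log-probs.
    return -sum(qi * lp for qi, lp in zip(q, log_probs))

# With eps = 0 this reduces to ordinary cross-entropy.
loss = label_smooth_ce([2.0, 0.5, -1.0], target=0, eps=0.1)
```

With `eps = 0` the smoothed distribution is the plain one-hot target, so the function degrades gracefully to standard cross-entropy; the smoothing term penalizes overconfident predictions, which is why it often appears alongside metric-learning losses such as AM-Softmax.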