UESTC PhD, TJU Master's
Starred repositories
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
LPIPS metric. pip install lpips
3D ResNets for Action Recognition (CVPR 2018)
State-of-the-art deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio.
🚀 Efficient implementations of state-of-the-art linear attention models
OpenMMLab Pre-training Toolbox and Benchmark
Scenic: A Jax Library for Computer Vision Research and Beyond
Vector (and Scalar) Quantization, in Pytorch
Whisper realtime streaming for long speech-to-text transcription and translation
Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.
An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.
[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
RetinaFace reaches 80.99% on the WIDER FACE hard validation set using MobileNet-0.25.
[ICLR 2023] Official implementation of the paper "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection"
GeneFace: Generalized and High-Fidelity 3D Talking Face Synthesis; ICLR 2023; Official code
A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization
✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
This library provides common speech features for ASR including MFCCs and filterbank energies.
label-smooth, amsoftmax, partial-fc, focal-loss, triplet-loss, lovasz-softmax. Maybe useful
Official PyTorch code for "BAM: Bottleneck Attention Module (BMVC2018)" and "CBAM: Convolutional Block Attention Module (ECCV2018)"
PyTorch implementation of VALL-E (zero-shot text-to-speech); reproduced demo: https://lifeiteng.github.io/valle/index.html
Speech Recognition using DeepSpeech2.
An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
This project builds on SadTalker to implement Wav2Lip video lip-sync synthesis. Lip shapes are generated by driving a video file with speech, and a configurable enhancement of the face region sharpens the synthesized lip (face) area. The DAIN frame-interpolation deep-learning algorithm then adds intermediate frames to the generated video, smoothing the lip motion between frames so the synthesized lips look more fluent, realistic, and natural.
🔓 Lip Reading - Cross Audio-Visual Recognition using 3D Architectures
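One entry above collects common loss tricks (label smoothing, focal loss, etc.). As a minimal pure-Python sketch (not that repo's code), label-smoothed cross-entropy replaces the one-hot target with a mixture of the one-hot vector and a uniform distribution; `eps` and the function name here are illustrative assumptions:

```python
import math

def label_smooth_ce(logits, target, eps=0.1):
    """Cross-entropy with label smoothing.

    The one-hot target is replaced by (1 - eps) * one_hot + eps / K,
    where K is the number of classes. `logits` is a list of raw
    scores and `target` is the index of the true class.
    """
    k = len(logits)
    # Log-softmax, computed stably by subtracting the max logit.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    # Smoothed target distribution.
    q = [(1 - eps) * (1.0 if i == target else 0.0) + eps / k
         for i in range(k)]
    # Cross-entropy between smoothed targets and predicted log-probs.
    return -sum(qi * lp for qi, lp in zip(q, log_probs))

# With eps = 0 this reduces to ordinary cross-entropy.
loss = label_smooth_ce([2.0, 0.5, -1.0], target=0, eps=0.1)
```

With `eps = 0` the smoothed distribution is the plain one-hot target, so the function degrades gracefully to standard cross-entropy; the smoothing term penalizes overconfident predictions, which is why it often appears alongside metric-learning losses such as AM-Softmax.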