UESTC PhD, TJU Master's
Starred repositories
Implementation of the Sparsemax activation in PyTorch
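Sparsemax replaces softmax with a Euclidean projection onto the probability simplex, so it can assign exactly zero probability to low-scoring entries. A minimal NumPy sketch of the closed-form projection (an illustration of the operation, not the repo's PyTorch API):

```python
import numpy as np

def sparsemax(z: np.ndarray) -> np.ndarray:
    """Project a 1-D logit vector onto the probability simplex (sparsemax)."""
    z_sorted = np.sort(z)[::-1]          # logits in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    # support: largest k with 1 + k * z_(k) > sum of the top-k logits
    support = 1 + k * z_sorted > cumsum
    k_z = k[support][-1]
    tau = (cumsum[k_z - 1] - 1.0) / k_z  # threshold subtracted from all logits
    return np.maximum(z - tau, 0.0)

p = sparsemax(np.array([2.0, 1.0, -1.0]))  # puts all mass on the top logit
```

Unlike softmax, the output here is exactly `[1.0, 0.0, 0.0]`: the gap between the top logit and the rest is large enough that the projection lands on a vertex of the simplex.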
Simple conversion and localization between simplified and traditional Chinese using tables from MediaWiki.
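Table-based conversion of this kind is a per-character (or per-phrase) dictionary lookup. A sketch with a tiny hand-picked table — the real project uses MediaWiki's full conversion tables, which cover thousands of characters and multi-character phrases:

```python
# Tiny illustrative simplified -> traditional table (assumption: the
# real MediaWiki tables are far larger and also handle phrases).
S2T = {"国": "國", "汉": "漢", "语": "語", "书": "書"}

def s2t(text: str) -> str:
    """Convert simplified characters to traditional via table lookup;
    characters not in the table pass through unchanged."""
    return "".join(S2T.get(ch, ch) for ch in text)

print(s2t("汉语书"))  # 漢語書
```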
A python binding for FFmpeg which provides sync and async APIs
Python bindings for FFmpeg - with complex filtering support
daanzu / py-webrtcvad-wheels
Forked from wiseman/py-webrtcvad. Python interface to the WebRTC Voice Activity Detector (VAD) [released with binary wheels!]
Python interface to the WebRTC Voice Activity Detector
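The WebRTC VAD accepts only 10, 20, or 30 ms frames of 16-bit mono PCM at 8/16/32/48 kHz, so callers must chunk audio themselves. A sketch of that frame-splitting step (the actual `webrtcvad.Vad.is_speech` call is left out so the snippet stays self-contained):

```python
def frame_generator(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30):
    """Yield fixed-size frames of 16-bit mono PCM, dropping any
    trailing partial frame (the VAD rejects short frames)."""
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes/sample
    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[offset:offset + frame_bytes]

# One second of 16 kHz silence -> 33 full 30 ms frames (the last 10 ms is dropped)
frames = list(frame_generator(b"\x00\x00" * 16000))
```

Each frame would then be passed to `vad.is_speech(frame, sample_rate)`.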
Advanced data structures for handling temporal segments with attached labels.
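The basic object in such a library is an interval with an attached label (e.g. a speaker turn in a diarization output). A minimal sketch of that idea — the names `Segment`, `overlaps`, and `total_duration` are illustrative, not the repo's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    """A labeled temporal segment [start, end) in seconds."""
    start: float
    end: float
    label: str

    @property
    def duration(self) -> float:
        return self.end - self.start

    def overlaps(self, other: "Segment") -> bool:
        return self.start < other.end and other.start < self.end

def total_duration(segments, label):
    """Total time attributed to one label (segments assumed non-overlapping)."""
    return sum(s.duration for s in segments if s.label == label)

track = [Segment(0.0, 2.5, "spk1"), Segment(2.5, 4.0, "spk2"), Segment(4.0, 5.0, "spk1")]
```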
Identifying "who speaks when" using visual speech input and a pretrained lip-sync expert
A modern implementation of SyncNet (Python 3.9–3.13)
MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation [ECCV2020]
Official implementation of "Speaker-Adaptive Lipreading via Spatio-Temporal Information Learning"
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
A novel cross-modal decoupling and alignment framework for multimodal representation learning.
Code for the Interspeech 2024 paper "MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting"
The official implementation of StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models, accepted by IJCV
FunCodec is a research-oriented toolkit for audio quantization and downstream applications such as text-to-speech synthesis and music generation.
CAS-VSR-MOV20: A challenging dataset for Chinese visual speech recognition, consisting of video clips from 20 movies.
A collection of resources and papers on Vector Quantized Variational Autoencoder (VQ-VAE) and its application
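The core VQ-VAE operation is nearest-neighbour lookup of encoder outputs in a learned codebook. A NumPy sketch of just the quantization step (training, the straight-through gradient estimator, and the commitment loss are omitted):

```python
import numpy as np

def quantize(z: np.ndarray, codebook: np.ndarray):
    """Map each row of z (N, D) to its nearest codebook vector (K, D)
    under squared Euclidean distance; return quantized vectors and indices."""
    # (N, 1, D) - (1, K, D) broadcasts to (N, K, D); sum over D gives distances
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
zq, idx = quantize(np.array([[0.1, -0.1], [0.9, 1.2]]), codebook)
```

Here the first input snaps to code 0 and the second to code 1; downstream, only the integer indices need to be stored or modelled.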
[ICCV 2025] Code release of Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
Awesome Unified Multimodal Models
This repo contains a from-scratch PyTorch implementation of VQGAN (Taming Transformers for High-Resolution Image Synthesis), with added support for custom datasets, testing, and experiment tracking.
This repository contains the code for our upcoming paper An Investigation of End-to-End Models for Robust Speech Recognition at ICASSP 2021.