Starred repositories

We introduce temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of multimodal foundation models (MFMs). This plug-and-play module can be easily integrated into …

Python 312 30 Updated Nov 26, 2025

EVA Series: Visual Representation Fantasies from BAAI

Python 2,624 188 Updated Aug 1, 2024

A 5-way embedding model for text, audio, image, video, and 3D point clouds.

Python 8 3 Updated Nov 13, 2025

A dataset of 100M connections between 5 different modalities.

54 4 Updated Nov 14, 2025

Uses machine learning to denoise audio containing speech

Python 47 2 Updated Jun 22, 2024

Code implementation for the paper "Large-scale Pre-training for Grounded Video Caption Generation" (ICCV 2025)

Python 26 Updated Nov 9, 2025

AnyTalker: Scaling Multi-person Talking Video Generation with Interactivity Refinement

Python 231 37 Updated Dec 5, 2025

[ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning

Python 1,435 85 Updated Jun 26, 2025

Video Grounding and Captioning

Python 332 73 Updated Oct 12, 2021

[ISMIR 2025] A curated list of vision-to-music generation: methods, datasets, evaluation and challenges.

113 3 Updated Aug 9, 2025

A curated list of Video to Audio Generation

88 3 Updated Nov 22, 2025

SoulX-Podcast is an inference codebase by the Soul AI team for generating high-fidelity podcasts from text.

Python 2,741 337 Updated Dec 11, 2025

Open-source industrial-grade ASR models supporting Mandarin, Chinese dialects and English, achieving a new SOTA on public Mandarin ASR benchmarks, while also offering outstanding singing lyrics rec…

Python 1,675 153 Updated Sep 22, 2025
Python 65 4 Updated Dec 5, 2025
Python 36 Updated Jul 4, 2024

Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation

Python 407 28 Updated Nov 27, 2025

Robust Speech Recognition via Large-Scale Weak Supervision

Python 92,169 11,547 Updated Dec 15, 2025
Python 1,453 152 Updated Nov 15, 2025

Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.

Jupyter Notebook 3,137 191 Updated Oct 9, 2025
Jupyter Notebook 46 Updated Apr 13, 2025

[CVPR 2023] Official implementation of the paper: Fine-grained Audible Video Description

Python 74 5 Updated Dec 4, 2023

Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"

Python 8,178 736 Updated May 31, 2024

PAM is a no-reference audio quality metric for audio generation tasks

Python 76 6 Updated Jul 19, 2024

ACE-Step: A Step Towards Music Generation Foundation Model

Python 3,472 418 Updated Jun 27, 2025

Unified automatic quality assessment for speech, music, and sound.

Python 649 48 Updated Jun 5, 2025

🔥 1Panel provides an intuitive web interface and MCP Server to manage websites, files, containers, databases, and LLMs on a Linux server.

Go 32,532 2,885 Updated Dec 19, 2025

Kronos: A Foundation Model for the Language of Financial Markets

Python 9,569 2,051 Updated Nov 10, 2025

Let's make video diffusion practical!

Python 16,361 1,592 Updated Oct 16, 2025

Wan: Open and Advanced Large-Scale Video Generative Models

Python 12,909 1,503 Updated Dec 17, 2025

HunyuanVideo-Foley Demo Pages

TypeScript 4 Updated Aug 28, 2025