- AIRIS Lab, KAIST
- Daejeon, South Korea
- https://www.kirak.kim
- @_kirak_kim
Stars
The repository provides code for running inference with the Meta Segment Anything Audio Model (SAM-Audio), links for downloading the trained model checkpoints, and example notebooks that show how t…
Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audi…
Official repository for the 1st DAFx Parameter Estimation Challenge
Qwen3-Omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Dataset and evaluation code of ISDrama (ACM MM 2025): Immersive Spatial Drama Generation through Multimodal Prompting
[Python3] Octave-band and fractional octave-band filters for time-domain signals.
Implementation of the paper "Can Large Language Models Predict Audio Effects Parameters from Natural Language?"
A PyTorch implementation of the paper "ZigMa: A DiT-Style Mamba-based Diffusion Model" (ECCV 2024)
[NeurIPS 2025 Spotlight] Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization
A toolbox for skeleton-based action recognition.
[ICCV 2021] Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis.
Code and datasets for 'Few-Shot Audio-Visual Learning of Environment Acoustics' (NeurIPS 2022)
[NeurIPS 2025] An official implementation of Flow-GRPO: Training Flow Matching Models via Online RL
The code for "Graph Diffusion Transformer for Multi-Conditional Molecular Generation"
A MUSHRA-compliant, Web Audio API-based experiment software
Differentiable audio signal processors in PyTorch
An Easy-to-use, Scalable and High-performance RLHF Framework based on Ray (PPO & GRPO & REINFORCE++ & TIS & vLLM & Ray & Dynamic Sampling & Async Agentic RL)
[ISMIR 2025] A curated list of vision-to-music generation: methods, datasets, evaluation and challenges.
misuka-renderer/misuka (forked from mitsuba-renderer/mitsuba3): A differentiable room acoustic renderer
Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
Code for Novel View Acoustic Synthesis paper
Reference implementation for DPO (Direct Preference Optimization)
A text-conditional diffusion probabilistic model capable of generating high-fidelity audio.
PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech); reproduced demo: https://lifeiteng.github.io/valle/index.html
An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"
The official repo for Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation