About Me
Xize Cheng (成曦泽) is a fourth-year doctoral candidate at the School of Computer Science and Technology, Zhejiang University, expecting to graduate in June 2026. He is advised by Professor Zhou Zhao. He is actively looking for academic collaboration; feel free to drop him an email.
In 2024, I led or participated in the following research topics:
- Large Language Models (LLMs): Spoken Dialogue Systems / Audio Large Language Models
- Audio Understanding: Sound Separation Model / Audio-Visual Speech
🔥 News
- 2025.01: 🎉🎉 4 papers (2 first author) are accepted by ICLR 2025!
- 2024.09: 🎉🎉 3 papers are accepted by NeurIPS 2024!
- 2024.09: 🎉🎉 1 paper (co-first author) is accepted by EMNLP 2024!
- 2024.07: 🎉🎉 2 papers (1 first author & 1 corresponding author) are accepted by ACL 2024!
- 2024.05: 🎉🎉 3 papers (2 co-first author & 1 corresponding author, incl. 1 oral presentation) are accepted by ACMMM 2024!
- 2024.03: 🎉🎉 2 papers are accepted by ICML 2024!
- 2024.01: I started my internship at Alibaba, DAMO Academy, Tongyi Lab.
- 2023.10: 🎉🎉 I was awarded the National Scholarship (2023, Graduate student), Top 0.1% at Zhejiang University.
- 2023.09: 🎉🎉 1 paper is accepted by EMNLP 2023!
- 2023.09: 🎉🎉 1 paper is accepted by NeurIPS 2023!
- 2023.07: 🎉🎉 1 paper (1 co-first author) is accepted by ACMMM 2023!
- 2023.06: AV-TranSpeech comes out! Media coverage: PaperWeekly and ByteDance.
- 2023.05: 🎉🎉 3 papers (1 first author) are accepted by ICCV 2023!
- 2023.05: OpenSR will be presented in an oral presentation at ACL 2023!
- 2023.05: 🎉🎉 7 papers (1 first author & 2 co-first author & 2 oral presentations) are accepted by ACL 2023!
- 2023.03: We created the first Audio-Visual Multi-lingual Speech Translation dataset AVMuST-TED!
- 2022.10: I was awarded the Outstanding Graduate Student and Triple Excellence Graduate Student of Zhejiang University!
- 2021.03: I started my internship at Taobao as an algorithm intern, conducting multi-modality research.
📝 Publications
- AVSET-10M: An Open Large-Scale Audio-Visual Dataset with High Correspondence. Xize Cheng, Ziang Zhang, Zehan Wang, Minghui Fang, Rongjie Huang, Siqi Zheng, Ruofan Hu, Jionghao Bai, Tao Jin, Zhou Zhao. Under Review
- OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup. Xize Cheng, Siqi Zheng, Zehan Wang, Minghui Fang, Ziang Zhang, Rongjie Huang, Ziyang Ma, Shengpeng Ji, Jialong Zuo, Tao Jin, Zhou Zhao. ICLR2025
- OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment. Xize Cheng, Tao Jin, Linjun Li, Wang Lin, Xinyu Duan, Zhou Zhao. ACL2023 (Oral)
- MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition. Xize Cheng, Tao Jin, Rongjie Huang, Linjun Li, Wang Lin, Zehan Wang, Huadai Liu, Ye Wang, Aoxiong Yin, Zhou Zhao. ICCV2023
- AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation. Rongjie Huang*, Xize Cheng*, Huadai Liu*, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin, Zhou Zhao. ACL2023
Full Publication List
[*] denotes co-first authors, [#] denotes co-supervised, [✉] denotes corresponding author.
Spoken Dialogue System & Audio-Visual Speech Understanding
- ICLR2025: VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words? Xize Cheng, Ruofan Hu, Xiaoda Yang, Jingyu Lu, Dongjie Fu, Zehan Wang, Shengpeng Ji, Rongjie Huang, Boyang Zhang, Tao Jin, Zhou Zhao.
- ICLR2025: WavTokenizer: An Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling. Shengpeng Ji, Ziyue Jiang, Xize Cheng, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Zhou Zhao, et al.
- Survey: WavChat: A Survey of Spoken Dialogue Models. Shengpeng Ji, Shujie Liu, Xize Cheng, Jian Li, Yidi Jiang, Jingzhen He, Yunfei Chu, Jin Xu, Zhou Zhao, et al.
- EMNLP2024: AudioVSR: Enhancing Video Speech Recognition with Audio Data. Xiaoda Yang*#, Xize Cheng*, Jiaqi Duan, Hongshun Qiu, Minjie Hong, Minghui Fang, Shengpeng Ji, Jialong Zuo, Zhiqing Hong, Zhimeng Zhang, Tao Jin.
- ACMMM2024: SyncTalklip: Highly Synchronized Lip-Readable Speaker Generation with Multi-Task Learning. Xiaoda Yang*#, Xize Cheng*, Dongjie Fu, Minghui Fang, Jialong Zuo, Shengpeng Ji, Zhou Zhao, Tao Jin.
- ACMMM2024: SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing. Lingyu Xiong#, Xize Cheng✉, Jintao Tan#, Xianjia Wu, Xiandong Li, Lei Zhu, Fei Ma, Minglei Li, Huang Xu, Zhihui Hu.
- ACMMM2024 Oral: Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts. Dongjie Fu#, Xize Cheng, Xiaoda Yang#, Hanting Wang, Zhou Zhao, Tao Jin.
- ACL2024: TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation. Xize Cheng, Rongjie Huang, Linjun Li, Tao Jin, Zehan Wang, Aoxiong Yin, Minglei Li, Xinyu Duan, Zhou Zhao.
- ACL2024: Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation. Songju Lei#, Xize Cheng✉, Mengjiao Lyu, Jianqiao Hu, Jintao Tan#, Runlin Liu, Lingyu Xiong#, Tao Jin, Xiandong Li, Zhou Zhao.
- ICCV2023: MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition. Xize Cheng, Tao Jin, Rongjie Huang, Linjun Li, Wang Lin, Zehan Wang, Huadai Liu, Ye Wang, Aoxiong Yin, Zhou Zhao.
- ACL2023: AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation. Rongjie Huang*, Xize Cheng*, Huadai Liu*, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin, Zhou Zhao.
- ACL2023: Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation. Linjun Li*, Tao Jin*, Xize Cheng*, Ye Wang, Wang Lin, Rongjie Huang, and Zhou Zhao.
Multi-Modal Alignment
- Under Review: AVSET-10M: An Open Large-Scale Audio-Visual Dataset with High Correspondence. Xize Cheng, Ziang Zhang, Zehan Wang, Minghui Fang, Rongjie Huang, Siqi Zheng, Ruofan Hu, Jionghao Bai, Tao Jin, Zhou Zhao.
- ICLR2025: OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup. Xize Cheng, Siqi Zheng, Zehan Wang, Minghui Fang, Ziang Zhang, Rongjie Huang, Ziyang Ma, Shengpeng Ji, Jialong Zuo, Tao Jin, Zhou Zhao.
- ICML2024: OmniBind: Large-Scale Omni Multimodal Representation via Binding Spaces. Zehan Wang, Ziang Zhang, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Hengshuang Zhao, Zhou Zhao.
- NeurIPS2023: Connecting Multi-modal Contrastive Representations. Zehan Wang, Yang Zhao, Xize Cheng, Haifeng Huang, Jiageng Liu, Li Tang, Linjun Li, Yongqi Wang, Aoxiong Yin, Ziang Zhang, Zhou Zhao.
- ACMMM2023: Rethinking Missing Modality Learning from a Decoding Perspective. Tao Jin*, Xize Cheng*, Linjun Li, Wang Lin, Ye Wang, Zhou Zhao.
- ACL2023 Oral: Weakly-Supervised Spoken Video Grounding via Semantic Interaction Learning. Ye Wang, Wang Lin, Shengyu Zhang, Tao Jin, Linjun Li, Xize Cheng, and Zhou Zhao.
- ACL2023 Oral: OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment. Xize Cheng, Tao Jin, Linjun Li, Wang Lin, Xinyu Duan, Zhou Zhao.
📖 Education
- 2021.09 - 2026.06, Ph.D., Zhejiang University, Hangzhou.
- 2017.09 - 2021.06, Undergraduate, Shandong University, Jinan.
🎖 Honors and Awards
- National Scholarship (2023, Graduate student), Top 0.1% at Zhejiang University.
- Excellent Graduate, Shandong Province (2021), Top 1%.
- Outstanding Student Cadres (2017-2021 in Shandong University and 2021-2023 in Zhejiang University), Top 1%.
- Academic Scholarship (2017-2021 in Shandong University and 2021-2023 in Zhejiang University), Top 3%.
- Outstanding Graduate Student & Triple Excellence Graduate Student (2022) in Zhejiang University.
- Meritorious Winner (First Prize) in the Mathematical Contest in Modeling (MCM, 2019), Top 7% worldwide.
- First Prize of National Mathematical Modeling Competition in Shandong Province (2018).
💬 Professional Services
- Conference Reviewer: ARR 2023, ICCV 2023, ACL 2023
- Assisted Reviewing: KDD 2022, TNNLS 2022, TMM 2022, TMM 2023
💻 Internships & Projects
- 2024.01 - 2024.09: Research Intern, Alibaba DAMO Academy, Tongyi Lab, Hangzhou, China.
Research on Audio-Visual Sound Separation and Spoken Dialogue Systems.
- 2021.02 - 2021.08: Algorithm Engineer Intern, Taobao (China) Software.
Research on Multi-modality Interaction.