| Ego4d: Around the world in 3,000 hours of egocentric video K Grauman, A Westbury, E Byrne, Z Chavis, A Furnari, R Girdhar, ... Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2022 | 2003 | 2022 |
| Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation JZ Wu, Y Ge, X Wang, SW Lei, Y Gu, Y Shi, W Hsu, Y Shan, X Qie, ... Proceedings of the IEEE/CVF international conference on computer vision …, 2023 | 1379 | 2023 |
| Temporal action localization in untrimmed videos via multi-stage CNNs Z Shou, D Wang, SF Chang Proceedings of the IEEE Conference on Computer Vision and Pattern …, 2016 | 1259 | 2016 |
| Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos Z Shou, J Chan, A Zareian, K Miyazawa, SF Chang Proceedings of the IEEE conference on computer vision and pattern …, 2017 | 723 | 2017 |
| Show-o: One single transformer to unify multimodal understanding and generation J Xie, W Mao, Z Bai, DJ Zhang, W Wang, KQ Lin, Y Gu, Z Chen, Z Yang, ... International Conference on Learning Representations 2025, 28240-28264, 2025 | 675 | 2025 |
| Single shot temporal action detection T Lin, X Zhao, Z Shou Proceedings of the 25th ACM international conference on Multimedia, 988-996, 2017 | 577 | 2017 |
| Convnet architecture search for spatiotemporal feature learning D Tran, J Ray, Z Shou, SF Chang, M Paluri arXiv preprint arXiv:1708.05038, 2017 | 569 | 2017 |
| Hallucination of multimodal large language models: A survey Z Bai, P Wang, T Xiao, T He, Z Han, Z Zhang, MZ Shou arXiv preprint arXiv:2404.18930, 2024 | 554 | 2024 |
| Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives K Grauman, A Westbury, L Torresani, K Kitani, J Malik, T Afouras, ... Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2024 | 544 | 2024 |
| Channel augmented joint learning for visible-infrared recognition M Ye, W Ruan, B Du, MZ Shou Proceedings of the IEEE/CVF international conference on computer vision …, 2021 | 479 | 2021 |
| Magicanimate: Temporally consistent human image animation using diffusion model Z Xu, J Zhang, JH Liew, H Yan, JW Liu, C Zhang, J Feng, MZ Shou Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2024 | 455 | 2024 |
| Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion J Xie, Y Li, Y Huang, H Liu, W Zhang, Y Zheng, MZ Shou Proceedings of the IEEE/CVF international conference on computer vision …, 2023 | 378 | 2023 |
| Show-1: Marrying pixel and latent diffusion models for text-to-video generation DJ Zhang, JZ Wu, JW Liu, R Zhao, L Ran, Y Gu, D Gao, MZ Shou International Journal of Computer Vision 133 (4), 1879-1893, 2025 | 363 | 2025 |
| Autoloc: Weakly-supervised temporal action localization in untrimmed videos Z Shou, H Gao, L Zhang, K Miyazawa, SF Chang Proceedings of the European Conference on Computer Vision (ECCV), 154-171, 2018 | 362 | 2018 |
| Egocentric video-language pretraining KQ Lin, J Wang, M Soldan, M Wray, R Yan, EZ Xu, D Gao, RC Tu, W Zhao, ... Advances in Neural Information Processing Systems 35, 7575-7586, 2022 | 332 | 2022 |
| Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models W Wu, Y Zhao, MZ Shou, H Zhou, C Shen Proceedings of the IEEE/CVF International Conference on Computer Vision …, 2023 | 329 | 2023 |
| Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models Y Gu, X Wang, JZ Wu, Y Shi, Y Chen, Z Fan, W Xiao, R Zhao, S Chang, ... Advances in Neural Information Processing Systems 36, 15890-15902, 2023 | 324 | 2023 |
| Univtg: Towards unified video-language temporal grounding KQ Lin, P Zhang, J Chen, S Pramanick, D Gao, AJ Wang, R Yan, MZ Shou Proceedings of the IEEE/CVF international conference on computer vision …, 2023 | 304 | 2023 |
| All in one: Exploring unified video-language pre-training J Wang, Y Ge, R Yan, Y Ge, KQ Lin, S Tsutsui, X Lin, G Cai, J Wu, Y Shan, ... Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2023 | 304 | 2023 |
| Is someone speaking? Exploring long-term temporal features for audio-visual active speaker detection R Tao, Z Pan, RK Das, X Qian, MZ Shou, H Li Proceedings of the 29th ACM international conference on multimedia, 3927-3935, 2021 | 286 | 2021 |