Proposal Semantic Relationship Graph Network for Temporal Action Detection

Published: 13 December 2024

Abstract

Temporal action detection, a critical task in video activity understanding, is typically divided into two stages: proposal generation and proposal classification. However, most existing methods overlook the transfer of information among proposals during classification, treating each proposal in isolation, which hampers accurate label prediction. In this article, we propose a novel method for inferring semantic relationships both within and between action proposals and for guiding the fusion of action proposal features accordingly. Building on this approach, we introduce the Proposal Semantic Relationship Graph Network (PSRGN), an end-to-end model that leverages intra-proposal semantic relationship graphs to extract cross-scale temporal context and an inter-proposal semantic relationship graph to incorporate complementary information from neighboring proposals, significantly improving proposal feature quality and overall detection performance. This is the first method to apply graph structure learning to temporal action detection, adaptively constructing the inter-proposal semantic graph. Extensive experiments on two datasets demonstrate the effectiveness of our approach, which achieves state-of-the-art (SOTA) performance. Code and results are available at http://github.com/Riiick2011/PSRGN.
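
To make the inter-proposal component of the abstract concrete, below is a minimal PyTorch sketch of adaptive graph structure learning over proposal features: a soft adjacency matrix is inferred from pairwise semantic similarity, and a single GCN-style aggregation lets each proposal absorb complementary information from its neighbors. This illustrates the general technique only, not the authors' PSRGN implementation; the class name, projection layers, and dimensions are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class InterProposalGraphLayer(nn.Module):
        """Sketch: learn a soft inter-proposal adjacency from feature
        similarity, then fuse neighbor features with one GCN-style step.
        Illustrative only; not the paper's actual architecture."""

        def __init__(self, dim: int):
            super().__init__()
            self.query = nn.Linear(dim, dim)  # projections used to score
            self.key = nn.Linear(dim, dim)    # pairwise semantic affinity
            self.gcn = nn.Linear(dim, dim)    # GCN-style weight matrix

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (N, dim) features for N action proposals.
            scores = self.query(x) @ self.key(x).t() / x.size(-1) ** 0.5
            adj = F.softmax(scores, dim=-1)   # row-normalized soft adjacency
            # Message passing: aggregate neighbors, transform, keep residual.
            return x + F.relu(self.gcn(adj @ x))

    # Toy usage: 8 proposals with 256-dim features.
    layer = InterProposalGraphLayer(256)
    fused = layer(torch.randn(8, 256))
    print(fused.shape)  # torch.Size([8, 256])

The row-softmax keeps the learned graph fully differentiable, so the adjacency itself is trained end-to-end with the rest of the network, which is the essence of graph structure learning as opposed to a fixed, hand-designed proposal graph.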



    Published In

    ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 6
    December 2024, 727 pages
    EISSN: 2157-6912
    DOI: 10.1145/3613712
    Editor: Huan Liu

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 December 2024
    Online AM: 28 October 2024
    Accepted: 12 October 2024
    Revised: 23 August 2024
    Received: 28 November 2023
    Published in TIST Volume 15, Issue 6


    Author Tags

    1. Temporal action detection
    2. proposal semantic relationship graph
    3. graph convolutional network
    4. graph structure learning

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Research and Development Program of China

