Proposal Semantic Relationship Graph Network for Temporal Action Detection

Published: 13 December 2024

Abstract

Temporal action detection, a critical task in video activity understanding, is typically divided into two stages: proposal generation and proposal classification. However, most existing methods overlook the transfer of information among proposals during classification, treating each proposal in isolation, which hampers accurate label prediction. In this article, we propose a novel method for inferring semantic relationships both within and between action proposals and for guiding the fusion of action proposal features accordingly. Building on this approach, we introduce the Proposal Semantic Relationship Graph Network (PSRGN), an end-to-end model that leverages intra-proposal semantic relationship graphs to extract cross-scale temporal context and an inter-proposal semantic relationship graph to incorporate complementary information from neighboring proposals, significantly improving proposal feature quality and overall detection performance. This is the first method to apply graph structure learning to temporal action detection, adaptively constructing the inter-proposal semantic graph. Extensive experiments on two datasets demonstrate the effectiveness of our approach, which achieves state-of-the-art (SOTA) performance. Code and results are available at http://github.com/Riiick2011/PSRGN.
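
To make the inter-proposal component of the abstract concrete, below is a minimal PyTorch sketch of adaptive graph structure learning over proposal features: a soft adjacency matrix is inferred from pairwise semantic similarity, and a single GCN-style aggregation lets each proposal absorb complementary information from its neighbors. This illustrates the general technique only, not the authors' PSRGN implementation; the class name, projection layers, and dimensions are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class InterProposalGraphLayer(nn.Module):
        """Sketch: learn a soft inter-proposal adjacency from feature
        similarity, then fuse neighbor features with one GCN-style step.
        Illustrative only; not the paper's actual architecture."""

        def __init__(self, dim: int):
            super().__init__()
            self.query = nn.Linear(dim, dim)  # projections used to score
            self.key = nn.Linear(dim, dim)    # pairwise semantic affinity
            self.gcn = nn.Linear(dim, dim)    # GCN-style weight matrix

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (N, dim) features for N action proposals.
            scores = self.query(x) @ self.key(x).t() / x.size(-1) ** 0.5
            adj = F.softmax(scores, dim=-1)   # row-normalized soft adjacency
            # Message passing: aggregate neighbors, transform, keep residual.
            return x + F.relu(self.gcn(adj @ x))

    # Toy usage: 8 proposals with 256-dim features.
    layer = InterProposalGraphLayer(256)
    fused = layer(torch.randn(8, 256))
    print(fused.shape)  # torch.Size([8, 256])

The row-softmax keeps the learned graph fully differentiable, so the adjacency itself is trained end-to-end with the rest of the network, which is the essence of graph structure learning as opposed to a fixed, hand-designed proposal graph.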



    Published In

    ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 6
    December 2024, 727 pages
    EISSN: 2157-6912
    DOI: 10.1145/3613712
    Editor: Huan Liu

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 December 2024
    Online AM: 28 October 2024
    Accepted: 12 October 2024
    Revised: 23 August 2024
    Received: 28 November 2023
    Published in TIST Volume 15, Issue 6


    Author Tags

    1. Temporal action detection
    2. proposal semantic relationship graph
    3. graph convolutional network
    4. graph structure learning

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Research and Development Program of China

