
Perceiving Actions via Temporal Video Frame Pairs

Published: 17 May 2024

Abstract

Video action recognition aims to classify the action category in a given video. In general, semantically relevant video frame pairs reflect significant action patterns, such as variation in object appearance, as well as abstract temporal concepts such as speed and rhythm. However, existing action recognition approaches tend to extract spatiotemporal features holistically. Though effective, they risk neglecting crucial action features that occur across frames separated by a long temporal span. Motivated by this, in this article we propose to perceive actions via frame pairs directly and devise a novel Nest Structure with frame pairs as basic units. Specifically, we decompose a video sequence into all possible frame pairs and hierarchically organize them according to temporal frequency and order, thus transforming the original video sequence into a Nest Structure. By naturally decomposing actions, the proposed structure can flexibly adapt to diverse action variations such as changes in speed or rhythm. Next, we devise a Temporal Pair Analysis (TPA) module to extract discriminative action patterns based on the proposed Nest Structure. The TPA module consists of a pair calculation part that computes pair features and a pair fusion part that hierarchically fuses them for recognizing actions. TPA can be flexibly integrated into existing backbones, serving as a side branch that captures various action patterns from multi-level features. Extensive experiments show that the proposed TPA module achieves consistent improvements over several typical backbones, matching or surpassing CNN-based state-of-the-art results on several challenging action recognition benchmarks.
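To make the decomposition concrete, the following minimal Python sketch enumerates all frame pairs of a sampled sequence and groups them hierarchically by temporal span. This is an illustration under assumptions, not the authors' released code: the function name `build_nest_structure` is hypothetical, and grouping by pair span is used here as a stand-in for the paper's organization by "temporal frequency and order."

```python
from itertools import combinations
from collections import defaultdict

def build_nest_structure(num_frames: int):
    """Decompose a sequence of frame indices into all possible frame pairs
    and group them by temporal span (an assumed proxy for the paper's
    'temporal frequency'); within each level, pairs keep temporal order.
    Illustrative sketch only, not the authors' implementation."""
    nest = defaultdict(list)
    for i, j in combinations(range(num_frames), 2):  # all pairs with i < j
        nest[j - i].append((i, j))  # level = distance between the two frames
    # Small-span levels capture short-term motion; large-span levels capture
    # long-range patterns such as overall speed or rhythm changes.
    return dict(sorted(nest.items()))

if __name__ == "__main__":
    for span, pairs in build_nest_structure(8).items():
        print(f"span {span}: {pairs}")
```

In the paper's pipeline, each level of such a hierarchy would then feed the pair calculation part, and the resulting pair features would be fused level by level by the pair fusion part; the sketch stops at the structural decomposition itself.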


Cited By

  • (2024) C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition. In Computer Vision – ECCV 2024, 369–388. DOI: 10.1007/978-3-031-72920-1_21. Online publication date: 29-Sep-2024.


Published In

ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 3
June 2024, 646 pages
EISSN: 2157-6912
DOI: 10.1145/3613609
Editor: Huan Liu

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 May 2024
Online AM: 17 March 2024
Accepted: 21 February 2024
Revised: 24 December 2023
Received: 03 March 2023
Published in TIST Volume 15, Issue 3

Author Tags

  1. Video understanding
  2. action recognition
  3. temporal modelling

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Program of China
  • National Natural Science Foundation of China
  • 111 Project of Ministry of Education of China
  • Engineering and Physical Sciences Research Council

