{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,22]],"date-time":"2026-04-22T17:54:35Z","timestamp":1776880475857,"version":"3.51.2"},"reference-count":74,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,2,6]],"date-time":"2023-02-06T00:00:00Z","timestamp":1675641600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62076262, 61673402, 61273270, 60802069"],"award-info":[{"award-number":["62076262, 61673402, 61273270, 60802069"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Science and Technology Program of Guangdong Province","award":["2021B1101270007, 2019B010140002"],"award-info":[{"award-number":["2021B1101270007, 2019B010140002"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2023,5,31]]},"abstract":"<jats:p>\n            Image multi-label classification task is mainly to correctly predict multiple object categories in the images. To capture the correlation between labels, graph convolution network based methods have to manually count the label co-occurrence probability from training data to construct a pre-defined graph as the input of graph network, which is inflexible and may degrade model generalizability. Moreover, most of the current methods cannot effectively align the learned salient object features with the label concepts, so that the predicted results of model may not be consistent with the image content. 
Therefore, learning the salient semantic features of images, capturing the correlation between labels, and then effectively aligning the two is one of the keys to improving the performance of image multi-label classification. To this end, we propose a novel image multi-label classification framework that aims to align\n            <jats:bold>I<\/jats:bold>\n            mage\n            <jats:bold>S<\/jats:bold>\n            emantics with\n            <jats:bold>L<\/jats:bold>\n            abel\n            <jats:bold>C<\/jats:bold>\n            oncepts (\n            <jats:bold>ISLC<\/jats:bold>\n            ). Specifically, we propose a residual encoder to learn salient object features in the images, and exploit the self-attention layer in the aligned decoder to automatically capture the correlation between labels. Then, we leverage the cross-attention layers in the aligned decoder to align image semantic features with label concepts, so that the labels predicted by the model are more consistent with the image content. Finally, the output features of the last layer of the residual encoder and the aligned decoder are fused to obtain the final output feature for classification. 
The proposed ISLC model achieves good performance on various prevalent multi-label image datasets such as MS-COCO 2014, PASCAL VOC 2007, VG-500, and NUS-WIDE with 87.2%, 96.9%, 39.4%, and 64.2%, respectively.\n          <\/jats:p>","DOI":"10.1145\/3550278","type":"journal-article","created":{"date-parts":[[2022,7,21]],"date-time":"2022-07-21T12:19:46Z","timestamp":1658405986000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":20,"title":["Aligning Image Semantics and Label Concepts for Image Multi-Label Classification"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9237-7205","authenticated-orcid":false,"given":"Wei","family":"Zhou","sequence":"first","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-sen University, Guangdong, People\u2019s Republic of China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7901-0698","authenticated-orcid":false,"given":"Zhiwu","family":"Xia","sequence":"additional","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-sen University, Guangdong, People\u2019s Republic of China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9802-1226","authenticated-orcid":false,"given":"Peng","family":"Dou","sequence":"additional","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-sen University, Guangdong, People\u2019s Republic of China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5227-1337","authenticated-orcid":false,"given":"Tao","family":"Su","sequence":"additional","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-sen University, Guangdong, People\u2019s Republic of China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4884-323X","authenticated-orcid":false,"given":"Haifeng","family":"Hu","sequence":"additional","affiliation":[{"name":"School of Electronics and Information Technology, Sun Yat-sen University, Guangdong, 
People\u2019s Republic of China"}]}],"member":"320","published-online":{"date-parts":[[2023,2,6]]},"reference":[{"key":"e_1_3_1_2_2","first-page":"9","volume-title":"Proceedings of the CVPR Workshops","author":"Cevikalp Hakan","year":"2019","unstructured":"Hakan Cevikalp, Burak Benligiray, \u00d6mer Nezih Gerek, and Hasan Saribas. 2019. Semi-supervised robust deep neural networks for multi-label classification. In Proceedings of the CVPR Workshops. 9\u201317."},{"key":"e_1_3_1_3_2","article-title":"Knowledge-guided multi-label few-shot learning for general image recognition","author":"Chen Tianshui","year":"2020","unstructured":"Tianshui Chen, Liang Lin, Xiaolu Hui, Riquan Chen, and Hefeng Wu. 2020. Knowledge-guided multi-label few-shot learning for general image recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 3 (2020), 1371\u20131384.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00061"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00532"},{"key":"e_1_3_1_6_2","unstructured":"Xiangxiang Chu Bo Zhang Zhi Tian Xiaolin Wei and Huaxia Xia. 2021. Do we really need explicit position encodings for vision transformers? arXiv:2102.10882. Retrieved from https:\/\/arxiv.org\/abs\/2102.10882."},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/1646396.1646452"},{"key":"e_1_3_1_8_2","unstructured":"Zihang Dai Zhilin Yang Yiming Yang Jaime Carbonell Quoc V. Le and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv:1901.02860. 
Retrieved from https:\/\/arxiv.org\/abs\/1901.02860."},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_1_10_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly Jakob Uszkoreit and Neil Houlsby. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929."},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.631"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2788435"},{"key":"e_1_3_1_13_2","first-page":"191","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Dutta Ayushi","year":"2020","unstructured":"Ayushi Dutta, Yashaswi Verma, and C. V. Jawahar. 2020. Recurrent image annotation with explicit inter-label dependencies. In Proceedings of the European Conference on Computer Vision. Springer, 191\u2013207."},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-009-0275-4"},{"key":"e_1_3_1_15_2","doi-asserted-by":"crossref","unstructured":"Bin-Bin Gao and Hong-Yu Zhou. 2021. Learning to discover multi-class attentional regions for multi-label image recognition. IEEE Transactions on Image Processing 30 6 (2021) 5920\u20135932.","DOI":"10.1109\/TIP.2021.3088605"},{"key":"e_1_3_1_16_2","unstructured":"Yunchao Gong Yangqing Jia Thomas Leung Alexander Toshev and Sergey Ioffe. 2013. Deep convolutional ranking for multilabel image annotation. arXiv:1312.4894. 
Retrieved from https:\/\/arxiv.org\/abs\/1312.4894."},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00082"},{"key":"e_1_3_1_18_2","first-page":"10885","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"34","author":"Guo Jinyang","year":"2020","unstructured":"Jinyang Guo, Wanli Ouyang, and Dong Xu. 2020. Channel pruning guided by classification loss and feature importance. In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34, 10885\u201310892."},{"key":"e_1_3_1_19_2","first-page":"1508","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Guo Jinyang","year":"2020","unstructured":"Jinyang Guo, Wanli Ouyang, and Dong Xu. 2020. Multi-dimensional pruning: A unified framework for model compression. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 1508\u20131517."},{"issue":"3","key":"e_1_3_1_20_2","first-page":"1114","article-title":"Model compression using progressive channel pruning","volume":"31","author":"Guo Jinyang","year":"2020","unstructured":"Jinyang Guo, Weichen Zhang, Wanli Ouyang, and Dong Xu. 2020. Model compression using progressive channel pruning. IEEE Transactions on Circuits and Systems for Video Technology 31, 3 (2020), 1114\u20131124.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3436494"},{"key":"e_1_3_1_22_2","doi-asserted-by":"crossref","first-page":"103448","DOI":"10.1016\/j.jvcir.2022.103448","article-title":"Learning discriminative representations for multi-label image recognition","volume":"83","author":"Hassanin Mohammed","year":"2022","unstructured":"Mohammed Hassanin, Ibrahim Radwan, Salman Khan, and Murat Tahtali. 2022. Learning discriminative representations for multi-label image recognition. 
Journal of Visual Communication and Image Representation 83, C (2022), 103448.","journal-title":"Journal of Visual Communication and Image Representation"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_24_2","unstructured":"Ruining He Anirudh Ravula Bhargav Kanagal and Joshua Ainslie. 2020. RealFormer: Transformer likes residual attention. arXiv:2012.11747. Retrieved from https:\/\/arxiv.org\/abs\/2012.11747."},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3446208"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3388861"},{"key":"e_1_3_1_27_2","unstructured":"Zhicheng Huang Zhaoyang Zeng Bei Liu Dongmei Fu and Jianlong Fu. 2020. Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. arXiv:2004.00849. Retrieved from https:\/\/arxiv.org\/abs\/2004.00849."},{"issue":"2","key":"e_1_3_1_28_2","first-page":"1","article-title":"A multi-instance multi-label dual learning approach for video captioning","volume":"17","author":"Ji Wanting","year":"2021","unstructured":"Wanting Ji and Ruili Wang. 2021. A multi-instance multi-label dual learning approach for video captioning. ACM Transactions on Multimedia Computing Communications and Applications 17, 2s (2021), 1\u201318.","journal-title":"ACM Transactions on Multimedia Computing Communications and Applications"},{"key":"e_1_3_1_29_2","first-page":"2452","volume-title":"Proceedings of the 2016 23rd International Conference on Pattern Recognition","author":"Jin Jiren","year":"2016","unstructured":"Jiren Jin and Hideki Nakayama. 2016. Annotation order matters: Recurrent image annotator for arbitrary length image tagging. In Proceedings of the 2016 23rd International Conference on Pattern Recognition. 
IEEE, 2452\u20132457."},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_1_31_2","first-page":"16478","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Lanchantin Jack","year":"2021","unstructured":"Jack Lanchantin, Tianlu Wang, Vicente Ordonez, and Yanjun Qi. 2021. General multi-label image classification with transformers. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 16478\u201316488."},{"key":"e_1_3_1_32_2","doi-asserted-by":"crossref","unstructured":"Duo Li Anbang Yao and Qifeng Chen. 2020. PSConv: Squeezing feature pyramid into one compact poly-scale convolutional layer. In Proceedings of the European Conference on Computer Vision . Springer 615\u2013632.","DOI":"10.1007\/978-3-030-58589-1_37"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-61609-0_58"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3359753"},{"key":"e_1_3_1_35_2","unstructured":"Qing Li Xiaojiang Peng Yu Qiao and Qiang Peng. 2019. Learning category correlations for multi-label image recognition with graph networks. arXiv:1909.13005. Retrieved from https:\/\/arxiv.org\/abs\/1909.13005."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3426974"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_1_38_2","first-page":"1682","volume-title":"Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing","author":"Liu Luchen","year":"2019","unstructured":"Luchen Liu, Sheng Guo, Weilin Huang, and Matthew R. Scott. 2019. Decoupling category-wise independence and relevance with self-attention for multi-label image classification. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. 
IEEE, 1682\u20131686."},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2651061"},{"key":"e_1_3_1_40_2","first-page":"1","volume-title":"Proceedings of the 2018 IEEE International Smart Cities Conference","author":"Lyu Fan","year":"2018","unstructured":"Fan Lyu, Fuyuan Hu, Victor S. Sheng, Zhengtian Wu, Qiming Fu, and Baochuan Fu. 2018. Coarse to fine: Multi-label image classification with global\/local attention. In Proceedings of the 2018 IEEE International Smart Cities Conference. IEEE, 1\u20137."},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2019.2894964"},{"key":"e_1_3_1_42_2","first-page":"2579","article-title":"Visualizing data using t-SNE","volume":"9","author":"Maaten Laurens van der","year":"2008","unstructured":"Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579\u20132605.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3338533.3366589"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i10.17098"},{"key":"e_1_3_1_45_2","unstructured":"Tao Pu Lixian Yuan Hefeng Wu Tianshui Chen Ling Tian and Liang Lin. 2022. Semantic representation and dependency learning for multi-label image recognition. arXiv:2204.03795. Retrieved from https:\/\/arxiv.org\/abs\/2204.03795."},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.74"},{"key":"e_1_3_1_47_2","first-page":"1","article-title":"An attention-driven multi-label image classification with semantic embedding and graph convolutional networks","author":"Sun Dengdi","year":"2022","unstructured":"Dengdi Sun, Leilei Ma, Zhuanlian Ding, and Bin Luo. 2022. An attention-driven multi-label image classification with semantic embedding and graph convolutional networks. 
Cognitive Computation 9, 1 (2022), 1\u201312.","journal-title":"Cognitive Computation"},{"key":"e_1_3_1_48_2","first-page":"5998","volume-title":"Proceedings of the 31st International Conference on Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 5998\u20136008."},{"key":"e_1_3_1_49_2","unstructured":"Petar Veli\u010dkovi\u0107 Guillem Cucurull Arantxa Casanova Adriana Romero Pietro Lio and Yoshua Bengio. 2017. Graph attention networks. arXiv:1710.10903. Retrieved from https:\/\/arxiv.org\/abs\/1710.10903."},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3414047"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.251"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00813"},{"key":"e_1_3_1_53_2","first-page":"1","volume-title":"Proceedings of the 2021 IEEE International Conference on Multimedia and Expo","author":"Wang Xiaomei","year":"2021","unstructured":"Xiaomei Wang, Yaqian Li, Tong Luo, Yandong Guo, Yanwei Fu, and Xiangyang Xue. 2021. Distance restricted transformer encoder for multi-label classification. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo. IEEE, 1\u20136."},{"key":"e_1_3_1_54_2","first-page":"12265","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"34","author":"Wang Ya","year":"2020","unstructured":"Ya Wang, Dongliang He, Fu Li, Xiang Long, Zhichao Zhou, Jinwen Ma, and Shilei Wen. 2020. Multi-label classification with label graph superimposing. In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 
34, 12265\u201312272."},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/3340531.3411880"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.58"},{"key":"e_1_3_1_57_2","article-title":"Semantic supplementary network with prior information for multi-label image classification","author":"Wang Zhe","year":"2021","unstructured":"Zhe Wang, Zhongli Fang, Dongdong Li, Hai Yang, and Wenli Du. 2021. Semantic supplementary network with prior information for multi-label image classification. IEEE Transactions on Circuits and Systems for Video Technology 32, 4 (2021), 1848\u20131859.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3414046"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.634"},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2929512"},{"key":"e_1_3_1_61_2","first-page":"280","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Yang Hao","year":"2016","unstructured":"Hao Yang, Joey Tianyi Zhou, Yu Zhang, Bin-Bin Gao, Jianxin Wu, and Jianfei Cai. 2016. Exploit bounding box annotations for multi-label object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 280\u2013288."},{"key":"e_1_3_1_62_2","first-page":"13440","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Yazici Vacit Oguz","year":"2020","unstructured":"Vacit Oguz Yazici, Abel Gonzalez-Garcia, Arnau Ramisa, Bartlomiej Twardowski, and Joost van de Weijer. 2020. Orderless recurrent models for multi-label classification. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 
13440\u201313449."},{"key":"e_1_3_1_63_2","first-page":"649","volume-title":"Proceedings of the 16th European Conference on Computer Vision","author":"Ye Jin","year":"2020","unstructured":"Jin Ye, Junjun He, Xiaojiang Peng, Wenhao Wu, and Yu Qiao. 2020. Attention-driven dynamic graph convolutional network for multi-label image recognition. In Proceedings of the 16th European Conference on Computer Vision. Springer, 649\u2013665."},{"key":"e_1_3_1_64_2","first-page":"12709","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"You Renchun","year":"2020","unstructured":"Renchun You, Zhiyao Guo, Lei Cui, Xiang Long, Yingze Bao, and Shilei Wen. 2020. Cross-modality attention with semantic graph embedding for multi-label classification. In Proceedings of the AAAI Conference on Artificial Intelligence. 12709\u201312716."},{"key":"e_1_3_1_65_2","doi-asserted-by":"crossref","first-page":"322","DOI":"10.1016\/j.patcog.2019.03.006","article-title":"DELTA: A deep dual-stream network for multi-label image classification","volume":"91","author":"Yu Wan-Jin","year":"2019","unstructured":"Wan-Jin Yu, Zhen-Duo Chen, Xin Luo, Wu Liu, and Xin-Shun Xu. 2019. DELTA: A deep dual-stream network for multi-label image classification. Pattern Recognition 91, C (2019), 322\u2013331.","journal-title":"Pattern Recognition"},{"key":"e_1_3_1_66_2","doi-asserted-by":"crossref","unstructured":"Kun Yuan Shaopeng Guo Ziwei Liu Aojun Zhou Fengwei Yu and Wei Wu. 2021. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition . 
579\u2013588.","DOI":"10.1109\/ICCV48922.2021.00062"},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2018.2812605"},{"key":"e_1_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.3044446"},{"key":"e_1_3_1_69_2","first-page":"163","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Zhao Jiawei","year":"2021","unstructured":"Jiawei Zhao, Ke Yan, Yifan Zhao, Xiaowei Guo, Feiyue Huang, and Jia Li. 2021. Transformer-based dual relation graph for multi-label image recognition. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 163\u2013172."},{"issue":"12","key":"e_1_3_1_70_2","doi-asserted-by":"crossref","first-page":"4735","DOI":"10.1109\/TCSVT.2021.3102025","article-title":"Transformer3D-Det: Improving 3D object detection by vote refinement","volume":"31","author":"Zhao Lichen","year":"2021","unstructured":"Lichen Zhao, Jinyang Guo, Dong Xu, and Lu Sheng. 2021. Transformer3D-Det: Improving 3D object detection by vote refinement. IEEE Transactions on Circuits and Systems for Video Technology 31, 12 (2021), 4735\u20134746.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_1_71_2","doi-asserted-by":"crossref","unstructured":"Sixiao Zheng Jiachen Lu Hengshuang Zhao Xiatian Zhu Zekun Luo Yabiao Wang Yanwei Fu Jianfeng Feng Tao Xiang Philip H. S. Torr and Li Zhang. 2020. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition . 6881\u20136890.","DOI":"10.1109\/CVPR46437.2021.00681"},{"key":"e_1_3_1_72_2","article-title":"Multi-label image classification via category prototype compositional learning","author":"Zhou Fengtao","year":"2021","unstructured":"Fengtao Zhou, Sheng Huang, Bo Liu, and Dan Yang. 2021. Multi-label image classification via category prototype compositional learning. 
IEEE Transactions on Circuits and Systems for Video Technology 32, 7 (2021), 4513\u20134525.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.1145\/3519030"},{"key":"e_1_3_1_74_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.219"},{"key":"e_1_3_1_75_2","first-page":"184","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Zhu Ke","year":"2021","unstructured":"Ke Zhu and Jianxin Wu. 2021. Residual attention: A simple but effective method for multi-label recognition. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 184\u2013193."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3550278","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3550278","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T18:43:23Z","timestamp":1750272203000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3550278"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,6]]},"references-count":74,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,5,31]]}},"alternative-id":["10.1145\/3550278"],"URL":"https:\/\/doi.org\/10.1145\/3550278","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,6]]},"assertion":[{"value":"2022-02-28","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2022-07-19","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-02-06","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}