{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,10]],"date-time":"2026-04-10T10:00:12Z","timestamp":1775815212396,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":71,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,10,17]],"date-time":"2021-10-17T00:00:00Z","timestamp":1634428800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"the Key Research Program of Frontier Sciences, CAS","award":["ZDBS-LY-7024"],"award-info":[{"award-number":["ZDBS-LY-7024"]}]},{"name":"the Beijing Municipal Science & Technology Commission","award":["Z191100007119002"],"award-info":[{"award-number":["Z191100007119002"]}]},{"name":"the National Natural Science Foundation of China","award":["62006221"],"award-info":[{"award-number":["62006221"]}]},{"name":"CAAI-Huawei MindSpore Open Fund"},{"name":"the Open Research Project of the State Key Laboratory of Media Convergence and Communication, Communication University of China, China","award":["SKLMCC2020KF004"],"award-info":[{"award-number":["SKLMCC2020KF004"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,10,17]]},"DOI":"10.1145\/3474085.3475606","type":"proceedings-article","created":{"date-parts":[[2021,10,18]],"date-time":"2021-10-18T06:57:34Z","timestamp":1634540254000},"page":"376-385","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":36,"title":["Beyond OCR + VQA"],"prefix":"10.1145","author":[{"given":"Gangyan","family":"Zeng","sequence":"first","affiliation":[{"name":"Communication University of China, Beijing, China"}]},{"given":"Yuan","family":"Zhang","sequence":"additional","affiliation":[{"name":"Communication University of China, Beijing, 
China"}]},{"given":"Yu","family":"Zhou","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences & University of Chinese Academy of Sciences, Beijing, China"}]},{"given":"Xiaomeng","family":"Yang","sequence":"additional","affiliation":[{"name":"Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2021,10,17]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2014.2339814"},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_3_2_2_4_1","volume-title":"Jamie Ryan Kiros, and Geoffrey E Hinton","author":"Ba Jimmy Lei","year":"2016","unstructured":"Jimmy Lei Ba , Jamie Ryan Kiros, and Geoffrey E Hinton 2016 . Layer normalization. arXiv preprint arXiv:1607.06450 (2016). Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016)."},{"key":"e_1_3_2_2_5_1","volume-title":"Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473","author":"Bahdanau Dzmitry","year":"2014","unstructured":"Dzmitry Bahdanau , Kyunghyun Cho , and Yoshua Bengio . 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 ( 2014 ). Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473 (2014)."},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00439"},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3219861"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR48806.2021.9412558"},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-29894-4_11"},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"e_1_3_2_2_11_1","volume-title":"Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 1724--1734","author":"Cho Kyunghyun","year":"2016","unstructured":"Kyunghyun Cho , Bart Van Merri\u00ebnboer , Caglar Gulcehre , Dzmitry Bahdanau , Fethi Bougares , Holger Schwenk , and Yoshua Bengio . 2016 . Learning phrase representations using RNN encoder-decoder for statistical machine translation . In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 1724--1734 . Kyunghyun Cho, Bart Van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2016. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 1724--1734."},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_2_13_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies","volume":"1","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . Bert: Pre-training of deep bidirectional transformers for language understanding . 
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies , Volume 1 (Long and Short Papers). Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies, Volume 1 (Long and Short Papers)."},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1044"},{"key":"e_1_3_2_2_15_1","volume-title":"2020 b. Structured multimodal attentions for textvqa. arXiv preprint arXiv:2006.00753","author":"Gao Chenyu","year":"2020","unstructured":"Chenyu Gao , Qi Zhu , Peng Wang , Hui Li , Yuliang Liu , Anton van den Hengel , and Qi Wu . 2020 b. Structured multimodal attentions for textvqa. arXiv preprint arXiv:2006.00753 ( 2020 ). Chenyu Gao, Qi Zhu, Peng Wang, Hui Li, Yuliang Liu, Anton van den Hengel, and Qi Wu. 2020 b. Structured multimodal attentions for textvqa. 
arXiv preprint arXiv:2006.00753 (2020)."},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01276"},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00680"},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.147"},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.254"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00380"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.coling-main.278"},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01001"},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0823-z"},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01028"},{"key":"e_1_3_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/E17-2068"},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58545-7_41"},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2015.7333942"},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2013.221"},{"key":"e_1_3_2_2_31_1","volume-title":"International Conference on Learning Representations (ICLR). 4190--4198","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba . 2014 . Adam: A method for stochastic optimization . In International Conference on Learning Representations (ICLR). 4190--4198 . Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR). 
4190--4198."},{"key":"e_1_3_2_2_32_1","volume-title":"et al.","author":"Krasin Ivan","year":"2017","unstructured":"Ivan Krasin , Tom Duerig , Neil Alldrin , Vittorio Ferrari , Sami Abu-El-Haija , Alina Kuznetsova , Hassan Rom , Jasper Uijlings , Stefan Popov , Andreas Veit , et al. 2017 . Openimages : A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https:\/\/github. com\/openimages, Vol. 2, 3 (2017), 18. Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, et al. 2017. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https:\/\/github. com\/openimages, Vol. 2, 3 (2017), 18."},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33018610"},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.01041"},{"key":"e_1_3_2_2_36_1","volume-title":"Video 3D Sampling for Self-supervised Representation Learning. arXiv preprint arXiv:2107.03578","author":"Li Wei","year":"2021","unstructured":"Wei Li , Dezhao Luo , Bo Fang , Yu Zhou , and Weiping Wang . 2021. Video 3D Sampling for Self-supervised Representation Learning. arXiv preprint arXiv:2107.03578 ( 2021 ). Wei Li, Dezhao Luo, Bo Fang, Yu Zhou, and Weiping Wang. 2021. Video 3D Sampling for Self-supervised Representation Learning. 
arXiv preprint arXiv:2107.03578 (2021)."},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_8"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413924"},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00983"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.5555\/3367243.3367463"},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.5555\/3454287.3454289"},{"key":"e_1_3_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01045"},{"key":"e_1_3_2_2_43_1","volume-title":"2020 a. Exploring relations in untrimmed videos for self-supervised learning. arXiv preprint arXiv:2008.02711","author":"Luo Dezhao","year":"2020","unstructured":"Dezhao Luo , Bo Fang , Yu Zhou , Yucan Zhou , Dayan Wu , and Weiping Wang . 2020 a. Exploring relations in untrimmed videos for self-supervised learning. arXiv preprint arXiv:2008.02711 ( 2020 ). Dezhao Luo, Bo Fang, Yu Zhou, Yucan Zhou, Dayan Wu, and Weiping Wang. 2020 a. Exploring relations in untrimmed videos for self-supervised learning. arXiv preprint arXiv:2008.02711 (2020)."},{"key":"e_1_3_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6840"},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.378"},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2019.00156"},{"key":"e_1_3_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_3_2_2_48_1","volume-title":"Gaussian Constrained Attention Network for Scene Text Recognition. In International Conference on Pattern Recognition (ICPR). 3328--3335","author":"Qiao Zhi","year":"2021","unstructured":"Zhi Qiao , Xugong Qin , Yu Zhou , Fei Yang , and Weiping Wang . 2021 . Gaussian Constrained Attention Network for Scene Text Recognition. In International Conference on Pattern Recognition (ICPR). 3328--3335 . 
Zhi Qiao, Xugong Qin, Yu Zhou, Fei Yang, and Weiping Wang. 2021. Gaussian Constrained Attention Network for Scene Text Recognition. In International Conference on Pattern Recognition (ICPR). 3328--3335."},{"key":"e_1_3_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01354"},{"key":"e_1_3_2_2_50_1","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4350--4354","author":"Qin Xugong","year":"2021","unstructured":"Xugong Qin , Yu Zhou , Youhui Guo , Dayan Wu , and Weiping Wang . 2021 . FC 2 RN: A Fully Convolutional Corner Refinement Network for Accurate Multi-Oriented Scene Text Detection . In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4350--4354 . Xugong Qin, Yu Zhou, Youhui Guo, Dayan Wu, and Weiping Wang. 2021. FC 2 RN: A Fully Convolutional Corner Refinement Network for Accurate Multi-Oriented Scene Text Detection. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4350--4354."},{"key":"e_1_3_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDAR.2019.00095"},{"key":"e_1_3_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2577031"},{"key":"e_1_3_2_2_53_1","volume-title":"Asian Conference on Computer Vision (ACCV). 68--82","author":"Sabir Ahmed","year":"2018","unstructured":"Ahmed Sabir , Francesc Moreno-Noguer , and Llu\u00eds Padr\u00f3 . 2018 . Visual re-ranking with natural language understanding for text spotting . In Asian Conference on Computer Vision (ACCV). 68--82 . Ahmed Sabir, Francesc Moreno-Noguer, and Llu\u00eds Padr\u00f3. 2018. Visual re-ranking with natural language understanding for text spotting. In Asian Conference on Computer Vision (ACCV). 
68--82."},{"key":"e_1_3_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1346"},{"key":"e_1_3_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2646371"},{"key":"e_1_3_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2848939"},{"key":"e_1_3_2_2_57_1","unstructured":"Amanpreet Singh Vivek Natarajan Yu Jiang Xinlei Chen Meet Shah Marcus Rohrbach Dhruv Batra and Devi Parikh. 2018. Pythia-a platform for vision & language research. In SysML Workshop Annual Conference on Neural Information Processing Systems (NeurIPS Workshop).  Amanpreet Singh Vivek Natarajan Yu Jiang Xinlei Chen Meet Shah Marcus Rohrbach Dhruv Batra and Devi Parikh. 2018. Pythia-a platform for vision & language research. In SysML Workshop Annual Conference on Neural Information Processing Systems (NeurIPS Workshop)."},{"key":"e_1_3_2_2_58_1","doi-asserted-by":"crossref","unstructured":"Amanpreet Singh Vivek Natarajan Meet Shah Yu Jiang Xinlei Chen Dhruv Batra Devi Parikh and Marcus Rohrbach. 2019 b. Towards vqa models that can read. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 8317--8326.  Amanpreet Singh Vivek Natarajan Meet Shah Yu Jiang Xinlei Chen Dhruv Batra Devi Parikh and Marcus Rohrbach. 2019 b. Towards vqa models that can read. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 8317--8326.","DOI":"10.1109\/CVPR.2019.00851"},{"key":"e_1_3_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00470"},{"key":"e_1_3_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46484-8_4"},{"key":"e_1_3_2_2_61_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_3_2_2_62_1","volume-title":"Coco-text: Dataset and benchmark for text detection and recognition in natural images. 
arXiv preprint arXiv:1601.07140","author":"Veit Andreas","year":"2016","unstructured":"Andreas Veit , Tomas Matera , Lukas Neumann , Jiri Matas , and Serge Belongie . 2016 . Coco-text: Dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140 (2016). Andreas Veit, Tomas Matera, Lukas Neumann, Jiri Matas, and Serge Belongie. 2016. Coco-text: Dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140 (2016)."},{"key":"e_1_3_2_2_63_1","volume-title":"A simple and robust convolutional-attention network for irregular text recognition. arXiv preprint arXiv:1904.01375","author":"Wang Peng","year":"2019","unstructured":"Peng Wang , Lu Yang , Hui Li , Yuyan Deng , Chunhua Shen , and Yanning Zhang . 2019. A simple and robust convolutional-attention network for irregular text recognition. arXiv preprint arXiv:1904.01375 , Vol. 6 ( 2019 ). Peng Wang, Lu Yang, Hui Li, Yuyan Deng, Chunhua Shen, and Yanning Zhang. 2019. A simple and robust convolutional-attention network for irregular text recognition. arXiv preprint arXiv:1904.01375, Vol. 6 (2019)."},{"key":"e_1_3_2_2_64_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10126--10135","author":"Wang Xinyu","unstructured":"Xinyu Wang , Yuliang Liu , Chunhua Shen , Chun Chet Ng , Canjie Luo , Lianwen Jin , Chee Seng Chan , Anton van den Hengel, and Liangwei Wang. 2020. On the general value of evidence, and bilingual scene-text visual question answering . In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10126--10135 . Xinyu Wang, Yuliang Liu, Chunhua Shen, Chun Chet Ng, Canjie Luo, Lianwen Jin, Chee Seng Chan, Anton van den Hengel, and Liangwei Wang. 2020. On the general value of evidence, and bilingual scene-text visual question answering. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 
10126--10135."},{"key":"e_1_3_2_2_65_1","volume-title":"Multi-View Correlation Distillation for Incremental Object Detection. arXiv preprint arXiv:2107.01787","author":"Yang Dongbao","year":"2021","unstructured":"Dongbao Yang , Yu Zhou , and Weiping Wang . 2021. Multi-View Correlation Distillation for Incremental Object Detection. arXiv preprint arXiv:2107.01787 ( 2021 ). Dongbao Yang, Yu Zhou, and Weiping Wang. 2021. Multi-View Correlation Distillation for Incremental Object Detection. arXiv preprint arXiv:2107.01787 (2021)."},{"key":"e_1_3_2_2_66_1","volume-title":"Two-Level Residual Distillation based Triple Network for Incremental Object Detection. arXiv preprint arXiv:2007.13428","author":"Yang Dongbao","year":"2020","unstructured":"Dongbao Yang , Yu Zhou , Dayan Wu , Can Ma , Fei Yang , and Weiping Wang . 2020. Two-Level Residual Distillation based Triple Network for Incremental Object Detection. arXiv preprint arXiv:2007.13428 ( 2020 ). Dongbao Yang, Yu Zhou, Dayan Wu, Can Ma, Fei Yang, and Weiping Wang. 2020. Two-Level Residual Distillation based Triple Network for Incremental Object Detection. arXiv preprint arXiv:2007.13428 (2020)."},{"key":"e_1_3_2_2_67_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00658"},{"key":"e_1_3_2_2_68_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR48806.2021.9412301"},{"key":"e_1_3_2_2_69_1","volume-title":"2021 b. Exploring Instance Relations for Unsupervised Feature Embedding. arXiv preprint arXiv:2105.03341","author":"Zhang Yifei","year":"2021","unstructured":"Yifei Zhang , Yu Zhou , and Weiping Wang . 2021 b. Exploring Instance Relations for Unsupervised Feature Embedding. arXiv preprint arXiv:2105.03341 ( 2021 ). Yifei Zhang, Yu Zhou, and Weiping Wang. 2021 b. Exploring Instance Relations for Unsupervised Feature Embedding. 
arXiv preprint arXiv:2105.03341 (2021)."},{"key":"e_1_3_2_2_70_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.283"},{"key":"e_1_3_2_2_71_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).","author":"Zhu Qi","year":"2021","unstructured":"Qi Zhu , Chenyu Gao , Peng Wang , and Qi Wu . 2021 . Simple is not easy: A simple strong baseline for TextVQA and TextCaps . In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). Qi Zhu, Chenyu Gao, Peng Wang, and Qi Wu. 2021. Simple is not easy: A simple strong baseline for TextVQA and TextCaps. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)."}],"event":{"name":"MM '21: ACM Multimedia Conference","location":"Virtual Event China","acronym":"MM '21","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 29th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3475606","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3474085.3475606","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:24Z","timestamp":1750193304000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3475606"}},"subtitle":["Involving OCR into the Flow for Robust and Accurate TextVQA"],"short-title":[],"issued":{"date-parts":[[2021,10,17]]},"references-count":71,"alternative-id":["10.1145\/3474085.3475606","10.1145\/3474085"],"URL":"https:\/\/doi.org\/10.1145\/3474085.3475606","relation":{},"subject":[],"published":{"date-parts":[[2021,10,17]]},"assertion":[{"value":"2021-10-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}