{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T00:26:18Z","timestamp":1767831978498,"version":"3.49.0"},"publisher-location":"Cham","reference-count":60,"publisher":"Springer International Publishing","isbn-type":[{"value":"9783031040825","type":"print"},{"value":"9783031040832","type":"electronic"}],"license":[{"start":{"date-parts":[[2022,1,1]],"date-time":"2022-01-01T00:00:00Z","timestamp":1640995200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,4,17]],"date-time":"2022-04-17T00:00:00Z","timestamp":1650153600000},"content-version":"vor","delay-in-days":106,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Providing explanations in the context of Visual Question Answering (VQA) presents a fundamental problem in machine learning. To obtain detailed insights into the process of generating natural language explanations for VQA, we introduce the large-scale CLEVR-X dataset that extends the CLEVR dataset with natural language explanations. For each image-question pair in the CLEVR dataset, CLEVR-X contains multiple structured textual explanations which are derived from the original scene graphs. By construction, the CLEVR-X explanations are correct and describe the reasoning and visual information that is necessary to answer a given question. We conducted a user study to confirm that the ground-truth explanations in our proposed dataset are indeed complete and relevant. We present baseline results for generating natural language explanations in the context of VQA using two state-of-the-art frameworks on the CLEVR-X dataset. 
Furthermore, we provide a detailed analysis of the explanation generation quality for different question and answer types. Additionally, we study the influence of using different numbers of ground-truth explanations on the convergence of natural language generation (NLG) metrics. The CLEVR-X dataset is publicly available at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/ExplainableML\/CLEVR-X\">https:\/\/github.com\/ExplainableML\/CLEVR-X<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/978-3-031-04083-2_5","type":"book-chapter","created":{"date-parts":[[2022,4,16]],"date-time":"2022-04-16T17:03:23Z","timestamp":1650128603000},"page":"69-88","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["CLEVR-X: A Visual Reasoning Dataset for\u00a0Natural Language Explanations"],"prefix":"10.1007","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8531-3011","authenticated-orcid":false,"given":"Leonard","family":"Salewski","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5807-0576","authenticated-orcid":false,"given":"A. Sophia","family":"Koepke","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3616-8668","authenticated-orcid":false,"given":"Hendrik P. 
A.","family":"Lensch","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1432-7747","authenticated-orcid":false,"given":"Zeynep","family":"Akata","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2022,4,17]]},"reference":[{"key":"5_CR1","doi-asserted-by":"crossref","unstructured":"Agrawal, A., Batra, D., Parikh, D.: Analyzing the behavior of visual question answering models. In: EMNLP, pp. 1955\u20131960. Association for Computational Linguistics (2016)","DOI":"10.18653\/v1\/D16-1203"},{"key":"5_CR2","doi-asserted-by":"crossref","unstructured":"Agrawal, A., Batra, D., Parikh, D., Kembhavi, A.: Don\u2019t just assume; look and answer: overcoming priors for visual question answering. In: CVPR, pp. 4971\u20134980 (2018)","DOI":"10.1109\/CVPR.2018.00522"},{"key":"5_CR3","doi-asserted-by":"publisher","unstructured":"Ahn, L.V., Blum, M., Hopper, N.J., Langford, J.: CAPTCHA: using hard AI problems for security: In: Biham, E. (eds.) EUROCRYPT 2003. LNCS, vol. 2656, pp. 294\u2013311. Springer, Heidelberg (2003). https:\/\/doi.org\/10.1007\/3-540-39200-9_18","DOI":"10.1007\/3-540-39200-9_18"},{"key":"5_CR4","doi-asserted-by":"crossref","unstructured":"Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR, pp. 6077\u20136086 (2018)","DOI":"10.1109\/CVPR.2018.00636"},{"key":"5_CR5","doi-asserted-by":"crossref","unstructured":"Antol, S., et al.: VQA: Visual Question Answering. In: ICCV, pp. 2425\u20132433 (2015)","DOI":"10.1109\/ICCV.2015.279"},{"key":"5_CR6","doi-asserted-by":"publisher","first-page":"14","DOI":"10.1016\/j.inffus.2021.11.008","volume":"81","author":"L Arras","year":"2022","unstructured":"Arras, L., Osman, A., Samek, W.: CLEVR-XAI: a benchmark dataset for the ground truth evaluation of neural network explanations. Inform. 
Fusion 81, 14\u201340 (2022)","journal-title":"Inform. Fusion"},{"key":"5_CR7","doi-asserted-by":"publisher","first-page":"82","DOI":"10.1016\/j.inffus.2019.12.012","volume":"58","author":"AB Arrieta","year":"2020","unstructured":"Arrieta, A.B., et al.: Explainable artificial intelligence (xAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 58, 82\u2013115 (2020)","journal-title":"Inf. Fusion"},{"key":"5_CR8","doi-asserted-by":"crossref","unstructured":"Bach, S., Binder, A., Montavon, G., Klauschen, F., M\u00fcller, K.R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10(7), e0130140 (2015)","DOI":"10.1371\/journal.pone.0130140"},{"key":"5_CR9","unstructured":"Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: ACL Workshop, pp. 65\u201372 (2005)"},{"key":"5_CR10","unstructured":"Brundage, M., et al.: Toward trustworthy AI development: mechanisms for supporting verifiable claims. arXiv preprint arXiv:2004.07213 (2020)"},{"key":"5_CR11","unstructured":"Camburu, O.M., Rockt\u00e4schel, T., Lukasiewicz, T., Blunsom, P.: e-SNLI: Natural language inference with natural language explanations. In: NeurIPS (2018)"},{"key":"5_CR12","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: CVPR, pp. 248\u2013255 (2009)","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"5_CR13","unstructured":"Do, V., Camburu, O.M., Akata, Z., Lukasiewicz, T.: e-SNLI-VE-2.0: corrected visual-textual entailment with natural language explanations. arXiv preprint arXiv:2004.03744 (2020)"},{"key":"5_CR14","doi-asserted-by":"crossref","unstructured":"Fong, R.C., Patrick, M., Vedaldi, A.: Understanding deep networks via extremal perturbations and smooth masks. In: ICCV, pp. 
2950\u20132958 (2019)","DOI":"10.1109\/ICCV.2019.00304"},{"key":"5_CR15","doi-asserted-by":"crossref","unstructured":"Fong, R.C., Vedaldi, A.: Interpretable explanations of black boxes by meaningful perturbation. In: ICCV, pp. 3429\u20133437 (2017)","DOI":"10.1109\/ICCV.2017.371"},{"key":"5_CR16","doi-asserted-by":"crossref","unstructured":"Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. In: EMNLP, pp. 457\u2013468 (2016)","DOI":"10.18653\/v1\/D16-1044"},{"key":"5_CR17","doi-asserted-by":"crossref","unstructured":"Gilpin, L.H., Bau, D., Yuan, B.Z., Bajwa, A., Specter, M., Kagal, L.: Explaining explanations: an overview of interpretability of machine learning. In: IEEE DSAA, pp. 80\u201389 (2018)","DOI":"10.1109\/DSAA.2018.00018"},{"key":"5_CR18","doi-asserted-by":"crossref","unstructured":"Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in Visual Question Answering. In: CVPR, pp. 6904\u20136913 (2017)","DOI":"10.1109\/CVPR.2017.670"},{"key":"5_CR19","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770\u2013778 (2016)","DOI":"10.1109\/CVPR.2016.90"},{"key":"5_CR20","unstructured":"Hendricks, L.A., Hu, R., Darrell, T., Akata, Z.: Generating counterfactual explanations with natural language. arXiv preprint arXiv:1806.09809 (2018)"},{"key":"5_CR21","doi-asserted-by":"crossref","unstructured":"Hendricks, L.A., Hu, R., Darrell, T., Akata, Z.: Grounding visual explanations. In: ECCV, pp. 264\u2013279 (2018)","DOI":"10.1007\/978-3-030-01216-8_17"},{"key":"5_CR22","unstructured":"Holzinger, A., Saranti, A., Mueller, H.: KANDINSKYpatterns - an experimental exploration environment for pattern analysis and machine intelligence. 
arXiv preprint arXiv:2103.00519 (2021)"},{"key":"5_CR23","unstructured":"Hudson, D., Manning, C.D.: Learning by abstraction: the neural state machine. In: NeurIPS (2019)"},{"key":"5_CR24","unstructured":"Hudson, D.A., Manning, C.D.: Compositional attention networks for machine reasoning. In: ICLR (2018)"},{"key":"5_CR25","doi-asserted-by":"crossref","unstructured":"Hudson, D.A., Manning, C.D.: GQA: a new dataset for real-world visual reasoning and compositional question answering. In: CVPR, pp. 6693\u20136702 (2019)","DOI":"10.1109\/CVPR.2019.00686"},{"key":"5_CR26","doi-asserted-by":"crossref","unstructured":"Park, D.H., et al.: Multimodal explanations: justifying decisions and pointing to the evidence. In: CVPR, pp. 8779\u20138788 (2018)","DOI":"10.1109\/CVPR.2018.00915"},{"key":"5_CR27","doi-asserted-by":"crossref","unstructured":"Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick, R.B.: Clevr: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR, pp. 2901\u20132910 (2017)","DOI":"10.1109\/CVPR.2017.215"},{"key":"5_CR28","doi-asserted-by":"crossref","unstructured":"Johnson, J., et al.: Inferring and executing programs for visual reasoning. In: ICCV, pp. 2989\u20132998 (2017)","DOI":"10.1109\/ICCV.2017.325"},{"key":"5_CR29","doi-asserted-by":"crossref","unstructured":"Kayser, M., Camburu, O.M., Salewski, L., Emde, C., Do, V., Akata, Z., Lukasiewicz, T.: e-VIL: a dataset and benchmark for natural language explanations in vision-language tasks. In: ICCV, pp. 1244\u20131254 (2021)","DOI":"10.1109\/ICCV48922.2021.00128"},{"key":"5_CR30","doi-asserted-by":"crossref","unstructured":"Kim, J.M., Choe, J., Akata, Z., Oh, S.J.: Keep calm and improve visual feature attribution. In: ICCV, pp. 
8350\u20138360 (2021)","DOI":"10.1109\/ICCV48922.2021.00824"},{"key":"5_CR31","doi-asserted-by":"crossref","unstructured":"Kim, J., Rohrbach, A., Darrell, T., Canny, J., Akata, Z.: Textual explanations for self-driving vehicles. In: ECCV, pp. 563\u2013578 (2018)","DOI":"10.1007\/978-3-030-01216-8_35"},{"key":"5_CR32","doi-asserted-by":"crossref","unstructured":"Kim, S.S., Meister, N., Ramaswamy, V.V., Fong, R., Russakovsky, O.: Hive: evaluating the human interpretability of visual explanations. arXiv preprint arXiv:2112.03184 (2021)","DOI":"10.1007\/978-3-031-19775-8_17"},{"key":"5_CR33","unstructured":"Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)"},{"key":"5_CR34","doi-asserted-by":"crossref","unstructured":"Kottur, S., Moura, J.M., Parikh, D., Batra, D., Rohrbach, M.: CLEVR-dialog: a diagnostic dataset for multi-round reasoning in visual dialog. In: NAACL, pp. 582\u2013595 (2019)","DOI":"10.18653\/v1\/N19-1058"},{"key":"5_CR35","doi-asserted-by":"crossref","unstructured":"Li, Q., Tao, Q., Joty, S., Cai, J., Luo, J.: VQA-E: explaining, elaborating, and enhancing your answers for visual questions. In: ECCV, pp. 552\u2013567 (2018)","DOI":"10.1007\/978-3-030-01234-2_34"},{"key":"5_CR36","unstructured":"Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: ACL, pp. 74\u201381 (2004)"},{"key":"5_CR37","doi-asserted-by":"crossref","unstructured":"Liu, R., Liu, C., Bai, Y., Yuille, A.L.: CLEVR-Ref+: diagnosing visual reasoning with referring expressions. In: CVPR, pp. 4185\u20134194 (2019)","DOI":"10.1109\/CVPR.2019.00431"},{"key":"5_CR38","doi-asserted-by":"crossref","unstructured":"Marasovi\u0107, A., Bhagavatula, C., Park, J.s., Le Bras, R., Smith, N.A., Choi, Y.: Natural language rationales with full-stack visual reasoning: from pixels to semantic frames to commonsense graphs. In: EMNLP, pp. 
2810\u20132829 (2020)","DOI":"10.18653\/v1\/2020.findings-emnlp.253"},{"key":"5_CR39","doi-asserted-by":"crossref","unstructured":"Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: ACL, pp. 311\u2013318 (2002)","DOI":"10.3115\/1073083.1073135"},{"key":"5_CR40","doi-asserted-by":"crossref","unstructured":"Patro, B., Patel, S., Namboodiri, V.: Robust explanations for visual question answering. In: WACV, pp. 1577\u20131586 (2020)","DOI":"10.1109\/WACV45572.2020.9093295"},{"key":"5_CR41","doi-asserted-by":"crossref","unstructured":"Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: FiLM: visual Reasoning with a general conditioning layer. In: AAAI, vol. 32 (2018)","DOI":"10.1609\/aaai.v32i1.11671"},{"key":"5_CR42","unstructured":"Petsiuk, V., Das, A., Saenko, K.: Rise: randomized input sampling for explanation of black-box models. In: BMVC, p. 151 (2018)"},{"key":"5_CR43","doi-asserted-by":"crossref","unstructured":"Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: visual explanations from deep networks via gradient-based localization. In: ICCV, pp. 618\u2013626 (2017)","DOI":"10.1109\/ICCV.2017.74"},{"key":"5_CR44","doi-asserted-by":"crossref","unstructured":"Shi, J., Zhang, H., Li, J.: Explainable and explicit visual reasoning over scene graphs. In: CVPR, pp. 8376\u20138384 (2019)","DOI":"10.1109\/CVPR.2019.00857"},{"key":"5_CR45","doi-asserted-by":"crossref","unstructured":"Shih, K.J., Singh, S., Hoiem, D.: Where to look: Focus regions for visual question answering. In: CVPR. pp. 4613\u20134621 (2016)","DOI":"10.1109\/CVPR.2016.499"},{"key":"5_CR46","unstructured":"Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. In: ICLR Workshop (2014)"},{"key":"5_CR47","unstructured":"Suarez, J., Johnson, J., Li, F.F.: Ddrprog: a clevr differentiable dynamic reasoning programmer. 
arXiv preprint arXiv:1803.11361 (2018)"},{"key":"5_CR48","unstructured":"Trott, A., Xiong, C., Socher, R.: Interpretable counting for visual question answering. In: ICLR (2018)"},{"key":"5_CR49","unstructured":"Vedantam, R., Szlam, A., Nickel, M., Morcos, A., Lake, B.M.: CURI: a benchmark for productive concept learning under uncertainty. In: ICML, pp. 10519\u201310529 (2021)"},{"key":"5_CR50","doi-asserted-by":"crossref","unstructured":"Vedantam, R., Zitnick, C.L., Parikh, D.: Cider: consensus-based image description evaluation. In: CVPR, pp. 4566\u20134575 (2015)","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"5_CR51","unstructured":"de Vries, H., Bahdanau, D., Murty, S., Courville, A.C., Beaudoin, P.: CLOSURE: assessing systematic generalization of CLEVR models. In: NeurIPS Workshop (2019)"},{"key":"5_CR52","unstructured":"Wu, J., Chen, L., Mooney, R.: Improving VQA and its explanations by comparing competing explanations. In: AAAI Workshop (2021)"},{"key":"5_CR53","doi-asserted-by":"crossref","unstructured":"Wu, J., Mooney, R.: Faithful multimodal explanation for visual question answering. In: ACL Workshop, pp. 103\u2013112 (2019)","DOI":"10.18653\/v1\/W19-4812"},{"key":"5_CR54","unstructured":"Xie, N., Lai, F., Doran, D., Kadav, A.: Visual entailment: a novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706 (2019)"},{"key":"5_CR55","doi-asserted-by":"crossref","unstructured":"Xu, H., Saenko, K.: Ask, attend and answer: exploring question-guided spatial attention for visual question answering. In: ECCV, pp. 451\u2013466 (2016)","DOI":"10.1007\/978-3-319-46478-7_28"},{"key":"5_CR56","doi-asserted-by":"crossref","unstructured":"Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: CVPR, pp. 21\u201329 (2016)","DOI":"10.1109\/CVPR.2016.10"},{"key":"5_CR57","doi-asserted-by":"publisher","unstructured":"Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. 
In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818\u2013833. Springer, Cham (2014). https:\/\/doi.org\/10.1007\/978-3-319-10590-1_53","DOI":"10.1007\/978-3-319-10590-1_53"},{"issue":"10","key":"5_CR58","doi-asserted-by":"publisher","first-page":"1084","DOI":"10.1007\/s11263-017-1059-x","volume":"126","author":"J Zhang","year":"2018","unstructured":"Zhang, J., Bargal, S.A., Lin, Z., Brandt, J., Shen, X., Sclaroff, S.: Top-down neural attention by excitation backprop. Int. J. Comput. Vision 126(10), 1084\u20131102 (2018). https:\/\/doi.org\/10.1007\/s11263-017-1059-x","journal-title":"Int. J. Comput. Vision"},{"key":"5_CR59","doi-asserted-by":"crossref","unstructured":"Zhang, P., Goyal, Y., Summers-Stay, D., Batra, D., Parikh, D.: Yin and yang: balancing and answering binary visual questions. In: CVPR, pp. 5014\u20135022 (2016)","DOI":"10.1109\/CVPR.2016.542"},{"key":"5_CR60","doi-asserted-by":"crossref","unstructured":"Zhu, Y., Groth, O., Bernstein, M., Fei-Fei, L.: Visual7w: grounded question answering in images. In: CVPR, pp. 
4995\u20135004 (2016)","DOI":"10.1109\/CVPR.2016.540"}],"container-title":["Lecture Notes in Computer Science","xxAI - Beyond Explainable AI"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-031-04083-2_5","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,22]],"date-time":"2024-09-22T10:06:05Z","timestamp":1726999565000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-031-04083-2_5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022]]},"ISBN":["9783031040825","9783031040832"],"references-count":60,"URL":"https:\/\/doi.org\/10.1007\/978-3-031-04083-2_5","relation":{},"ISSN":["0302-9743","1611-3349"],"issn-type":[{"value":"0302-9743","type":"print"},{"value":"1611-3349","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022]]},"assertion":[{"value":"17 April 2022","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"xxAI","order":1,"name":"conference_acronym","label":"Conference Acronym","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"International Workshop on Extending Explainable AI Beyond Deep Models and Classifiers","order":2,"name":"conference_name","label":"Conference Name","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Vienna","order":3,"name":"conference_city","label":"Conference City","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Austria","order":4,"name":"conference_country","label":"Conference Country","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"2020","order":5,"name":"conference_year","label":"Conference Year","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"18 July 
2020","order":7,"name":"conference_start_date","label":"Conference Start Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"18 July 2020","order":8,"name":"conference_end_date","label":"Conference End Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"xxai2020","order":10,"name":"conference_id","label":"Conference ID","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"https:\/\/human-centered.ai\/xxai-icml-2020\/","order":11,"name":"conference_url","label":"Conference URL","group":{"name":"ConferenceInfo","label":"Conference Information"}}]}}