This repository lists papers on large language models (LLMs) and causality. The corresponding survey paper is available as a preprint on arXiv.
Contributor: Anpeng Wu.
- Why do large language models (LLMs) require causality?
- How can causality improve the performance of LLMs?
- Do LLMs have the ability to understand and apply causal reasoning?
- What are the primary challenges faced by LLMs? How can causality help address these challenges?
- Sara Abdali, Anjali Parikh, Steve Lim, and Emre Kiciman. Extracting self-consistent causal insights from users feedback with llms and in-context learning. arXiv preprint arXiv:2312.06820, 2023.
- Alessandro Antonucci, Gregorio Piqué, and Marco Zaffalon. Zero-shot causal graph extrapolation from text via llms. arXiv preprint arXiv:2312.14670, 2023.
- Taiyu Ban, Lyvzhou Chen, Xiangyu Wang, and Huanhuan Chen. From query tools to causal architects: Harnessing large language models for advanced causal discovery from data. arXiv preprint arXiv:2306.16902, 2023.
- Amrita Bhattacharjee, Raha Moraffah, Joshua Garland, and Huan Liu. Llms as counterfactual explanation modules: Can chatgpt explain black-box text classifiers? In AAAI 2024 Workshop on Responsible Language Models, Vancouver, BC, Canada, 2024.
- Boxi Cao, Hongyu Lin, Xianpei Han, Fangchao Liu, and Le Sun. Can prompt probe pretrained language models? understanding the invisible risks from a causal view. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5796–5808, 2022.
- Tommaso Caselli and Piek Vossen. The event storyline corpus: A new benchmark for causal and temporal relation extraction. In Proceedings of the Events and Stories in the News Workshop, pages 77–86, 2017.
- Hang Chen, Bingyu Liao, Jing Luo, Wenjing Zhu, and Xinyu Yang. Learning a structural causal model for intuition reasoning in conversation. IEEE Transactions on Knowledge and Data Engineering, 2024.
- Sirui Chen, Mengying Xu, Kun Wang, Xingyu Zeng, Rui Zhao, Shengjie Zhao, and Chaochao Lu. Clear: Can language models really understand causal graphs? arXiv preprint arXiv:2406.16605, 2024.
- Zeming Chen, Qiyue Gao, Antoine Bosselut, Ashish Sabharwal, and Kyle Richardson. Disco: Distilling counterfactuals with large language models. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023.
- Lei Ding, Dengdeng Yu, Jinhan Xie, Wenxing Guo, Shenggang Hu, Meichen Liu, Linglong Kong, Hongsheng Dai, Yanchun Bao, and Bei Jiang. Word embeddings via causal inference: Gender bias reducing and semantic information preserving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11864–11872, 2022.
- Li Du, Xiao Ding, Kai Xiong, Ting Liu, and Bing Qin. e-care: a new dataset for exploring explainable causal reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 432–446, 2022.
- Hao Duong Le, Xin Xia, and Zhang Chen. Multi-agent causal discovery using large language models. arXiv e-prints, pages arXiv–2407, 2024.
- Amir Feder, Yoav Wald, Claudia Shi, Suchi Saria, and David Blei. Causal-structure driven augmentations for text ood generalization. Advances in Neural Information Processing Systems, 36, 2024.
- Eve Fleisig, Aubrie Amstutz, Chad Atalla, Su Lin Blodgett, Hal Daumé III, Alexandra Olteanu, Emily Sheng, Dan Vann, and Hanna Wallach. Fairprism: evaluating fairness-related harms in text generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6231–6251, 2023.
- Jinglong Gao, Xiao Ding, Bing Qin, and Ting Liu. Is chatgpt a good causal reasoner? a comprehensive evaluation. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Yair Ori Gat, Nitay Calderon, Amir Feder, Alexander Chapanin, Amit Sharma, and Roi Reichart. Faithful explanations of black-box nlp models using llm-generated counterfactuals. In The Twelfth International Conference on Learning Representations, 2023.
- David F Jenny, Yann Billeter, Mrinmaya Sachan, Bernhard Schölkopf, and Zhijing Jin. Navigating the ocean of biases: Political bias attribution in language models via causal structures. arXiv preprint arXiv:2311.08605, 2023.
- Zhenlan Ji, Pingchuan Ma, Zongjie Li, and Shuai Wang. Benchmarking and explaining large language model-based code generation: A causality-centric approach. arXiv preprint arXiv:2310.06680, 2023.
- Haitao Jiang, Lin Ge, Yuhe Gao, Jianian Wang, and Rui Song. Large language model for causal decision making. arXiv preprint arXiv:2312.17122, 2023.
- Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, et al. Cladder: A benchmark to assess causal reasoning capabilities of language models. Advances in Neural Information Processing Systems, 36, 2024.
- Zhijing Jin, Jiarui Liu, LYU Zhiheng, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona T Diab, and Bernhard Schölkopf. Can large language models infer causation from correlation? In The Twelfth International Conference on Learning Representations, 2024.
- Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan. Causal reasoning and large language models: Opening a new frontier for causality. arXiv preprint arXiv:2305.00050, 2023.
- Yuheun Kim, Lu Guo, Bei Yu, and Yingya Li. Can chatgpt understand causal language in science claims? In Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, pages 379–389, 2023.
- Dohwan Ko, Ji Soo Lee, Woo-Young Kang, Byungseok Roh, and Hyunwoo J Kim. Large language models are temporal and causal reasoners for video question answering. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Jingling Li, Zeyu Tang, Xiaoyu Liu, Peter Spirtes, Kun Zhang, Liu Leqi, and Yang Liu. Steering llms towards unbiased responses: A causality-guided debiasing framework. arXiv preprint arXiv:2403.08743, 2024.
- Xiaochuan Li, Baoyu Fan, Runze Zhang, Liang Jin, Di Wang, Zhenhua Guo, Yaqian Zhao, and Rengang Li. Image content generation with causal reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 13646–13654, 2024.
- Yongqi Li, Mayi Xu, Xin Miao, Shen Zhou, and Tieyun Qian. Large language models as counterfactual generator: Strengths and weaknesses. arXiv preprint arXiv:2305.14791, 2023.
- Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. Transactions on Machine Learning Research, 2023.
- Victoria Lin, Eli Ben-Michael, and Louis-Philippe Morency. Optimizing language models for human preferences is a causal inference problem. arXiv preprint arXiv:2402.14979, 2024.
- Xiao Liu, Da Yin, Chen Zhang, Yansong Feng, and Dongyan Zhao. The magic of if: Investigating causal reasoning abilities in large language models of code. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023.
- Stephanie Long, Tibor Schuster, and Alexandre Piché. Can large language models build causal graphs? In NeurIPS 2022 Workshop on Causality for Real-world Impact, 2022.
- Yujie Lu, Weixi Feng, Wanrong Zhu, Wenda Xu, Xin Eric Wang, Miguel Eckstein, and William Yang Wang. Neuro-symbolic procedural planning with commonsense prompting. In The Eleventh International Conference on Learning Representations, 2023.
- Aman Madaan, Katherine Hermann, and Amir Yazdanbakhsh. What makes chain-of-thought prompting effective? a counterfactual study. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1448–1535, 2023.
- Rahul Madhavan, Rishabh Garg, Kahini Wadhawan, and Sameep Mehta. Cfl: Causally fair language models through token-level attribute controlled generation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 11344–11358, 2023.
- Valentyn Melnychuk, Dennis Frauen, and Stefan Feuerriegel. Causal transformer for estimating counterfactual outcomes. In International Conference on Machine Learning, pages 15293–15329. PMLR, 2022.
- Xin Miao, Yongqi Li, and Tieyun Qian. Generating commonsense counterfactuals for stable relation extraction. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Paramita Mirza, Rachele Sprugnoli, Sara Tonelli, and Manuela Speranza. Annotating causality in the tempeval-3 corpus. In Proceedings of the EACL 2014 Workshop on Computational Approaches to Causality in Language (CAtoCL), pages 10–19, 2014.
- Allen Nie, Yuhui Zhang, Atharva Shailesh Amdekar, Chris Piech, Tatsunori B Hashimoto, and Tobias Gerstenberg. Moca: Measuring human-language model alignment on causal and moral judgment tasks. Advances in Neural Information Processing Systems, 36, 2024.
- Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI spring symposium series, 2011.
- Raanan Y Rohekar, Yaniv Gurwicz, and Shami Nisimov. Causal interpretation of self-attention in pretrained transformers. Advances in Neural Information Processing Systems, 36, 2024.
- Angelika Romanou, Syrielle Montariol, Debjit Paul, Leo Laugier, Karl Aberer, and Antoine Bosselut. Crab: Assessing the strength of causal relationships between real-world events. arXiv preprint arXiv:2311.04284, 2023.
- Hung-Ting Su, Yulei Niu, Xudong Lin, Winston H Hsu, and Shih-Fu Chang. Language models are causal knowledge extractors for zero-shot video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4950–4959, 2023.
- Yongduo Sui, Xiang Wang, Jiancan Wu, Min Lin, Xiangnan He, and Tat-Seng Chua. Causal attention for interpretable and generalizable graph classification. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1696–1705, 2022.
- Yan Tai, Weichen Fan, Zhao Zhang, and Ziwei Liu. Link-context learning for multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27176–27185, 2024.
- Juanhe TJ Tan. Causal abstraction for chain-of-thought reasoning in arithmetic word problems. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 155–168, 2023.
- Ziyi Tang, Ruilin Wang, Weixing Chen, Keze Wang, Yang Liu, Tianshui Chen, and Liang Lin. Towards causalgpt: A multi-agent approach for faithful knowledge reasoning via promoting causal consistency in llms. arXiv preprint arXiv:2308.11914, 2023.
- Ruibo Tu, Chao Ma, and Cheng Zhang. Causal-discovery performance of chatgpt in the context of neuropathic pain diagnosis. arXiv preprint arXiv:2301.13819, 2023.
- Aniket Vashishtha, Abbavaram Gowtham Reddy, Abhinav Kumar, Saketh Bachu, Vineeth N Balasubramanian, and Amit Sharma. Causal inference using llm-guided discovery. In AAAI 2024 Workshop on "Are Large Language Models Simply Causal Parrots?", 2023.
- Yuxuan Wan, Wenxuan Wang, Pinjia He, Jiazhen Gu, Haonan Bai, and Michael R Lyu. Biasasker: Measuring the bias in conversational ai system. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 515–527, 2023.
- Fei Wang, Wenjie Mo, Yiwei Wang, Wenxuan Zhou, and Muhao Chen. A causal view of entity bias in (large) language models. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Xiaozhi Wang, Yulin Chen, Ning Ding, Hao Peng, Zimu Wang, Yankai Lin, Xu Han, Lei Hou, Juanzi Li, Zhiyuan Liu, et al. Maven-ere: A unified large-scale dataset for event coreference, temporal, causal, and subevent relation extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 926–941, 2022.
- Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, and Noah Goodman. Interpretability at scale: Identifying causal mechanisms in alpaca. Advances in Neural Information Processing Systems, 36, 2024.
- Xu Yang, Hanwang Zhang, Guojun Qi, and Jianfei Cai. Causal attention for vision-language tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9847–9857, 2021.
- Wenhao Yu, Meng Jiang, Peter Clark, and Ashish Sabharwal. Ifqa: A dataset for open-domain question answering under counterfactual presuppositions. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Matej Zecevic, Moritz Willig, Devendra Singh Dhami, and Kristian Kersting. Causal parrots: Large language models may talk causality but are not causal. Transactions on Machine Learning Research, 2023.
- Jiaqi Zhang, Joel Jennings, Cheng Zhang, and Chao Ma. Towards causal foundation model: on duality between causal inference and attention. 2023.
- Li Zhang, Hainiu Xu, Yue Yang, Shuyan Zhou, Weiqiu You, Manni Arora, and Chris Callison-burch. Causal reasoning of entities and events in procedural texts. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023.
- Shuo Zhang, Liangming Pan, Junzhou Zhao, and William Yang Wang. Mitigating language model hallucination with interactive question-knowledge alignment. arXiv preprint arXiv:2305.13669, 2023.
- Haiteng Zhao, Chang Ma, Xinshuai Dong, Anh Tuan Luu, Zhi-Hong Deng, and Hanwang Zhang. Certified robustness against natural language attacks by causal intervention. In International Conference on Machine Learning, pages 26958–26970. PMLR, 2022.
- Wei Zhao, Zhe Li, and Jun Sun. Causality analysis for evaluating the security of large language models. arXiv e-prints, pages arXiv–2312, 2023.
- Junhao Zheng, Qianli Ma, Shengjie Qiu, Yue Wu, Peitian Ma, Junlong Liu, Huawen Feng, Xichen Shang, and Haibin Chen. Preserving commonsense knowledge from pre-trained language models via causal inference. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023.
- Fan Zhou, Yuzhou Mao, Liu Yu, Yi Yang, and Ting Zhong. Causal-debias: Unifying debiasing in pretrained language models and fine-tuning via causal invariant learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4227–4241, 2023.
| Benchmark | LLM | Target | Tasks |
|---|---|---|---|
| Crab | Flan-Alpaca-GPT4-XL, GPT-3, GPT-4 | Assessing the Strength of Causal Relationships Between Real-World Events | Pairwise Causal Discovery (yes/no); Pairwise Causal Strength (high/medium/low/no); Pairwise Causal Score (0-100); Primary Causal Event (A/B/C/D) |
| CausalBench | BERT, RoBERTa, DeBERTa, DistilBERT, LLAMA, OPT, InternLM, Falcon, GPT3.5-Turbo, GPT4 | Exploring the capabilities of LLMs in understanding causal relationships of varying depths and difficulties | Direct/Indirect Correlation (yes/no); Causal Discovery in Skeleton (yes/no); Causal Discovery in DAG (yes/no) |
| CORR2CAUSE | BERT, RoBERTa, BART, DeBERTa, DistilBERT, GPT-3, GPT-3.5, GPT-4 | Evaluating the ability of large language models (LLMs) to infer causality from correlational statements | Pairwise Causal Hypothesis (yes/no) |
| Moca | RoBERTa, ALBERT, Electra, GPT-2, GPT-3, GPT-3.5, GPT-4, Alpaca-7B, Claude | Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks | Causal Judgment (yes/no); Moral Judgment (yes/no) |
| Cladder | GPT-4, GPT-3.5, GPT-3, LLaMa, Alpaca | Assessing causal reasoning in language models | Association (yes/no); Intervention (yes/no); Counterfactuals (yes/no) |
| IfQA | Codex and ChatGPT | Assessing models’ abilities to handle counterfactual reasoning in open-domain question-answering (QA) tasks | Counterfactual Reasoning in IfQA (Entity/Number/Date/Data) |
| Clomo | GPT-3.5-Turbo, GPT-4, GPT-4o, LLaMA, LLaMA2, Flan-T5, ChatGLM2, Baichuan2, InternLM, Vicuna-v1.5, Qwen, WizardLM | Assessing the counterfactual reasoning capabilities of large language models (LLMs) | Counterfactual Reasoning in Logical Modification (Necessary Assumption/Sufficient Assumption/Strengthen/Weaken) |
| Crass | BART, RoBERTa, MPNet, DeBERTa v3, GPT-3, T0pp | Assessing the counterfactual reasoning capabilities of large language models (LLMs) | Counterfactual Reasoning (choose correct answer) |
| QRData | GPT-4, GPT-3.5 Turbo, Gemini-Pro, Llama-2-chat, WizardMath, Deepseek-coder-instruct, CodeLlama-instruct, TableLlama, and AgentLM | Evaluating the statistical and causal reasoning abilities of large language models (LLMs) | Statistical Reasoning (Probability/Distribution/Estimation/Hypothesis Testing/Prediction); Causal Reasoning (Confounding/Causal Discovery/Causal Effect Estimation/Instrumental Variables/Panel Data); Causal Effect Estimation (ATE/ATT/ATC) |
| ChatGPT4CausalReasoning | BERT, RoBERTa, LLaMA, FLAN-T5, GPT-2, GPT-3, GPT-3.5, GPT-4 | Evaluating the causal reasoning capabilities of ChatGPT | Event Causality Identification (yes/no); Causal Discovery (Multiple Choice/Binary Classification); Causal Explanation Generation (Text Generation) |
| Counterfactual Simulatability | GPT-3.5, GPT-4 | Evaluating the counterfactual simulatability of natural language explanations | Multi-hop Factual Reasoning (Text Generation); Reward Modeling (Text Generation) |
| New Frontier | GPT-3, GPT-3.5, GPT-4 | Opening a new frontier for causality | Pairwise Causal Discovery (yes/no or A/B); Counterfactual Reasoning (choose correct answer) |
| Causal Parrots | Luminous, OPT, GPT-3, GPT-4 | Investigating if large language models (LLMs) truly understand causality or just mimic learned correlations | Common Sense Inference (yes/no); Causal Discovery (causal/non-causal); Knowledge Base Fact Embeddings (causal/anti-causal) |
| CaLM | Baichuan1, Baichuan2, InternLM, Llama 2, Qwen, Koala, Wizardcoder, Vicuna, GPT-3, GPT-3.5-Turbo, GPT-4, Claude2 | Evaluating causal reasoning capabilities of large language models (LLMs) | Four Question Types in 92 Causal Targets: Binary classification, Choice selection, Open-ended generation, and Probability calculation |
| CREPE | T5, T0, GPT-3, ChatGPT, Codex | Assessing the ability of large language models (LLMs) to reason causally about events and entities in procedural texts | Causal Reasoning in Procedural Texts (More Likely/Less Likely/Equally Likely); Factual Reasoning in Procedural Texts (True/False or Yes/No) |
| Knowledge | GPT-3.5, GPT-4, GPT-4 Turbo, Claude 2, LLaMa2, Mistral | Exploring the causal reasoning of large language models (LLMs) | Causal Discovery (Omit Knowledge, Omit Data); Pairwise Causal Discovery (Causal Direction); Reverse Causal Discovery (Causal Direction) |
| Causal Language | GPT-3.5, ChatGPT | Evaluating ChatGPT’s ability to understand causal language in science papers and news | Understand Causal Language (direct causal/conditional causal/correlational/no relationship) |
Contributions of all kinds to this paper list are welcome. Feel free to open a pull request.
@article{wu2024causality,
title={Causality for Large Language Models: A Survey},
author={Wu, Anpeng and Kuang, Kun and Zhu, Minqin and Wang, Yingrong and Zheng, Yujia and Han, Kairong and Li, Baohong and Chen, Guangyi and Wu, Fei and Zhang, Kun},
journal={arXiv preprint arXiv:2410.15319},
year={2024}
}