REACT_LLM: Benchmark for Evaluating LLM Integration with Causal Features in Clinical Prognostic Tasks
Large Language Models (LLMs) and causal learning each hold strong potential for clinical decision making (CDM). However, their synergy remains poorly understood, largely due to the lack of systematic benchmarks evaluating their integration in clinical risk prediction. In real-world healthcare, identifying features with causal influence on outcomes is crucial for actionable and trustworthy predictions. While recent work highlights LLMs’ emerging causal reasoning abilities, comprehensive benchmarks that assess their causal learning and their performance when informed by causal features in clinical risk prediction are still lacking. To address this, we introduce REACT-LLM, a benchmark designed to evaluate whether combining LLMs with causal features can enhance clinical prognostic performance and potentially outperform traditional machine learning (ML) methods. Unlike existing LLM-clinical benchmarks that often focus on a limited set of outcomes, REACT-LLM evaluates 7 clinical outcomes across 2 real-world datasets, comparing 15 prominent LLMs, 6 traditional ML models, and 3 causal discovery (CD) algorithms. Our findings indicate that while LLMs perform reasonably in clinical prognostics, they have not yet outperformed traditional ML models. Integrating causal features derived from CD algorithms into LLMs offers limited performance gains, primarily because the strict assumptions of many CD methods are often violated in complex clinical data. While direct integration yields limited improvement, our benchmark reveals a more promising synergy: LLMs serve effectively as knowledge-rich collaborators for identifying and optimizing causal features. Additionally, in-context learning improves LLM predictions when prompts are tailored to the task and model. Different LLMs show varying sensitivity to structured data encoding formats: for example, open-source models perform better with JSON, while smaller models benefit from narrative serialization. These findings highlight the need to match prompts and data formats to model architecture and pretraining.
Extended version: https://arxiv.org/pdf/2511.07127
We support the following datasets:
- MIMIC-III v1.4
- MIMIC-IV v3.0
We provide three common tasks for clinical prediction:
| Task | Description |
|---|---|
| `DIEINHOSPITAL` | In-hospital mortality prediction |
| `Readmission_30` | 30-day readmission prediction |
| `Multiple_ICUs` | Multiple ICU admissions |
| `sepsis_all` | Sepsis development prediction |
| `FirstICU24_AKI_ALL` | Acute kidney injury within 24h |
| `LOS_Hospital` | Hospital length of stay |
| `ICU_within_12hr_of_admit` | Early ICU transfer prediction |
Clone the repository.

Install the dependencies:

```bash
pip install -r requirements.txt
```

The structure of the important files:

```
REACT_LLM/
├── prediction.py # Main prediction engine
├── retry_failed_predictions.py # Failure recovery system
├── requirements.txt # Python dependencies
├── selected_features.txt # Feature selection configuration
├── features-desc.csv # Feature descriptions mapping
├── CORL_F.txt # CORL algorithm features
├── DirectLiNGAM_F.txt # DirectLiNGAM algorithm features
├── GES_F.txt # GES algorithm features
├── icl_examples/ # In-context learning prompt configuration
├── LLMs_CD/ # LLM-generated causal discovery configuration
├── optimization_results/ # Feature optimization configuration
├── scripts/ # Additional utility scripts
│ ├── generate_causal_features.py # LLM-based causal feature generation
│ ├── optimize_features.py # Feature optimization and refinement
│ ├── LLM_MIMIC_Data_preprocess/ # MIMIC data preprocessing notebooks
│ │ ├── MIMIC_patients_0.ipynb # Patient demographics and clinical data processing
│ │ ├── MIMIC_TS_CHART_LAB.ipynb # Time-series vital signs and lab data processing
│ │ └── ML_Models.ipynb # Traditional ML baselines and model training
│ └── Causal_Discovery/ # Causal discovery algorithm implementation
│ ├── CORL.ipynb # CORL causal discovery algorithm
│ ├── DirectLiNGAM.ipynb # DirectLiNGAM causal discovery algorithm
│ └── GES.ipynb # GES causal discovery algorithm
├── results/ # Prediction results and logs
└── README.md                       # This documentation
```

The framework requires preprocessed MIMIC datasets with standardized features. Raw MIMIC data must be processed through the provided preprocessing pipeline to generate the required input format.
- Download MIMIC-III v1.4 and MIMIC-IV v3.0 datasets through PhysioNet.
- Complete PhysioNet credentialing and sign data use agreements.
- Open `scripts/LLM_MIMIC_Data_preprocess/MIMIC_patients_0.ipynb`.
- Update the dataset paths in Cell 2:

  ```python
  PATIENTS = pd.read_csv('your_path/PATIENTS.csv.gz')
  ADMISSIONS = pd.read_csv('your_path/ADMISSIONS.csv.gz')
  ICUSTAYS = pd.read_csv('your_path/ICUSTAYS.csv.gz')
  ```

- Update the paths in `MIMIC_TS_CHART_LAB.ipynb` and `ML_Models.ipynb` accordingly.
Execute the preprocessing notebooks in sequence:

```bash
# 1. Process patient demographics, diagnoses, procedures, and medications
jupyter nbconvert --execute scripts/LLM_MIMIC_Data_preprocess/MIMIC_patients_0.ipynb

# 2. Process time-series vital signs and laboratory data
jupyter nbconvert --execute scripts/LLM_MIMIC_Data_preprocess/MIMIC_TS_CHART_LAB.ipynb

# 3. Generate the final dataset for machine learning baselines
jupyter nbconvert --execute scripts/LLM_MIMIC_Data_preprocess/ML_Models.ipynb
```

The preprocessing pipeline generates datasets with:
- Predictive features: 4 basic + 65 diagnostic + 27 procedural + 55 medication + 115 time-series features
- 7 clinical outcome labels: DIEINHOSPITAL, Readmission_30, Multiple_ICUs, sepsis_all, FirstICU24_AKI_ALL, LOS_Hospital, ICU_within_12hr_of_admit
- Quality filters: Adults (≥18 years), first ICU stays, LOS ≥1 day
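To verify the output, the following minimal sketch loads the preprocessed data and checks that the 7 outcome columns are present; the filename `mimic_preprocessed.csv` is a hypothetical stand-in for the CSV produced by the preprocessing pipeline:

```python
import pandas as pd

# Hypothetical filename; replace with the CSV produced by the preprocessing pipeline.
df = pd.read_csv("mimic_preprocessed.csv")

# The 7 clinical outcome labels listed above.
labels = [
    "DIEINHOSPITAL", "Readmission_30", "Multiple_ICUs", "sepsis_all",
    "FirstICU24_AKI_ALL", "LOS_Hospital", "ICU_within_12hr_of_admit",
]

# Report which outcome columns are present and their mean value
# (the mean equals the positive rate for binary outcomes).
missing = [c for c in labels if c not in df.columns]
print("Missing label columns:", missing or "none")
for col in [c for c in labels if c in df.columns]:
    print(f"{col}: mean = {df[col].mean():.3f}")
```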
- Configure the dataset path. Edit the data file path in `prediction.py`:

  ```python
  data_filename = 'your_data_file.csv'  # Modify the data file here
  ```
- Configure the model settings. Edit `MODEL_CONFIG` in `prediction.py` (example configuration):

  ```python
  MODEL_CONFIG = {
      "model_name": "gpt-4",                # Example: gpt-4, claude-3-sonnet, qwen3-8b
      "display_name": "GPT-4",              # Display name for the model
      "api_type": "openai",                 # Options: "openai" or "dashscope"
      "label": "DIEINHOSPITAL",             # Prediction task (see supported tasks below)
      "prompt_mode": "DIRECTLY_PROMPTING",  # Prompting strategy (see supported modes below)
      # API credentials (replace with your actual credentials)
      "openai_config": {
          "api_key": "your_actual_api_key",  # Replace with your API key
          "api_base": "your_api_endpoint"    # Replace with your API endpoint
      }
  }
  ```
Supported Prediction Tasks: `DIEINHOSPITAL`, `Readmission_30`, `Multiple_ICUs`, `sepsis_all`, `FirstICU24_AKI_ALL`, `LOS_Hospital`, `ICU_within_12hr_of_admit`

Supported Prompt Modes:
- Prompting strategies: `DIRECTLY_PROMPTING`, `CHAIN_OF_THOUGHT`, `SELF_REFLECTION`, `ROLE_PLAYING`, `IN_CONTEXT_LEARNING`
- Data format variants (illustrated in the sketch below): `CSV_DIRECT`, `CSV_RAW`, `JSON_STRUCTURED`, `LATEX_TABLE`, `NATURAL_LANGUAGE`
- Causal-informed modes: `CORL_FILTERED`, `DirectLiNGAM_FILTERED`, `GES_FILTERED`, `CD_FEATURES_OPTIMIZED`, `LLM_CD_FEATURES`
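The data-format modes control how each patient record is serialized into the prompt. The sketch below only illustrates the idea with a hypothetical `serialize_patient` helper and a toy record; the exact templates used by `prediction.py` may differ:

```python
import json
import pandas as pd

def serialize_patient(row: pd.Series, mode: str) -> str:
    """Illustrative serializers for a single patient record."""
    if mode == "JSON_STRUCTURED":
        return json.dumps(row.to_dict(), indent=2, default=str)
    if mode == "NATURAL_LANGUAGE":
        return "; ".join(f"{name} is {value}" for name, value in row.items())
    if mode == "CSV_DIRECT":
        return row.to_frame().T.to_csv(index=False)
    raise ValueError(f"Unsupported mode: {mode}")

# Toy record for demonstration only.
patient = pd.Series({"age": 67, "gender": "M", "heart_rate_mean": 92.5})
print(serialize_patient(patient, "JSON_STRUCTURED"))
print(serialize_patient(patient, "NATURAL_LANGUAGE"))
```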
Debug Mode (recommended for first use):

```
# Edit prediction.py to enable debug mode
DEBUG_MODE = True
DEBUG_PATIENTS = 3

# Run prediction
python prediction.py
```

Production Mode (for full dataset):

```
# Edit prediction.py to disable debug mode
DEBUG_MODE = False

# Run prediction
python prediction.py
```

Configure `retry_failed_predictions.py` for failed prediction recovery:
```python
# Edit retry script configuration
MANUAL_MODE = True  # Enable manual mode
MANUAL_CONFIG = {
    "input_csv": "your_data_file.csv",   # Original dataset
    "csv_file": "failed_result.csv",     # Failed prediction file
    "json_file": "failed_result.json",   # Failed experiment log
    "override_model_config": {...}       # Model config (or None for auto-extract)
}
```

Then run the retry script:

```bash
python retry_failed_predictions.py
```

Output files:

```
*.csv   # Prediction results with patient IDs, probabilities, and ground truth
*.json  # Experimental logs with prompt-response pairs and model configurations
*.txt   # Performance metrics including F1-Score, AUROC, and AUPRC
```
The framework automatically calculates:
- F1-Score: Harmonic mean of precision and recall
- AUROC: Area Under the Receiver Operating Characteristic Curve
- AUPRC: Area Under the Precision-Recall Curve
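If you want to recompute these metrics yourself from a results CSV, a minimal scikit-learn sketch is shown below; the column names (`ground_truth`, `prediction_probability`) are assumptions and should be adjusted to your result files:

```python
import pandas as pd
from sklearn.metrics import f1_score, roc_auc_score, average_precision_score

# Hypothetical column names; adjust to match the columns in your results CSV.
results = pd.read_csv("results/your_result.csv")
y_true = results["ground_truth"]
y_prob = results["prediction_probability"]
y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities for F1

print("F1-Score:", f1_score(y_true, y_pred))
print("AUROC:   ", roc_auc_score(y_true, y_prob))
print("AUPRC:   ", average_precision_score(y_true, y_prob))
```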
Evaluated LLMs: Qwen3-8B, Qwen3-14B, Qwen3-235B, Llama-3.1-405B, DeepSeek-R1, DeepSeek-V3, Gemini-2-Pro, Gemini-2-Flash, GPT-o1, GPT-o3-mini, GPT-4o, GPT-4o-mini, Claude-4, Claude-3.5-Haiku, Claude-3.7-Sonnet.
The framework incorporates three causal discovery algorithms; implementation code is available in `scripts/Causal_Discovery/`:
- CORL (Causal Ordering via Reinforcement Learning): features in `CORL_F.txt`
- DirectLiNGAM (Direct Linear Non-Gaussian Acyclic Model): features in `DirectLiNGAM_F.txt`
- GES (Greedy Equivalence Search): features in `GES_F.txt`
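As an illustration of running such an algorithm on the preprocessed features, here is a minimal DirectLiNGAM sketch using the `lingam` Python package; whether the repository's notebooks use this particular package, and the filename `mimic_preprocessed.csv`, are assumptions:

```python
import numpy as np
import pandas as pd
from lingam import DirectLiNGAM  # assumes the `lingam` package is installed

# Hypothetical preprocessed dataset; DirectLiNGAM assumes linear, non-Gaussian,
# acyclic relationships, which complex clinical data may violate.
X = pd.read_csv("mimic_preprocessed.csv").select_dtypes(include=[np.number]).dropna()

model = DirectLiNGAM()
model.fit(X)

# adjacency_matrix_[i, j] is the estimated coefficient of feature j on feature i.
print("Estimated causal order (column indices):", model.causal_order_)
print("Adjacency matrix shape:", model.adjacency_matrix_.shape)
```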
- Standard Approaches: Direct prompting, Chain-of-Thought
- Reflective Methods: Self-reflection, role-playing
- Context-Aware: In-context learning with clinical examples
- Data Format Variants: CSV, JSON, LaTeX table, natural language
- Causal-Informed: Feature-filtered approaches using causal discovery
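For the in-context learning strategy, prompts bundle a few labeled example patients with the query patient. The sketch below is a simplified illustration with a hypothetical `build_icl_prompt` helper and toy examples; the actual templates live in `icl_examples/` and `prediction.py`:

```python
# Hypothetical helper illustrating in-context learning prompt assembly.
def build_icl_prompt(examples, query_text, task="DIEINHOSPITAL"):
    parts = [f"Task: predict {task}. Answer with a probability between 0 and 1."]
    for i, (text, label) in enumerate(examples, start=1):
        parts.append(f"Example {i}:\nPatient: {text}\nOutcome: {label}")
    parts.append(f"Now predict for this patient:\nPatient: {query_text}\nOutcome:")
    return "\n\n".join(parts)

# Toy examples for demonstration only.
examples = [
    ("age 81, mechanical ventilation, lactate 5.2 mmol/L", 1),
    ("age 45, elective admission, stable vital signs", 0),
]
print(build_icl_prompt(examples, "age 67, suspected sepsis, rising creatinine"))
```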
The system utilizes structured feature definitions in `selected_features.txt` (see the sketch after this list):
- Basic Features: Demographics and admission characteristics
- Diagnostic Features: ICD-based diagnostic codes
- Procedural Features: Medical procedures and interventions
- Medication Features: Pharmacological treatments
- Time-Series Features: Physiological measurements and vital signs
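To see which of these features a run actually uses, the sketch below reads the feature selection configuration and intersects it with the dataset columns; it assumes `selected_features.txt` lists one feature name per line and uses the hypothetical filename `mimic_preprocessed.csv`:

```python
import pandas as pd

# Assumes one feature name per line; the actual file layout may differ.
with open("selected_features.txt") as f:
    selected = [line.strip() for line in f if line.strip() and not line.startswith("#")]

df = pd.read_csv("mimic_preprocessed.csv")  # hypothetical preprocessed dataset
available = [c for c in selected if c in df.columns]
print(f"{len(available)}/{len(selected)} selected features found in the dataset")
```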
This work builds upon the MIMIC-III and MIMIC-IV critical care databases and incorporates established causal discovery algorithms. We acknowledge the contributions of the clinical informatics and machine learning communities in developing these foundational resources.