REACT_LLM: Benchmark for Evaluating LLM Integration with Causal Features in Clinical Prognostic Tasks
Large Language Models (LLMs) and causal learning each hold strong potential for clinical decision making (CDM). However, their synergy remains poorly understood, largely due to the lack of systematic benchmarks evaluating their integration in clinical risk prediction. In real-world healthcare, identifying features with causal influence on outcomes is crucial for actionable and trustworthy predictions. While recent work highlights LLMs’ emerging causal reasoning abilities, comprehensive benchmarks that assess their causal learning and their performance when informed by causal features in clinical risk prediction are still lacking. To address this, we introduce REACT-LLM, a benchmark designed to evaluate whether combining LLMs with causal features can enhance clinical prognostic performance and potentially outperform traditional machine learning (ML) methods. Unlike existing LLM-clinical benchmarks that often focus on a limited set of outcomes, REACT-LLM evaluates 7 clinical outcomes across 2 real-world datasets, comparing 15 prominent LLMs, 6 traditional ML models, and 3 causal discovery (CD) algorithms. Our findings indicate that while LLMs perform reasonably in clinical prognostics, they have not yet outperformed traditional ML models. Integrating causal features derived from CD algorithms into LLMs offers limited performance gains, primarily because the strict assumptions of many CD methods are often violated in complex clinical data. While direct integration yields limited improvement, our benchmark reveals a more promising synergy: LLMs serve effectively as knowledge-rich collaborators for identifying and optimizing causal features. Additionally, in-context learning improves LLM predictions when prompts are tailored to the task and model. Different LLMs show varying sensitivity to structured data encoding formats: for example, open-source models perform better with JSON, while smaller models benefit from narrative serialization. These findings highlight the need to match prompts and data formats to model architecture and pretraining.
Extended version: https://arxiv.org/pdf/2511.07127
We support the following datasets:
- MIMIC-III v1.4
- MIMIC-IV v3.0
We provide three common tasks for clinical prediction:
| Task | Description |
|---|---|
| `DIEINHOSPITAL` | In-hospital mortality prediction |
| `Readmission_30` | 30-day readmission prediction |
| `Multiple_ICUs` | Multiple ICU admissions |
| `sepsis_all` | Sepsis development prediction |
| `FirstICU24_AKI_ALL` | Acute kidney injury within 24h |
| `LOS_Hospital` | Hospital length of stay |
| `ICU_within_12hr_of_admit` | Early ICU transfer prediction |
Clone the repository.

Install the dependencies:

```bash
pip install -r requirements.txt
```

The structure of the important files:

```
REACT_LLM/
├── prediction.py # Main prediction engine
├── retry_failed_predictions.py # Failure recovery system
├── requirements.txt # Python dependencies
├── selected_features.txt # Feature selection configuration
├── features-desc.csv # Feature descriptions mapping
├── CORL_F.txt # CORL algorithm features
├── DirectLiNGAM_F.txt # DirectLiNGAM algorithm features
├── GES_F.txt # GES algorithm features
├── icl_examples/ # In-context learning prompt configuration
├── LLMs_CD/ # LLM-generated causal discovery configuration
├── optimization_results/ # Feature optimization configuration
├── scripts/ # Additional utility scripts
│ ├── generate_causal_features.py # LLM-based causal feature generation
│ ├── optimize_features.py # Feature optimization and refinement
│ ├── LLM_MIMIC_Data_preprocess/ # MIMIC data preprocessing notebooks
│ │ ├── MIMIC_patients_0.ipynb # Patient demographics and clinical data processing
│ │ ├── MIMIC_TS_CHART_LAB.ipynb # Time-series vital signs and lab data processing
│ │ └── ML_Models.ipynb # Traditional ML baselines and model training
│ └── Causal_Discovery/ # Causal discovery algorithm implementation
│ ├── CORL.ipynb # CORL causal discovery algorithm
│ ├── DirectLiNGAM.ipynb # DirectLiNGAM causal discovery algorithm
│ └── GES.ipynb # GES causal discovery algorithm
├── results/ # Prediction results and logs
└── README.md                       # This documentation
```

The framework requires preprocessed MIMIC datasets with standardized features. Raw MIMIC data must be processed through the provided preprocessing pipeline to generate the required input format.
- Download MIMIC-III v1.4 and MIMIC-IV v3.0 datasets through PhysioNet.
- Complete PhysioNet credentialing and sign data use agreements.
- Open `scripts/LLM_MIMIC_Data_preprocess/MIMIC_patients_0.ipynb`.
- Update the dataset paths in Cell 2:

  ```python
  PATIENTS = pd.read_csv('your_path/PATIENTS.csv.gz')
  ADMISSIONS = pd.read_csv('your_path/ADMISSIONS.csv.gz')
  ICUSTAYS = pd.read_csv('your_path/ICUSTAYS.csv.gz')
  ```

- Update the paths in `MIMIC_TS_CHART_LAB.ipynb` and `ML_Models.ipynb` accordingly.
Execute the preprocessing notebooks in sequence:

```bash
# 1. Process patient demographics, diagnoses, procedures, and medications
jupyter nbconvert --execute scripts/LLM_MIMIC_Data_preprocess/MIMIC_patients_0.ipynb

# 2. Process time-series vital signs and laboratory data
jupyter nbconvert --execute scripts/LLM_MIMIC_Data_preprocess/MIMIC_TS_CHART_LAB.ipynb

# 3. Generate the final dataset for machine learning baselines
jupyter nbconvert --execute scripts/LLM_MIMIC_Data_preprocess/ML_Models.ipynb
```

The preprocessing pipeline generates datasets with:
- Predictive features: 4 basic + 65 diagnostic + 27 procedural + 55 medication + 115 time-series features
- 7 clinical outcome labels: DIEINHOSPITAL, Readmission_30, Multiple_ICUs, sepsis_all, FirstICU24_AKI_ALL, LOS_Hospital, ICU_within_12hr_of_admit
- Quality filters: Adults (≥18 years), first ICU stays, LOS ≥1 day
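To verify the output, the following minimal sketch loads the preprocessed data and checks that the 7 outcome columns are present; the filename `mimic_preprocessed.csv` is a hypothetical stand-in for the CSV produced by the preprocessing pipeline:

```python
import pandas as pd

# Hypothetical filename; replace with the CSV produced by the preprocessing pipeline.
df = pd.read_csv("mimic_preprocessed.csv")

# The 7 clinical outcome labels listed above.
labels = [
    "DIEINHOSPITAL", "Readmission_30", "Multiple_ICUs", "sepsis_all",
    "FirstICU24_AKI_ALL", "LOS_Hospital", "ICU_within_12hr_of_admit",
]

# Report which outcome columns are present and their mean value
# (the mean equals the positive rate for binary outcomes).
missing = [c for c in labels if c not in df.columns]
print("Missing label columns:", missing or "none")
for col in [c for c in labels if c in df.columns]:
    print(f"{col}: mean = {df[col].mean():.3f}")
```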
- Configure the dataset path. Edit the data file path in `prediction.py`:

  ```python
  data_filename = 'your_data_file.csv'  # Modify the data file here
  ```
- Configure the model settings. Edit `MODEL_CONFIG` in `prediction.py` (example configuration):

  ```python
  MODEL_CONFIG = {
      "model_name": "gpt-4",                # Example: gpt-4, claude-3-sonnet, qwen3-8b
      "display_name": "GPT-4",              # Display name for the model
      "api_type": "openai",                 # Options: "openai" or "dashscope"
      "label": "DIEINHOSPITAL",             # Prediction task (see supported tasks below)
      "prompt_mode": "DIRECTLY_PROMPTING",  # Prompting strategy (see supported modes below)
      # API credentials (replace with your actual credentials)
      "openai_config": {
          "api_key": "your_actual_api_key",  # Replace with your API key
          "api_base": "your_api_endpoint"    # Replace with your API endpoint
      }
  }
  ```
Supported Prediction Tasks: `DIEINHOSPITAL`, `Readmission_30`, `Multiple_ICUs`, `sepsis_all`, `FirstICU24_AKI_ALL`, `LOS_Hospital`, `ICU_within_12hr_of_admit`

Supported Prompt Modes:
- Prompting strategies: `DIRECTLY_PROMPTING`, `CHAIN_OF_THOUGHT`, `SELF_REFLECTION`, `ROLE_PLAYING`, `IN_CONTEXT_LEARNING`
- Data format variants (illustrated in the sketch below): `CSV_DIRECT`, `CSV_RAW`, `JSON_STRUCTURED`, `LATEX_TABLE`, `NATURAL_LANGUAGE`
- Causal-informed modes: `CORL_FILTERED`, `DirectLiNGAM_FILTERED`, `GES_FILTERED`, `CD_FEATURES_OPTIMIZED`, `LLM_CD_FEATURES`
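The data-format modes control how each patient record is serialized into the prompt. The sketch below only illustrates the idea with a hypothetical `serialize_patient` helper and a toy record; the exact templates used by `prediction.py` may differ:

```python
import json
import pandas as pd

def serialize_patient(row: pd.Series, mode: str) -> str:
    """Illustrative serializers for a single patient record."""
    if mode == "JSON_STRUCTURED":
        return json.dumps(row.to_dict(), indent=2, default=str)
    if mode == "NATURAL_LANGUAGE":
        return "; ".join(f"{name} is {value}" for name, value in row.items())
    if mode == "CSV_DIRECT":
        return row.to_frame().T.to_csv(index=False)
    raise ValueError(f"Unsupported mode: {mode}")

# Toy record for demonstration only.
patient = pd.Series({"age": 67, "gender": "M", "heart_rate_mean": 92.5})
print(serialize_patient(patient, "JSON_STRUCTURED"))
print(serialize_patient(patient, "NATURAL_LANGUAGE"))
```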
Debug Mode (recommended for first use):

```
# Edit prediction.py to enable debug mode
DEBUG_MODE = True
DEBUG_PATIENTS = 3

# Run prediction
python prediction.py
```

Production Mode (for full dataset):

```
# Edit prediction.py to disable debug mode
DEBUG_MODE = False

# Run prediction
python prediction.py
```

Configure `retry_failed_predictions.py` for failed prediction recovery:
```python
# Edit retry script configuration
MANUAL_MODE = True  # Enable manual mode
MANUAL_CONFIG = {
    "input_csv": "your_data_file.csv",   # Original dataset
    "csv_file": "failed_result.csv",     # Failed prediction file
    "json_file": "failed_result.json",   # Failed experiment log
    "override_model_config": {...}       # Model config (or None for auto-extract)
}
```

Then run the retry script:

```bash
python retry_failed_predictions.py
```

Output files:

```
*.csv   # Prediction results with patient IDs, probabilities, and ground truth
*.json  # Experimental logs with prompt-response pairs and model configurations
*.txt   # Performance metrics including F1-Score, AUROC, and AUPRC
```
The framework automatically calculates:
- F1-Score: Harmonic mean of precision and recall
- AUROC: Area Under the Receiver Operating Characteristic Curve
- AUPRC: Area Under the Precision-Recall Curve
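If you want to recompute these metrics yourself from a results CSV, a minimal scikit-learn sketch is shown below; the column names (`ground_truth`, `prediction_probability`) are assumptions and should be adjusted to your result files:

```python
import pandas as pd
from sklearn.metrics import f1_score, roc_auc_score, average_precision_score

# Hypothetical column names; adjust to match the columns in your results CSV.
results = pd.read_csv("results/your_result.csv")
y_true = results["ground_truth"]
y_prob = results["prediction_probability"]
y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities for F1

print("F1-Score:", f1_score(y_true, y_pred))
print("AUROC:   ", roc_auc_score(y_true, y_prob))
print("AUPRC:   ", average_precision_score(y_true, y_prob))
```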
Evaluated LLMs: Qwen3-8B, Qwen3-14B, Qwen3-235B, Llama-3.1-405B, DeepSeek-R1, DeepSeek-V3, Gemini-2-Pro, Gemini-2-Flash, GPT-o1, GPT-o3-mini, GPT-4o, GPT-4o-mini, Claude-4, Claude-3.5-Haiku, Claude-3.7-Sonnet.
The framework incorporates three causal discovery algorithms; implementation code is available in `scripts/Causal_Discovery/`:
- CORL (Causal Ordering via Reinforcement Learning): features in `CORL_F.txt`
- DirectLiNGAM (Direct Linear Non-Gaussian Acyclic Model): features in `DirectLiNGAM_F.txt`
- GES (Greedy Equivalence Search): features in `GES_F.txt`
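As an illustration of running such an algorithm on the preprocessed features, here is a minimal DirectLiNGAM sketch using the `lingam` Python package; whether the repository's notebooks use this particular package, and the filename `mimic_preprocessed.csv`, are assumptions:

```python
import numpy as np
import pandas as pd
from lingam import DirectLiNGAM  # assumes the `lingam` package is installed

# Hypothetical preprocessed dataset; DirectLiNGAM assumes linear, non-Gaussian,
# acyclic relationships, which complex clinical data may violate.
X = pd.read_csv("mimic_preprocessed.csv").select_dtypes(include=[np.number]).dropna()

model = DirectLiNGAM()
model.fit(X)

# adjacency_matrix_[i, j] is the estimated coefficient of feature j on feature i.
print("Estimated causal order (column indices):", model.causal_order_)
print("Adjacency matrix shape:", model.adjacency_matrix_.shape)
```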
- Standard Approaches: Direct prompting, Chain-of-Thought
- Reflective Methods: Self-reflection, role-playing
- Context-Aware: In-context learning with clinical examples
- Data Format Variants: CSV, JSON, LaTeX table, natural language
- Causal-Informed: Feature-filtered approaches using causal discovery
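For the in-context learning strategy, prompts bundle a few labeled example patients with the query patient. The sketch below is a simplified illustration with a hypothetical `build_icl_prompt` helper and toy examples; the actual templates live in `icl_examples/` and `prediction.py`:

```python
# Hypothetical helper illustrating in-context learning prompt assembly.
def build_icl_prompt(examples, query_text, task="DIEINHOSPITAL"):
    parts = [f"Task: predict {task}. Answer with a probability between 0 and 1."]
    for i, (text, label) in enumerate(examples, start=1):
        parts.append(f"Example {i}:\nPatient: {text}\nOutcome: {label}")
    parts.append(f"Now predict for this patient:\nPatient: {query_text}\nOutcome:")
    return "\n\n".join(parts)

# Toy examples for demonstration only.
examples = [
    ("age 81, mechanical ventilation, lactate 5.2 mmol/L", 1),
    ("age 45, elective admission, stable vital signs", 0),
]
print(build_icl_prompt(examples, "age 67, suspected sepsis, rising creatinine"))
```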
The system utilizes structured feature definitions in `selected_features.txt` (see the sketch after this list):
- Basic Features: Demographics and admission characteristics
- Diagnostic Features: ICD-based diagnostic codes
- Procedural Features: Medical procedures and interventions
- Medication Features: Pharmacological treatments
- Time-Series Features: Physiological measurements and vital signs
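To see which of these features a run actually uses, the sketch below reads the feature selection configuration and intersects it with the dataset columns; it assumes `selected_features.txt` lists one feature name per line and uses the hypothetical filename `mimic_preprocessed.csv`:

```python
import pandas as pd

# Assumes one feature name per line; the actual file layout may differ.
with open("selected_features.txt") as f:
    selected = [line.strip() for line in f if line.strip() and not line.startswith("#")]

df = pd.read_csv("mimic_preprocessed.csv")  # hypothetical preprocessed dataset
available = [c for c in selected if c in df.columns]
print(f"{len(available)}/{len(selected)} selected features found in the dataset")
```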
This work builds upon the MIMIC-III and MIMIC-IV critical care databases and incorporates established causal discovery algorithms. We acknowledge the contributions of the clinical informatics and machine learning communities in developing these foundational resources.