RareDAI

RareDAI is an LLM-based technique, fine-tuned from Llama 3.1 models, designed to help genetic counselors and patients choose the most appropriate molecular genetic test, such as a gene panel or WES/WGS, through clear and comprehensive explanations. The model in the paper was fine-tuned on data from the Children’s Hospital of Philadelphia (CHOP). Because the data contain protected health information, we cannot publicly release the model; however, we provide guidelines for adapting or fine-tuning LLMs on in-house data. The model accepts clinical notes and Phecodes (converted from ICD-10) as input. You can fine-tune your own model with additional details (such as HPO phenotypes, demographics, etc.); however, additional information is likely to help only if it is concise and neither redundant nor irrelevant. This process is elaborated in the sections below.

RareDAI is distributed under the MIT License by Wang Genomics Lab.

Installation

Clone this repository and navigate to the RareDAI folder

git clone https://github.com/WGLab/RareDAI.git
cd RareDAI

Install all dependencies

conda create -n raredai python=3.11
conda activate raredai
conda install pandas numpy scikit-learn matplotlib seaborn requests
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
conda install -c "nvidia/label/cuda-12.8" cuda-toolkit
conda install -c nvidia cuda-compiler
conda install -c conda-forge jupyter
conda install intel-openmp blas mpi4py
conda install -c anaconda ipykernel
pip install transformers datasets
pip install fastobo sentencepiece einops protobuf
pip install evaluate sacrebleu scipy accelerate deepspeed
pip install git+https://github.com/huggingface/peft.git
# Load the CUDA module in your environment before installing the flash-attention package. For example:
module load CUDA/12.1.1
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.7cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install xformers
pip install bitsandbytes
conda install -c anaconda ipykernel
python -m ipykernel install --user --name=raredai

In the commands above, the accelerate package is used for model sharding, the PEFT package for parameter-efficient fine-tuning such as LoRA, and the bitsandbytes package for model quantization. If a package causes problems at runtime, try reinstalling it (pip uninstall followed by pip install).

Our top performing Llama models are:

1) Fine-tuned 8B (16bit) with CoT + ICD10 on summarized clinical notes.
2) Fine-tuned 70B (4bit QLoRA) with CoT + ICD10 + Phenotypes on raw clinical notes.
3) Fine-tuned 8B (16bit) with CoT + ICD10 on raw clinical notes.

Foundation Model Download:

To use the Llama 3.1 8B (or 70B) model, please apply for access first and download it to a local drive. Download here
Save the model in Llama3_1/Meta-Llama-3.1-8B-Instruct (the folder should contain the tokenizer and the model weights)

Data Setup

Input:
- Input files should be JSON files containing the keys "input", "hpo", "icd", and "mrn" for inference, plus "output" for fine-tuning. You will also need to generate a synthetic summary ("summary") and a Chain-of-Thought (CoT) explanation, as described in the Fine-tuning section.
- The input can be either a single JSON file or a directory containing all input JSON files.
Download additional databases:
- Phecode ICD10: download phecode_definitions1.2.csv and Phecode_map_v1_2_icd10_beta.csv from the link. Both files are also provided in this GitHub repository.
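
The input requirements above can be checked before running any of the scripts. The following is a minimal sketch, not part of the RareDAI code base: the helper names `load_records` and `missing_keys` are hypothetical, and it assumes each JSON file holds either a single record or a list of records.

```python
import json
from pathlib import Path

# Keys named in this README; "output" is only needed for fine-tuning.
INFERENCE_KEYS = {"input", "hpo", "icd", "mrn"}


def load_records(path):
    """Load records from one JSON file or from every *.json file in a folder."""
    path = Path(path)
    files = sorted(path.glob("*.json")) if path.is_dir() else [path]
    records = []
    for file in files:
        with open(file, encoding="utf-8") as handle:
            data = json.load(handle)
        # Assumption: a file holds either one record (dict) or a list of records.
        records.extend(data if isinstance(data, list) else [data])
    return records


def missing_keys(record, fine_tuning=False):
    """Return the sorted list of required keys that a record lacks."""
    required = INFERENCE_KEYS | ({"output"} if fine_tuning else set())
    return sorted(required - set(record))
```

Calling `missing_keys(record, fine_tuning=True)` on every loaded record and stopping on the first non-empty result is one way to catch malformed files before submitting a long SLURM job.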

Because our training data contain protected health information, we cannot share our fine-tuned models publicly. However, we provide the code and detailed instructions so that you can replicate our process with minimal effort.

Fine-tuning

The fine-tuning process is divided into four stages:

Data collection: please process and clean your own data before fine-tuning or inference. You can refer to our paper to see how our notes were selected. Save all the features (input, icd, mrn, phenotypes, output) in the JSON file for each patient.
- The input file should be a LIST of JSON entries, each with the input, icd, mrn, phenotypes, and output fields.
- Make sure your ICD-10 codes are converted to Phecodes and separated by "|". Our scripts include code you can reuse to process the data.
- Example: Intestinal infection | Fractures | joint disorders and dislocations; trauma-related
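
The Phecode formatting step can be sketched as follows. This is an illustration under stated assumptions rather than the repository's own conversion code: `icd10_to_phecode_text` is a hypothetical helper, and the two lookup tables use placeholder codes. In practice they would be built from Phecode_map_v1_2_icd10_beta.csv and phecode_definitions1.2.csv.

```python
def icd10_to_phecode_text(icd_codes, icd_to_phecode, phecode_to_name):
    """Convert ICD-10 codes to a "|"-separated string of Phecode descriptions."""
    names = []
    for code in icd_codes:
        phecode = icd_to_phecode.get(code)
        name = phecode_to_name.get(phecode)
        # Skip unmapped codes and avoid repeating a description.
        if name and name not in names:
            names.append(name)
    return " | ".join(names)


# Toy lookup tables with placeholder codes (not taken from the real files).
icd_to_phecode = {"ICD_A": "P1", "ICD_B": "P2", "ICD_C": "P1"}
phecode_to_name = {"P1": "Intestinal infection", "P2": "Fractures"}

text = icd10_to_phecode_text(
    ["ICD_A", "ICD_B", "ICD_C", "ICD_X"], icd_to_phecode, phecode_to_name
)
print(text)  # Intestinal infection | Fractures
```

Dropping duplicates keeps the "icd" field concise, in line with the recommendation that extra inputs should not be redundant.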
Generate summaries for the training and validation datasets (only required if you want to fine-tune models on summarized notes).
- Please modify the necessary SLURM arguments in run_summary.sh to run generate_summary.py
- Provide the directory containing your input JSON files and the output path where the synthetic data should be saved. Each resulting JSON record should have an additional "summary" key (including the testing data). Make sure your input file has the correct keys, as described above.
- Sample run:
```
sbatch -p gpuq --gres=gpu:a100:2 --cpus-per-gpu=3 --mem-per-cpu=50G --time=3-00:00:00 --profile=all --export=ALL --wrap="bash run_summary.sh -i input.json -o output.json -model_path foundation_model_path"
```

At this point, you should split your own data into train:validation:test sets (ratio of 6:2:2). You only need to generate synthetic CoT for your training and validation data. You can fine-tune the model either on the raw clinical notes (data_point['input']) OR on the summary notes you generated above (data_point['summary']).
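
A 6:2:2 split can be produced with a few lines of Python. The sketch below is one possible approach, not the procedure used in the paper; `split_records` is a hypothetical helper, and the fixed seed is only there to make the split reproducible.

```python
import random


def split_records(records, ratios=(6, 2, 2), seed=0):
    """Shuffle records and split them into train, validation, and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    # The test set takes whatever remains, so no record is dropped.
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_records([{"mrn": str(i)} for i in range(10)])
print(len(train), len(val), len(test))  # 6 2 2
```

If a patient can contribute more than one note, splitting by "mrn" rather than by note may be preferable so that the same patient does not appear in both training and test data.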

Generate synthetic CoT for training and validation datasets.
- Please modify the necessary SLURM arguments in run_syntheticCOT.sh to run generate_syntheticCOT.py
- Provide the directory containing your training/validation JSON files as the input, and the output path where the synthetic data should be saved. Each resulting JSON record should have an additional "cot" key (except for the testing data). Make sure your input file has the correct keys, as described above.
- Provide the foundation model path (e.g., Llama 3.1 70B) with -model_path
- Add flag -hpo for "Phenotype", -icd for "ICD10/Phecode (text-based)", and -summary for "the summarized notes".
- Sample run:
```
sbatch -p gpuq --gres=gpu:a100:1 --cpus-per-gpu=3 --mem-per-cpu=50G --time=3-00:00:00 --profile=all --export=ALL --wrap="bash run_syntheticCOT.sh -i your_input_file.json -o your_output_file.json -model_path foundation_model_path -hpo -icd"
```
Fine-tune the model with the synthetic CoT generated in stage 3.
- The file produced in stage 3 is a LIST of JSON entries, which matches the format expected by the fine-tuning script.
- Please modify the necessary SLURM arguments in run_RareDAI.sh to run RareDAI_finetuning.py
- Optional: Add flag -hpo for "Phenotype", -icd for "ICD10/Phecode (text-based)", -cot for "adding cot in answer", -lora for "LoRA training", -qlora for "QLoRA training", and -summary for "the summarized notes".
- Example:
```
sbatch -p gpuq --gres=gpu:a100:4 --cpus-per-gpu=2 --mem=300G --job-name=rareDAI_ft --wrap="bash run_RareDAI.sh -train_dir train.json -val_dir val.json -o models/ -model_path llama3.1-8B_folder -icd -cot"
```
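
To make the effect of the optional flags concrete, the sketch below shows how -hpo, -icd, -summary, and -cot could select fields from one record. It is purely illustrative: `assemble_example` and the prompt layout are hypothetical, and the actual template used by RareDAI_finetuning.py may differ. Only the record keys ("input", "summary", "hpo", "icd", "cot", "output") come from this README.

```python
def assemble_example(record, hpo=False, icd=False, summary=False, cot=False):
    """Build a (prompt, target) pair from one record; the layout is hypothetical."""
    # -summary switches from the raw note to the generated summary.
    note = record["summary"] if summary else record["input"]
    parts = [f"Clinical note: {note}"]
    if icd:
        parts.append(f"Phecodes: {record['icd']}")
    if hpo:
        parts.append(f"Phenotypes: {record['hpo']}")
    target = record["output"]
    if cot:
        # -cot places the synthetic explanation before the final answer.
        target = f"{record['cot']}\n{target}"
    return "\n".join(parts), target


record = {
    "input": "raw note",
    "summary": "short note",
    "hpo": "Seizure",
    "icd": "Fractures",
    "cot": "reasoning",
    "output": "Genome sequencing",
}
prompt, target = assemble_example(record, icd=True, cot=True)
```

With `icd=True` and `cot=True`, the prompt contains the raw note plus the Phecode text, and the target contains the CoT explanation followed by the recommended test.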

If you want to fine-tune the 70B model, you may need to use 4-bit QLoRA with PEFT to reduce the memory footprint, even though higher precision and full-parameter fine-tuning may achieve better results.
If you encounter CUDA out-of-memory errors, the likely cause is long input texts exceeding the memory available for training. To address this, either increase the number of GPUs and CPUs, or adjust the training parameters by reducing the batch size and increasing the gradient accumulation steps.
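
The trade-off between batch size and gradient accumulation can be worked out with simple arithmetic: the effective batch size is the per-device batch size times the accumulation steps times the number of GPUs. The helpers below are a hypothetical sketch for planning these values, not settings taken from run_RareDAI.sh.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Number of samples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


def accumulation_steps_for(target_effective_batch, per_device_batch, num_gpus):
    """Accumulation steps needed to reach at least the target effective batch."""
    return math.ceil(target_effective_batch / (per_device_batch * num_gpus))


# Example: a per-device batch of 4 on 4 GPUs with 2 accumulation steps gives 32.
before = effective_batch_size(4, 2, 4)
# Dropping the per-device batch to 1 needs 8 accumulation steps to stay at 32.
steps = accumulation_steps_for(before, 1, 4)
print(before, steps)  # 32 8
```

Lowering the per-device batch size reduces peak memory per GPU, while the extra accumulation steps keep the effective batch size unchanged at the cost of longer wall-clock time per optimizer step.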
You can use run_deidentify_notes.sh to de-identify either your raw clinical notes or the CoT explanations.

Inference

Please fine-tune your own model first, then follow the inference section of inference.py to run your model.

You can use RareDAI_codebook.ipynb for real-time inference. Please adjust the input arguments as needed. An example is provided to help you get started; it typically runs in under 5 seconds. The synthetic clinical note included here was generated by ChatGPT. While the content is simulated, the phenotypes, ICD-10 descriptions, and other clinical details are designed to closely mimic data from a real patient who was recommended for genome sequencing.

Otherwise, please use the following command for inference:

sbatch -p gpuq --gres=gpu:a100:2 --cpus-per-gpu=3 --mem-per-cpu=50G --job-name=RareDAI_inference --wrap="bash run_inference.sh -i ...input.json -o output/ -model_dir your_model_path"

Required Arguments

| Argument | Description |
| --- | --- |
| `-i`, `--input` | Required. Path to your input data. Can be a `.json`, `.pkl`, or a folder containing `.txt` or image files. |
| `-o`, `--output` | Required. Output directory name. This is where results will be saved. The directory will be created if it does not exist. |
| `-model_dir` | Required. Path to the fine-tuned model directory (e.g. a full fine-tuned or LoRA/QLoRA weights). If not provided, defaults will be used. |

Optional Arguments

| Argument | Description |
| --- | --- |
| `-lora`, `--lora` | Path to the LoRA-adapted model. Use the foundation model for `-model_dir` if you use LoRA. |
| `-qlora`, `--qlora` | Path to the QLoRA-adapted model. Use the foundation model for `-model_dir` if you use QLoRA. |

Developers:

Quan Minh Nguyen - Bioengineering PhD student at the University of Pennsylvania

Dr. Kai Wang - Professor of Pathology and Laboratory Medicine at the University of Pennsylvania and Children's Hospital of Philadelphia

Citation

The paper is in preparation.
