RareDAI is an advanced LLM technique, fine-tuned from Llama 3.1 models, designed to help genetic counselors and patients choose the most appropriate molecular genetic test, such as a gene panel or WES/WGS, through clear and comprehensive explanations. The model in the paper was fine-tuned using data from the Children's Hospital of Philadelphia (CHOP). Because these data contain protected health information, we cannot publicly release the model; however, we provide guidelines for adapting or fine-tuning LLMs on in-house data. The model accepts clinical notes and Phecodes (converted from ICD-10) as input. You can fine-tune your own model with additional details (such as HPO phenotypes or demographics); however, additional information is likely to be useful only if it is concise and neither redundant nor irrelevant. This process is described in the sections below.
RareDAI is distributed under the MIT License by Wang Genomics Lab.
- Clone this repository and navigate to the RareDAI folder
git clone https://github.com/WGLab/RareDAI.git
cd RareDAI
- Install all dependencies
conda create -n raredai python=3.11
conda activate raredai
conda install pandas numpy scikit-learn matplotlib seaborn requests
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
conda install -c "nvidia/label/cuda-12.8" cuda-toolkit
conda install -c nvidia cuda-compiler
conda install -c conda-forge jupyter
conda install intel-openmp blas mpi4py
conda install -c anaconda ipykernel
pip install transformers datasets
pip install fastobo sentencepiece einops protobuf
pip install evaluate sacrebleu scipy accelerate deepspeed
pip install git+https://github.com/huggingface/peft.git
# Please load the CUDA module in your environment before installing the FlashAttention package, for example:
module load CUDA/12.1.1
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.7cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip install xformers
pip install bitsandbytes
python -m ipykernel install --user --name=raredai
In the commands above, we use the accelerate package for model sharding, the PEFT package for efficient fine-tuning methods such as LoRA, and the bitsandbytes package for model quantization. If you encounter any issues at runtime, try removing the affected package with pip uninstall and reinstalling it with pip install.
1) Fine-tuned 8B (16bit) with CoT + ICD10 on summarized clinical notes.
2) Fine-tuned 70B (4bit QLoRA) with CoT + ICD10 + Phenotypes on raw clinical notes.
3) Fine-tuned 8B (16bit) with CoT + ICD10 on raw clinical notes.
- To use the Llama 3.1 8B (or 70B) model, please apply for access first and download the model to a local drive.
- Save the model in Llama3_1/Meta-Llama-3.1-8B-Instruct (the folder should contain the tokenizer and model weights)
- Input:
  - Input files should be JSON files containing the keys "input", "hpo", "icd", and "mrn" for inference, plus "output" for fine-tuning. You will also need to generate a synthetic summary ("summary") and Chain-of-Thought (CoT).
  - The input can be either a single JSON file or a directory containing all input JSON files
- Download additional databases:
  - Phecode ICD10: download phecode_definitions1.2.csv and Phecode_map_v1_2_icd10_beta.csv. Both files are also provided in this GitHub repository.
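The "single JSON file or directory" input convention described above can be illustrated with a small loader. This is only a sketch, not the repository's own loading code: the function name `load_records` is hypothetical, and it assumes each file holds either one record (a JSON object) or a list of records.

```python
import json
from pathlib import Path


def load_records(path):
    """Load patient records from one JSON file or from every *.json file in a directory."""
    path = Path(path)
    # A directory is expanded to its JSON files; a file path is used as-is.
    files = sorted(path.glob("*.json")) if path.is_dir() else [path]
    records = []
    for file in files:
        with open(file, encoding="utf-8") as handle:
            data = json.load(handle)
        # Assumption: a file contains either a single record or a list of records.
        records.extend(data if isinstance(data, list) else [data])
    return records
```

Calling `load_records("input.json")` or `load_records("input_dir/")` would then return one flat list of record dictionaries in both cases.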
Because our data contain protected health information, we cannot share our fine-tuned models publicly. However, we provide the code and detailed instructions so that you can replicate our process with minimal effort.
The fine-tuning process is divided into four stages:
- Data collection: please process and clean your own data before fine-tuning or inference. You can refer to our paper to see how our notes were selected. Save all the features (input, icd, mrn, phenotypes, output) in a JSON file for each patient.
  - The input file should be a list of JSON records, each with the input, icd, mrn, phenotypes, and output fields.
  - Make sure your ICD-10 codes are converted to Phecodes and separated by "|". You can use some of the code in our scripts to process the data.
    - Example: Intestinal infection | Fractures | joint disorders and dislocations; trauma-related
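As an illustration of the record layout and the "|"-separated Phecode string described above, here is a minimal sketch. The lookup table uses placeholder ICD-10 keys rather than entries from the real Phecode map, the function names are hypothetical, and the types of the "phenotypes" and "output" values are assumptions.

```python
# Placeholder lookup: the keys below are NOT real ICD-10 codes. In practice the
# mapping would be built from the Phecode map and definition CSV files.
TOY_ICD10_TO_PHECODE = {
    "ICD10_CODE_A": "Intestinal infection",
    "ICD10_CODE_B": "Fractures",
}


def to_phecode_string(icd10_codes, lookup):
    """Map ICD-10 codes to Phecode descriptions and join them with ' | '.

    Unmapped codes are skipped and repeated descriptions are kept only once.
    """
    descriptions = []
    for code in icd10_codes:
        description = lookup.get(code)
        if description and description not in descriptions:
            descriptions.append(description)
    return " | ".join(descriptions)


def build_record(mrn, note, icd10_codes, phenotypes, output, lookup):
    """Assemble one patient record with the fields listed above."""
    return {
        "mrn": mrn,
        "input": note,
        "icd": to_phecode_string(icd10_codes, lookup),
        "phenotypes": phenotypes,  # assumption: a plain string of phenotype terms
        "output": output,          # assumption: the recommended test as text
    }
```

A list of such records, written with `json.dump`, would give one input file in the format described above.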
- Generate summaries for the training and validation datasets (only required if you want to fine-tune models on summarized notes).
  - Please modify the necessary SLURM arguments in run_summary.sh to run generate_summary.py
  - Provide the directory containing your input JSON files and the output path where the synthetic data should be saved. Each resulting JSON record should have an additional "summary" key (including testing data). Make sure your input file has the correct keys as described above.
  - Sample run:
sbatch -p gpuq --gres=gpu:a100:2 --cpus-per-gpu=3 --mem-per-cpu=50G --time=3-00:00:00 --profile=all --export=ALL --wrap="bash run_summary.sh -i input.json -o output.json -model_path foundation_model_path"
At this point, you should split your data into train:validation:test sets (ratio of 6:2:2). You only need to generate synthetic CoT for your train and validation data. You can fine-tune the model either on raw clinical notes (data_point['input']) or on the summary notes you generated above (data_point['summary']).
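The 6:2:2 split can be sketched in a few lines of standard-library Python. The function name `split_records`, the fixed seed, and the rounding rule are assumptions made for illustration; the repository may split the data differently.

```python
import random


def split_records(records, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the records reproducibly and split them into train, validation, and test lists."""
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test
```

Only the train and validation lists would then be passed on to CoT generation, while the test list is held out.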
- Generate synthetic CoT for the training and validation datasets.
  - Please modify the necessary SLURM arguments in run_syntheticCOT.sh to run generate_syntheticCOT.py
  - Provide the directory containing your training/validation JSON files as the input and the output path where the synthetic data should be saved. Each resulting JSON record should have an additional "cot" key (except testing data). Make sure your input file has the correct keys as described above.
  - Provide the foundation model (e.g., Llama 3.1 70B) path with -model_path
  - Add the flags -hpo for "Phenotype", -icd for "ICD10/Phecode (text-based)", and -summary for "the summarized notes".
  - Sample run:
sbatch -p gpuq --gres=gpu:a100:1 --cpus-per-gpu=3 --mem-per-cpu=50G --time=3-00:00:00 --profile=all --export=ALL --wrap="bash run_syntheticCOT.sh -i your_input_file.json -o your_output_file.json -model_path foundation_model_path -hpo -icd"
- Fine-tune the model with the synthetic CoT generated in stage 3.
  - The resulting file is a list of JSON records, which matches the format expected by the fine-tuning script.
  - Please modify the necessary SLURM arguments in run_RareDAI.sh to run RareDAI_finetuning.py
  - Optional: add the flags -hpo for "Phenotype", -icd for "ICD10/Phecode (text-based)", -cot for "adding cot in answer", -lora for "LoRA training", -qlora for "QLoRA training", and -summary for "the summarized notes".
  - Example:
sbatch -p gpuq --gres=gpu:a100:4 --cpus-per-gpu=2 --mem=300G --job-name=rareDAI_ft --wrap="bash run_RareDAI.sh -train_dir train.json -val_dir val.json -o models/ -model_path llama3.1-8B_folder --icd --cot"
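Before submitting a fine-tuning job, it can help to confirm that every training record carries the fields implied by the chosen flags. The check below is a sketch under assumptions: the function names are hypothetical, the key names ("input", "summary", "hpo", "icd", "cot", "output") follow the input description in this README, and the actual fine-tuning script may validate its input differently.

```python
def required_keys(summary=False, hpo=False, icd=False, cot=False):
    """Return the record keys needed for a given combination of fine-tuning flags."""
    # The note comes from "summary" when summarized notes are used, else from "input".
    keys = {"summary" if summary else "input", "output"}
    if hpo:
        keys.add("hpo")
    if icd:
        keys.add("icd")
    if cot:
        keys.add("cot")
    return keys


def find_incomplete(records, **flags):
    """Return (index, missing_keys) pairs for records with a missing or empty required field."""
    needed = required_keys(**flags)
    problems = []
    for index, record in enumerate(records):
        missing = sorted(key for key in needed if not record.get(key))
        if missing:
            problems.append((index, missing))
    return problems
```

For a run with the ICD and CoT flags, `find_incomplete(train_records, icd=True, cot=True)` returning an empty list would indicate that the file is ready for fine-tuning.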
- If you want to fine-tune the 70B model, you may need to use 4-bit QLoRA with PEFT to reduce the memory footprint of the model, even though higher precision and full-parameter fine-tuning may achieve better results.
- If you encounter CUDA out-of-memory errors, the likely cause is that large input texts exceed your system's capacity for model training. To address this, consider either increasing the number of GPUs and CPUs or adjusting the training parameters by reducing the batch size, optionally increasing the gradient accumulation steps to keep the effective batch size unchanged.
- You can use run_deidentify_notes.sh to de-identify either your raw clinical notes or the CoT explanations.
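The memory notes above can be made concrete with some back-of-the-envelope arithmetic. This is a rough sketch only: it counts the weights alone (no gradients, optimizer states, activations, or quantization overhead), and the batch settings in the example are hypothetical rather than the values used in the paper.

```python
def weight_memory_gb(n_params_billion, bits_per_param):
    """Approximate memory needed for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_param / 8


def effective_batch_size(per_device_batch, grad_accum_steps, n_gpus):
    """Number of samples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * n_gpus


print(weight_memory_gb(8, 16))   # 16.0  (8B model in 16-bit)
print(weight_memory_gb(70, 16))  # 140.0 (70B model in 16-bit)
print(weight_memory_gb(70, 4))   # 35.0  (70B model in 4-bit)

# Halving the per-device batch while doubling gradient accumulation keeps the
# effective batch size the same but lowers the peak activation memory per step.
print(effective_batch_size(4, 2, 4))  # 32
print(effective_batch_size(2, 4, 4))  # 32
```

These numbers suggest why 4-bit quantization is attractive for the 70B model, and why the per-device batch size is the main lever for out-of-memory errors.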
Please fine-tune your own model first, then follow the inference section of inference.py to run it.
You can use RareDAI_codebook.ipynb for real-time inference. Please adjust the input arguments as needed. An example is provided to help you get started; it typically runs in under 5 seconds. The synthetic clinical note included here was generated by ChatGPT. While the content is simulated, the phenotypes, ICD-10 descriptions, and other clinical details are designed to closely mimic data from a real patient for whom genome sequencing was recommended.
Otherwise, please use the following command for inference:
sbatch -p gpuq --gres=gpu:a100:2 --cpus-per-gpu=3 --mem-per-cpu=50G --job-name=RareDAI_inference --wrap="bash run_inference.sh -i ...input.json -o output/ -model_dir your_model_path"

| Argument | Description |
|---|---|
| -i, --input | Required. Path to your input data. Can be a .json, .pkl, or a folder containing .txt or image files. |
| -o, --output | Required. Output directory name. This is where results will be saved. The directory will be created if it does not exist. |
| -model_dir | Required. Path to the fine-tuned model directory (e.g., fully fine-tuned weights or LoRA/QLoRA weights). If not provided, defaults will be used. |

| Argument | Description |
|---|---|
| -lora, --lora | Provide the LoRA-adapted model path. Please use the foundation model for -model_dir if you use LoRA. |
| -qlora, --qlora | Provide the QLoRA-adapted model path. Please use the foundation model for -model_dir if you use QLoRA. |

Quan Minh Nguyen - Bioengineering PhD student at the University of Pennsylvania
Dr. Kai Wang - Professor of Pathology and Laboratory Medicine at the University of Pennsylvania and Children's Hospital of Philadelphia
The paper is in preparation.