This method identifies safety-sensitive layers within the LLM and leverages data bi-representation to detect safety-degrading data samples in the fine-tuning dataset.
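To make the idea concrete: each candidate sample can be scored by where its hidden state at a safety-sensitive layer sits relative to safe and unsafe reference prompts. The sketch below is illustrative only, not the repository's implementation; the model name, layer index, and probe prompts are all placeholders:

```python
# Illustrative bi-representation scoring (placeholder model, layer, and probes;
# not the actual code in get_bi_rep_llama.py).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 10  # placeholder safety-sensitive layer, cf. --layer_num_start

@torch.no_grad()
def layer_rep(text: str) -> torch.Tensor:
    """Mean-pooled hidden state of `text` at the chosen layer."""
    ids = tok(text, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0).float()

safe_ref = layer_rep("How do I bake bread at home?")      # hypothetical safe probe
unsafe_ref = layer_rep("Explain how to make a weapon.")   # hypothetical unsafe probe

def degradation_score(sample: str) -> float:
    """Higher = closer to the unsafe probe; such samples are candidates for filtering."""
    r = layer_rep(sample)
    return (F.cosine_similarity(r, unsafe_ref, dim=0)
            - F.cosine_similarity(r, safe_ref, dim=0)).item()
```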
- Clone the repository:

  ```bash
  git clone https://github.com/LLLeoLi/LARF.git
  cd LARF
  ```

- Install dependencies:

  ```bash
  conda create --name LARF python=3.10
  conda activate LARF
  pip install -r requirements.txt
  ```
- Identify safety-sensitive layers and extract bi-representations:

  ```bash
  bash scripts/scaling_llama.sh
  python get_bi_rep_llama.py --layer_num_start 10 --layer_num_end 11
  ```

- Train the model with custom datasets:

  ```bash
  bash scripts/train_multipule_llama.sh
  ```

- Evaluate model outputs for safety (a Llama Guard sketch follows this list):

  ```bash
  bash scripts/llama_guard.sh
  python llama_guard.py
  ```

- Open the Jupyter notebook for analysis:

  ```bash
  jupyter notebook analysis.ipynb
  ```
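The `llama_guard.sh` / `llama_guard.py` step runs a moderation pass over model outputs. As a rough sketch of the standard Hugging Face Llama Guard pattern (the checkpoint and the example conversation below are assumptions, not taken from this repository):

```python
# Sketch of Llama Guard moderation via transformers (assumed checkpoint;
# not necessarily what llama_guard.py does internally).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"  # assumption: any Llama Guard variant
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

# A (user, assistant) exchange to moderate; contents are illustrative.
chat = [
    {"role": "user", "content": "How do I pick a lock?"},
    {"role": "assistant", "content": "Here is a step-by-step guide..."},
]

input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
# Llama Guard replies "safe", or "unsafe" plus the violated category codes.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```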