UNIT-Ref is a novel instruction fine-tuning (IFT) paradigm that addresses hallucination by teaching LLMs to recognize their uncertainty and explicitly reflect it at the end of their responses. This is the official repository for our paper.
First, install the Python dependencies:

```bash
python -m venv ccp
source ccp/bin/activate
pip install -r requirements_ccp.txt
deactivate
```

and

```bash
python -m venv vllm
source vllm/bin/activate
pip install -r requirements_vllm.txt
deactivate
```

To create uncertainty-aware training data, we first need to probe the Claim-Conditioned Probability (CCP) uncertainty of a base model on an Instruction Fine-Tuning (IFT) dataset. This step is crucial because it identifies which parts of the model's responses are uncertain; these uncertainty estimates are later used to guide the training process.
For example, to obtain Qwen/Qwen2.5-14B's CCP uncertainty on an IFT dataset (e.g. `lfrqa.jsonl`), download the `lfrqa.jsonl` dataset, put it into the `data_to_probe_ccp` folder, and run:

```bash
bash eval_ccp.sh --model_name Qwen/Qwen2.5-14B --get_ccp_from_response datasets/lfrqa.jsonl
```

This command will:
- Load the specified model (Qwen/Qwen2.5-14B)
- Process each instruction-response pair in the dataset
- Extract atomic claims from the responses
- Calculate CCP uncertainty scores for each claim
You will find the result in `evaluate_database/calibration_result/lfrqa_calibrated_Qwen_Qwen2.5-14B.jsonl`. The output file contains:
- `instruction`: the original instruction from the dataset
- `response`: the model's original response
- `claim_uncertainty`: a list of dictionaries recording each claim and its CCP value
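If you want to inspect the probed uncertainties before moving on, a minimal sketch like the one below can help. It assumes the field layout above; the per-claim key names (`claim`, `ccp`) are assumptions about the schema, not documented names.

```python
import json

# Hypothetical path matching the example above
path = "evaluate_database/calibration_result/lfrqa_calibrated_Qwen_Qwen2.5-14B.jsonl"

with open(path) as f:
    records = [json.loads(line) for line in f]

# Print the claims and CCP scores of the first record; adjust the
# assumed per-claim keys ("claim", "ccp") to whatever the file actually uses.
rec = records[0]
print("Instruction:", rec["instruction"])
for item in rec["claim_uncertainty"]:
    print(f"  CCP={item['ccp']:.3f}  {item['claim']}")
```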
After obtaining CCP uncertainty scores, we need to identify which instructions are information-seeking tasks (as opposed to creative or reasoning tasks). This classification is important because UNIT methods are specifically designed for information-seeking scenarios where factual accuracy is paramount.
```bash
cd src
python instruction_classification.py \
    --instruction_file evaluate_database/calibration_result/lfrqa_calibrated_Qwen_Qwen2.5-14B.jsonl \
    --output_file train_data/lfrqa_calibrated_Qwen_Qwen2.5-14B_labelled.jsonl \
    --llm gpt-4o \
    --batch 20
```

This script will:
- Analyze each instruction in the dataset
- Use GPT-4o to classify the type of task each instruction is asking for
- Add classification labels to the dataset
The output file will include two additional fields:
- `primary_tag`: the primary task/genre of the instruction
- `other_tags`: other tasks/genres of the instruction
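As a quick sanity check, you can count how many instructions were labelled as information-seeking. The tag value "Information seeking" below is an assumption about the label vocabulary emitted by `instruction_classification.py`:

```python
import json

# Hypothetical path matching the example command above
path = "train_data/lfrqa_calibrated_Qwen_Qwen2.5-14B_labelled.jsonl"

with open(path) as f:
    records = [json.loads(line) for line in f]

# "Information seeking" is an assumed label name; adjust it to whatever
# values instruction_classification.py actually produces.
info_seeking = [r for r in records if r.get("primary_tag") == "Information seeking"]
print(f"{len(info_seeking)} / {len(records)} instructions labelled as information-seeking")
```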
The UNIT-Reflection dataset teaches models to explicitly acknowledge uncertainty by adding reflection sections to their responses. This approach maintains the original response while reflecting on uncertain claims.
To obtain the UNIT-Reflection Dataset, run:
```bash
cd src
python make_data_ref.py \
    --input_file train_data/lfrqa_calibrated_Qwen_Qwen2.5-14B_labelled.jsonl \
    --output_file train_data/lfrqa_calibrated_Qwen_Qwen2.5-14B_unit_reflect.jsonl
```

This script will:
- Process only the information-seeking instructions from the labelled dataset
- Identify uncertain claims based on CCP scores
- Generate reflection sections that explicitly acknowledge uncertainty
- Create training examples where responses include both the original answer and a reflection on uncertain content
The resulting dataset format includes:
- Original instruction-response pairs
- Added reflection sections highlighting uncertain claims
- Maintained informativeness while increasing honesty
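To make the idea concrete, here is one way such a reflection-augmented training example could be assembled from a response and its uncertain claims. This is an illustrative sketch, not the exact template used by `make_data_ref.py`; the reflection wording and field names are assumptions.

```python
def build_reflection_example(instruction, response, uncertain_claims):
    """Assemble a UNIT-Reflection-style training example.

    The reflection wording below is an illustrative assumption,
    not the exact template used by make_data_ref.py.
    """
    if not uncertain_claims:
        return {"instruction": instruction, "output": response}
    reflection = (
        "\n\nReflection: I am not fully certain about the following claims:\n"
        + "\n".join(f"- {claim}" for claim in uncertain_claims)
    )
    return {"instruction": instruction, "output": response + reflection}


# Usage with hypothetical values:
example = build_reflection_example(
    "Who discovered penicillin?",
    "Penicillin was discovered by Alexander Fleming in 1928 at St Mary's Hospital.",
    ["Penicillin was discovered at St Mary's Hospital."],
)
print(example["output"])
```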
The UNIT-Cut dataset takes a different approach by removing uncertain content entirely, creating more conservative but highly truthful responses. This method prioritizes truthfulness over informativeness.
To obtain the UNIT-Cut Dataset, run:
```bash
cd src
export OPENAI_API_KEY=xxx
python make_data_cut.py \
    --input_file evaluate_database/calibration_result/lima_calibrated_Qwen_Qwen2.5-14B_labelled.jsonl \
    --output_file train_data/lima_calibrated_Qwen_Qwen2.5-14B_unit_cut.jsonl \
    --llm gpt-4o
```

This script will:
- Identify uncertain claims in the original responses using CCP scores
- Use GPT-4o to intelligently remove uncertain content while maintaining coherence
- Ensure the remaining content flows naturally after uncertain parts are removed
- Create training examples with conservative, high-confidence responses
The resulting dataset features:
- Shortened responses with uncertain content removed
- Higher truthfulness at the cost of some informativeness
- Maintained response quality and coherence
Note: Both UNIT-Reflection and UNIT-Cut datasets require the previous steps (CCP probing and instruction classification) to be completed first, as they depend on uncertainty scores and information-seeking labels to create effective training data.
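Both pipelines ultimately reduce to splitting claims into "certain" and "uncertain" by their CCP scores. The released datasets use the 75th quantile as the split point (see the dataset notes at the end of this README); a minimal sketch of that thresholding, assuming `claim`/`ccp` keys inside `claim_uncertainty` and a dataset-level percentile, could look like this:

```python
import json
import numpy as np

# Hypothetical labelled file from the classification step
path = "train_data/lfrqa_calibrated_Qwen_Qwen2.5-14B_labelled.jsonl"
with open(path) as f:
    records = [json.loads(line) for line in f]

# Pool all CCP scores and take the dataset-level 75th percentile,
# mirroring the quantile-75 split used for the released datasets.
all_ccp = [c["ccp"] for r in records for c in r["claim_uncertainty"]]
threshold = float(np.percentile(all_ccp, 75))

for rec in records:
    rec["uncertain_claims"] = [c["claim"] for c in rec["claim_uncertainty"] if c["ccp"] >= threshold]
    rec["certain_claims"] = [c["claim"] for c in rec["claim_uncertainty"] if c["ccp"] < threshold]
```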
We used the Alignment Handbook to fine-tune our models on the datasets created by UNIT-Ref and UNIT-Cut. Sample training configs containing all the training hyperparameters we used can be found in the `training_configs` folder.
To evaluate the fine-tuned checkpoints with CCP Balanced Accuracy on the Biography (bio) or WildHalu datasets, run:

```bash
cd src/bash_scripts
bash eval_ccp.sh --model_name your_checkpoint_path --test_data [bio|wildhalu] --output_files results/qwen25_limalfrqa01_reflect_bio.jsonl
```

- `model_name`: path to your fine-tuned checkpoint
- Output file: `results/qwen25_limalfrqa01_reflect_bio.jsonl`

Useful fields in the output file:
- `instruction`: instructions from the bio or wildhalu dataset
- `response`: response of the target checkpoint
- `claim_uncertainty`: atomic claims and their CCP uncertainty scores
- `reflected_answer_claims`: answer claims that are marked uncertain by the reflection section
- `unreflected_answer_claim`: answer claims NOT flagged by the reflection section

CCP-based scores (CCP Diff, CCP Honesty) will be printed at the end of the script.
Next, to evaluate honesty and truthfulness via fact-checking, run:

```bash
# Here we use limalfrqa01_reflect as an example; other settings can be evaluated similarly
cd src/bash_scripts
bash eval_honesty.sh --input_file results/qwen25_limalfrqa01_reflect_bio.jsonl --output_file results/qwen25_limalfrqa01_reflect_bio_honesty.jsonl --database enwiki
```

This command:
- Changes to the `src/bash_scripts` directory
- Runs `eval_honesty.sh` with the following parameters:
  - `--input_file`: the output file from the CCP evaluation (here, the Qwen2.5 model's results on the biography dataset)
  - `--output_file`: where to save the honesty evaluation results
  - `--database`: `enwiki` for FactScore or `wildhalu` for WildFactScore
The script will:
- Check if the fact-check database is downloaded, and download it if needed
- Run fact-checking on the input file
- Calculate truthfulness and honesty scores
- Compute the upper bound of the honesty balanced accuracy score
The output will include:
- Truthfulness score (printed as "Overall Acc.")
- Honesty balanced accuracy score (printed as "Macro Avg.")
- Upper bound of the honesty balanced accuracy score, given the ground truth CCP rank and the best possible CCP threshold.
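For intuition about the "Macro Avg." number, balanced accuracy over the two claim groups can be computed roughly as sketched below. This is not the repository's implementation; it simply assumes each claim carries a fact-check label and a flag for whether the reflection section marked it as uncertain.

```python
from sklearn.metrics import balanced_accuracy_score

# Hypothetical per-claim annotations:
#   is_false     -- 1 if fact-checking found the claim unsupported
#   is_reflected -- 1 if the model's reflection section flagged the claim
is_false     = [1, 0, 0, 1, 0, 1]
is_reflected = [1, 0, 1, 1, 0, 0]

# Honesty as balanced accuracy: ideally the model reflects exactly the
# claims that turn out to be false (assumed definition, for illustration).
print("Honesty balanced accuracy:", balanced_accuracy_score(is_false, is_reflected))
```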
To evaluate helpfulness against a reference model, run:

```bash
# Helpfulness evaluation requires OpenAI LLM access
export OPENAI_API_KEY=xxx
# Here we use limalfrqa01_reflect as an example; other settings can be evaluated similarly
cd src/bash_scripts
bash eval_helpfulness.sh --input_file results/qwen25_limalfrqa01_reflect_bio.jsonl --ref_file results/qwen25_lima_vanilla_bio_atomic_fc.jsonl --output_file results/qwen25_limalfrqa01_reflect_bio_helpfulness.xlsx --dataset bio
```

This command:
- Changes to the `src/bash_scripts` directory
- Runs `eval_helpfulness.sh` with the following parameters:
  - `--input_file`: path to the output file from the CCP evaluation
  - `--ref_file`: path to the reference file to compare against (here, Qwen2.5 fine-tuned on LIMA without UNIT)
  - `--output_file`: where to save the helpfulness evaluation results (must end with .xlsx)
  - `--dataset`: `bio` or `wildhalu`
The script will:
- Compare the responses from both files using GPT-4o to evaluate helpfulness
- Calculate a helpfulness score based on the comparisons
- Save detailed evaluation results in the specified Excel file
The output will include:
- A helpfulness score (printed as "Score")
- A detailed Excel file containing:
  - The original prompts
  - Responses compared with position swapping
  - The judge's evaluation and reasoning
  - The final verdict for each comparison
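The "position swapping" mentioned above is the usual guard against position bias in pairwise LLM judging: each pair is judged twice with the answer order flipped, and only consistent verdicts count as wins. A minimal sketch of that aggregation (not the repository's code; `judge` is a stand-in for the GPT-4o call) might look like:

```python
def aggregate_position_swapped(judge, prompt, answer_a, answer_b):
    """Judge a pair twice with swapped order and combine the verdicts.

    `judge(prompt, first, second)` is a hypothetical function returning
    "first", "second", or "tie" for which answer is more helpful.
    """
    v1 = judge(prompt, answer_a, answer_b)   # A shown first
    v2 = judge(prompt, answer_b, answer_a)   # B shown first
    if v1 == "first" and v2 == "second":
        return "A wins"
    if v1 == "second" and v2 == "first":
        return "B wins"
    return "tie"  # inconsistent or tied verdicts are treated as a tie
```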
For UNIT_CUT, you don't need to run the CCP evaluation, but you do need to extract atomic claims first for fact-checking:
```bash
# Atomic claim extraction requires OpenAI LLM access
export OPENAI_API_KEY=xxx
# Here we use limalfrqa01_cut as an example; other settings can be evaluated similarly
# Step 1: Evaluate truthfulness
cd src/bash_scripts
conda activate vllm
bash eval_truthfulness.sh --model_path PATH_TO_limalfrqa01_cut_CHECKPOINT --setting_name limalfrqa01_cut
```

This command:
- Changes to the `src/bash_scripts` directory
- Runs `eval_truthfulness.sh` with the following parameters:
  - `--model_path`: path to your UNIT_CUT checkpoint
  - `--setting_name`: name of the setting (in this case "limalfrqa01_cut")
The script will:
- Run inference using `--model_path` (example output file: `results/qwen25_limalfrqa01_cut_bio.csv`)
- Extract atomic claims from the model's responses (example output file: `results/qwen25_limalfrqa01_cut_bio_atomic.jsonl`)
- Use the English Wikipedia database for fact-checking (example output file: `results/qwen25_limalfrqa01_cut_bio_atomic_fc.jsonl`)
- Print truthfulness scores as "Macro Avg."
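As a rough mental model of the "Macro Avg." truthfulness number, one common definition is the per-response fraction of supported atomic claims, averaged over responses. The sketch below assumes that definition and a `claims` list with a boolean `supported` field per record in the fact-checked JSONL; both are assumptions about the file schema, not documented behavior.

```python
import json

# Hypothetical fact-checked file from the previous step
path = "results/qwen25_limalfrqa01_cut_bio_atomic_fc.jsonl"
with open(path) as f:
    records = [json.loads(line) for line in f]

per_response = []
for rec in records:
    claims = rec.get("claims", [])
    if claims:
        # Fraction of atomic claims judged supported by the fact-checker
        per_response.append(sum(c["supported"] for c in claims) / len(claims))

print("Macro-averaged truthfulness:", sum(per_response) / len(per_response))
```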
```bash
# Helpfulness evaluation requires OpenAI LLM access
export OPENAI_API_KEY=xxx
# Step 2: Evaluate helpfulness
bash eval_helpfulness.sh --input_file results/qwen25_limalfrqa01_cut_bio_atomic_fc.jsonl --ref_file results/qwen25_lima_vanilla_bio_atomic_fc.jsonl --output_file results/qwen25_limalfrqa01_cut_bio_helpfulness.xlsx --dataset bio
```

This runs `eval_helpfulness.sh`, which:
- Compares the UNIT_CUT model's responses with the vanilla model's responses
- Uses GPT-4o to evaluate helpfulness
- Outputs detailed comparison results to an Excel file
- Calculates and displays a helpfulness score
This repository contains two types of datasets:
- In the `datasets` folder, we provide:
  - The original LIMA and LFRQA datasets.
  - LIMA and LFRQA with their probed CCP values w.r.t. Llama3.1-8B and Qwen2.5-14B; uncertain and certain claims split at the 75th quantile are also included in these datasets.
- In `src/inference_data`, we provide the WildHalu (the 500-example subset we used) and Biography datasets for evaluation.