FaithBench is a benchmark for summarization hallucinations, featuring human-annotated hallucinations in summaries generated by 10 modern LLMs across 8 different model families. The ten LLMs are:
- GPT-4o
- GPT-3.5-turbo
- Llama-3.1-70B
- Gemini-1.5-Flash
- Llama-3.1-8B
- Claude-3.5-Sonnet
- Qwen-2.5-7B
- Phi-3-mini-4k
- Command-R
- Mistral-7B
As of November 2024, FaithBench contains 750 samples.
Human annotations are in `assign/batch_5_src_no_sports/results/batch_{1..16}_annotation.json`.
Each annotation is a span in the summary text that is considered or suspected to be a hallucination, and is labeled with one or more of the following categories:
- Unwanted, which includes the following subcategories:
  - Intrinsic
  - Extrinsic
- Benign
- Questionable
If an annotation is related to a span in the source text, the source span is also included in the annotation. An annotation may also include a note explaining why the human annotator believes or suspects that the summary span is a hallucination.
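These annotation files are plain JSON arrays, so they can be consumed with standard tooling. Below is a minimal Python sketch for loading one batch file and extracting the annotated spans; the field names follow the example record shown next, and `summary_start`/`summary_end` are character offsets into the summary string (in the example record, `summary[78:88] == "production"`).

```python
import json

# Minimal sketch: load one exported batch file and print every annotated span.
# The path follows the nomenclature above; field names follow the example record below.
path = "assign/batch_5_src_no_sports/results/batch_1_annotation.json"
with open(path, encoding="utf-8") as f:
    samples = json.load(f)  # a JSON array of sample objects

for sample in samples:
    summary = sample["summary"]
    for annot in sample.get("annotations", []):
        # summary_start/summary_end are character offsets into the summary string
        span = summary[annot["summary_start"]:annot["summary_end"]]
        print(sample["sample_id"], annot["label"], repr(span))
```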
The format of the annotation JSON file is as follows:
[
  {
    "sample_id": 0,
    "source": "Poseidon (film) . Poseidon grossed $ 181,674,817 at the worldwide box office on a budget of $ 160 million .",
    "summary": " The film \"Poseidon\" grossed $181,674,817 at the worldwide box office, with a production budget of $160 million.",
    "annotations": [
      {
        "annot_id": 1,
        "sample_id": 0,
        "annotator": "a3ac21668e6249b7978617da547f2708",
        "label": [
          "Unwanted",
          "Unwanted.Instrinsic"
        ],
        "note": "\"budget\" (source) vs. \"production budget\" (summary)\nThe budget for a movie may also include non-production budget such as distribution, advertising. ",
        "annotator_name": "XXXX",
        "summary_span": "production",
        "summary_start": 78,
        "summary_end": 88
      },
      {
        "annot_id": 60,
        "sample_id": 0,
        "annotator": "69a785fa7f454e7da5eef3c608b2133a",
        "label": [
          "Unwanted",
          "Unwanted.Instrinsic"
        ],
        "note": "\"budget\" (source) vs. \"production budget\" (summary) The budget for a movie may also include non-production budget such as distribution, advertising. ",
        "annotator_name": "XXXX",
        "summary_span": "production",
        "summary_start": 78,
        "summary_end": 88
      }
    ],
    "meta_model": "mistralai/Mistral-7B-Instruct-v0.3",
    "meta_hhemv1": 0.9995,
    "meta_hhem-2.1": 0.52694,
    "meta_hhem-2.1-english": 0.98313,
    "meta_trueteacher": 1,
    "meta_true_nli": 1,
    "meta_gpt-3.5-turbo": 1,
    "meta_gpt-4-turbo": 1,
    "meta_gpt-4o": 1,
    "meta_sample_id": 15
  },
  ...
]

The data collection and annotation process has three steps:

- Identify samples worth human annotation: Our data collection starts from Vectara's Hallucination Leaderboard, where summaries have been generated by dozens of LLMs from over one thousand news articles. We then used 4 hallucination detectors (Vectara's HHEM-1 and HHEM-2.1, and Google's TrueTeacher and TrueNLI) to predict hallucination scores for the summaries, and selected the samples on which the 4 detectors disagree most for human annotation. Our rationale is that samples on which the detectors unanimously agree are easy cases that are not worth human annotation. (A selection sketch follows this list.)
- Distribute human annotation tasks: Samples are then distributed to human annotators via Mercury, a platform that we developed for this purpose. Each sample is annotated by 2-3 annotators.
- Collect human annotations and analyze them: The annotations are stored by Mercury as SQLite files. We then fetch them, export them to JSONL format, and finally analyze them.
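As a concrete illustration of the first step, here is a minimal sketch of ranking samples by detector disagreement. The 0.5 binarization threshold and the minority-fraction measure are illustrative assumptions, not necessarily the exact criterion used in `data_collection_scripts`.

```python
# Minimal sketch: rank samples by how much the 4 detectors disagree.
# The scores and the 0.5 threshold here are illustrative assumptions.
def disagreement(scores, threshold=0.5):
    """Fraction of detectors in the minority after binarizing each score.
    0.0 = unanimous (easy case), 0.5 = maximally split (hard case)."""
    votes = [score >= threshold for score in scores]
    positive = sum(votes) / len(votes)
    return min(positive, 1 - positive)

# Hypothetical per-sample scores from HHEM-1, HHEM-2.1, TrueTeacher, TrueNLI.
samples = [
    {"sample_id": 0, "scores": [0.99, 0.53, 1.0, 1.0]},
    {"sample_id": 1, "scores": [0.10, 0.95, 0.05, 0.90]},
]

# The most disagreed-upon samples are the candidates for human annotation.
ranked = sorted(samples, key=lambda s: disagreement(s["scores"]), reverse=True)
print([s["sample_id"] for s in ranked])  # -> [1, 0]
```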
The repository is organized as follows, in the order in which the code is executed:
- Folder `backup_data_with_detector_results`: the summaries generated by the 10 selected LLMs and the hallucination scores predicted by the 4 hallucination detectors
- Folder `data_collection_scripts`: scripts for collecting predictions from various detectors and selecting hard samples for annotation
- Folder `assign`: data and scripts for human annotation (generation of annotation tasks, distribution of tasks, collection of annotations, and analysis of annotations)
  - File `examples_to_annotate.csv`: the samples selected for human annotation
  - Folder `pilot`: data and scripts for the pilot annotation
  - Folder `batch_5_src_no_sports`: the main annotation batches
    - Folder `results`: the annotation results, as both SQLite database files and exported JSONL files. Filename nomenclature: `batch_{1..16}_annotation.json` and `batch_{1..16}.sqlite`.
    - Scripts `{ingest, fetch_db, dump_db}.sh`: scripts to ingest (including embedding and vector-database steps), fetch, and dump the SQLite database files
  - Script `create_batch.py`: slices the data into batches of 50 samples each, excluding sports-related samples
  - Script `generate_faithbench.ipynb`: notebook for generating `FaithBench.csv` from all annotations without filtering; maps the labels on text spans to labels on samples using the "worst-pooling" and "best-pooling" strategies (a pooling sketch follows this list)
  - Script `dataset_stats.ipynb`: computes basic statistics of the dataset
- Folder `eval`: data files and scripts for analysis, focusing on sentence-level and sample-level evaluations
  - Folder `sent_level_results`: data files for sentence-level evaluation
  - Scripts `eval_{detectors, detectors_sent_level, llm}.ipynb`: evaluation of the performance of hallucination detectors (at sample level and sentence level) and of LLMs
  - Script `interannotator_agreement.ipynb`: calculates inter-annotator agreement on the annotations
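For reference, below is a minimal sketch of how span-level labels can be pooled into a sample-level label. The severity ordering and the pooling granularity (per annotation rather than per annotator) are illustrative assumptions; `generate_faithbench.ipynb` is the authoritative implementation.

```python
# Minimal sketch of mapping span-level labels to a sample-level label via
# "worst-pooling" and "best-pooling". The severity ordering below is an
# illustrative assumption; generate_faithbench.ipynb is the reference.
SEVERITY = {"Consistent": 0, "Benign": 1, "Questionable": 2, "Unwanted": 3}

def top_level(labels):
    """Reduce an annotation's label list, e.g. ["Unwanted", "Unwanted.Intrinsic"],
    to its most severe top-level category."""
    tops = [label.split(".")[0] for label in labels]
    return max(tops, key=lambda l: SEVERITY.get(l, 0))

def pool(sample, strategy="worst"):
    """Sample-level label pooled over all span-level annotations on the sample."""
    annotations = sample.get("annotations", [])
    if not annotations:
        return "Consistent"  # no annotator flagged any span in the summary
    labels = [top_level(a["label"]) for a in annotations]
    pick = max if strategy == "worst" else min
    return pick(labels, key=lambda l: SEVERITY[l])
```

Under worst-pooling, a single Unwanted span makes the whole summary Unwanted; best-pooling is the lenient counterpart.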