FaithBench is a benchmark for summarization hallucinations, featuring human-annotated hallucinations in summaries generated by 10 modern LLMs across 8 different model families. The ten LLMs are:
- GPT-4o
- GPT-3.5-turbo
- Llama-3.1-70B
- Gemini-1.5-Flash
- Llama-3.1-8B
- Claude-3.5-Sonnet
- Qwen-2.5-7B
- Phi-3-mini-4k
- Command-R
- Mistral-7B
As of November 2024, FaithBench contains 750 samples.
Human annotations are in `assign/batch_5_src_no_sports/results/batch_{1..16}_annotation.json`.
Each annotation is a span in the summary text that is considered or suspected to be a hallucination, and is labeled with one or more of the following categories:
- Unwanted, which includes the following subcategories:
  - Intrinsic
  - Extrinsic
- Benign
- Questionable
If an annotation is related to a span in the source text, the source span is also included in the annotation. An annotation may also include a note explaining why the human annotator believes or suspects that the summary span is a hallucination.
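These annotation files are plain JSON arrays, so they can be consumed with standard tooling. Below is a minimal Python sketch for loading one batch file and extracting the annotated spans; the field names follow the example record shown next, and `summary_start`/`summary_end` are character offsets into the summary string (in the example record, `summary[78:88] == "production"`).

```python
import json

# Minimal sketch: load one exported batch file and print every annotated span.
# The path follows the nomenclature above; field names follow the example record below.
path = "assign/batch_5_src_no_sports/results/batch_1_annotation.json"
with open(path, encoding="utf-8") as f:
    samples = json.load(f)  # a JSON array of sample objects

for sample in samples:
    summary = sample["summary"]
    for annot in sample.get("annotations", []):
        # summary_start/summary_end are character offsets into the summary string
        span = summary[annot["summary_start"]:annot["summary_end"]]
        print(sample["sample_id"], annot["label"], repr(span))
```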
The format of the annotation JSON file is as follows:
[
  {
    "sample_id": 0,
    "source": "Poseidon (film) . Poseidon grossed $ 181,674,817 at the worldwide box office on a budget of $ 160 million .",
    "summary": " The film \"Poseidon\" grossed $181,674,817 at the worldwide box office, with a production budget of $160 million.",
    "annotations": [
      {
        "annot_id": 1,
        "sample_id": 0,
        "annotator": "a3ac21668e6249b7978617da547f2708",
        "label": [
          "Unwanted",
          "Unwanted.Instrinsic"
        ],
        "note": "\"budget\" (source) vs. \"production budget\" (summary)\nThe budget for a movie may also include non-production budget such as distribution, advertising. ",
        "annotator_name": "XXXX",
        "summary_span": "production",
        "summary_start": 78,
        "summary_end": 88
      },
      {
        "annot_id": 60,
        "sample_id": 0,
        "annotator": "69a785fa7f454e7da5eef3c608b2133a",
        "label": [
          "Unwanted",
          "Unwanted.Instrinsic"
        ],
        "note": "\"budget\" (source) vs. \"production budget\" (summary) The budget for a movie may also include non-production budget such as distribution, advertising. ",
        "annotator_name": "XXXX",
        "summary_span": "production",
        "summary_start": 78,
        "summary_end": 88
      }
    ],
    "meta_model": "mistralai/Mistral-7B-Instruct-v0.3",
    "meta_hhemv1": 0.9995,
    "meta_hhem-2.1": 0.52694,
    "meta_hhem-2.1-english": 0.98313,
    "meta_trueteacher": 1,
    "meta_true_nli": 1,
    "meta_gpt-3.5-turbo": 1,
    "meta_gpt-4-turbo": 1,
    "meta_gpt-4o": 1,
    "meta_sample_id": 15
  },
  ...
]

The data collection and annotation process has three steps:

- Identify samples worth human annotation: Our data collection starts from Vectara's Hallucination Leaderboard, where summaries have been generated by dozens of LLMs from over one thousand news articles. We then used 4 hallucination detectors (Vectara's HHEM-1 and HHEM-2.1, and Google's TrueTeacher and TrueNLI) to predict hallucination scores for the summaries, and selected the samples on which the 4 detectors disagree most for human annotation. Our rationale is that samples on which the detectors unanimously agree are easy cases that are not worth human annotation. (A selection sketch follows this list.)
- Distribute human annotation tasks: Samples are then distributed to human annotators via Mercury, a platform that we developed for this purpose. Each sample is annotated by 2-3 annotators.
- Collect human annotations and analyze them: The annotations are stored by Mercury as SQLite files. We then fetch them, export them to JSONL format, and finally analyze them.
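As a concrete illustration of the first step, here is a minimal sketch of ranking samples by detector disagreement. The 0.5 binarization threshold and the minority-fraction measure are illustrative assumptions, not necessarily the exact criterion used in `data_collection_scripts`.

```python
# Minimal sketch: rank samples by how much the 4 detectors disagree.
# The scores and the 0.5 threshold here are illustrative assumptions.
def disagreement(scores, threshold=0.5):
    """Fraction of detectors in the minority after binarizing each score.
    0.0 = unanimous (easy case), 0.5 = maximally split (hard case)."""
    votes = [score >= threshold for score in scores]
    positive = sum(votes) / len(votes)
    return min(positive, 1 - positive)

# Hypothetical per-sample scores from HHEM-1, HHEM-2.1, TrueTeacher, TrueNLI.
samples = [
    {"sample_id": 0, "scores": [0.99, 0.53, 1.0, 1.0]},
    {"sample_id": 1, "scores": [0.10, 0.95, 0.05, 0.90]},
]

# The most disagreed-upon samples are the candidates for human annotation.
ranked = sorted(samples, key=lambda s: disagreement(s["scores"]), reverse=True)
print([s["sample_id"] for s in ranked])  # -> [1, 0]
```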
The repository is organized as follows, in the order in which the code is executed:
- Folder `backup_data_with_detector_results`: the summaries generated by the 10 selected LLMs and the hallucination scores predicted by the 4 hallucination detectors
- Folder `data_collection_scripts`: scripts for collecting predictions from various detectors and selecting hard samples for annotation
- Folder `assign`: data and scripts for human annotation (generation of annotation tasks, distribution of tasks, collection of annotations, and analysis of annotations)
  - File `examples_to_annotate.csv`: the samples selected for human annotation
  - Folder `pilot`: data and scripts for the pilot annotation
  - Folder `batch_5_src_no_sports`: the main annotation batches
    - Folder `results`: the annotation results, as both SQLite database files and exported JSONL files. Filename nomenclature: `batch_{1..16}_annotation.json` and `batch_{1..16}.sqlite`.
    - Scripts `{ingest, fetch_db, dump_db}.sh`: scripts to ingest (including embedding and vector-database steps), fetch, and dump the SQLite database files
  - Script `create_batch.py`: slices the data into batches of 50 samples each, excluding sports-related samples
  - Script `generate_faithbench.ipynb`: notebook for generating `FaithBench.csv` from all annotations without filtering; maps the labels on text spans to labels on samples using the "worst-pooling" and "best-pooling" strategies (a pooling sketch follows this list)
  - Script `dataset_stats.ipynb`: computes basic statistics of the dataset
- Folder `eval`: data files and scripts for analysis, focusing on sentence-level and sample-level evaluations
  - Folder `sent_level_results`: data files for sentence-level evaluation
  - Scripts `eval_{detectors, detectors_sent_level, llm}.ipynb`: evaluation of the performance of hallucination detectors (at sample level and sentence level) and of LLMs
  - Script `interannotator_agreement.ipynb`: calculates inter-annotator agreement on the annotations
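For reference, below is a minimal sketch of how span-level labels can be pooled into a sample-level label. The severity ordering and the pooling granularity (per annotation rather than per annotator) are illustrative assumptions; `generate_faithbench.ipynb` is the authoritative implementation.

```python
# Minimal sketch of mapping span-level labels to a sample-level label via
# "worst-pooling" and "best-pooling". The severity ordering below is an
# illustrative assumption; generate_faithbench.ipynb is the reference.
SEVERITY = {"Consistent": 0, "Benign": 1, "Questionable": 2, "Unwanted": 3}

def top_level(labels):
    """Reduce an annotation's label list, e.g. ["Unwanted", "Unwanted.Intrinsic"],
    to its most severe top-level category."""
    tops = [label.split(".")[0] for label in labels]
    return max(tops, key=lambda l: SEVERITY.get(l, 0))

def pool(sample, strategy="worst"):
    """Sample-level label pooled over all span-level annotations on the sample."""
    annotations = sample.get("annotations", [])
    if not annotations:
        return "Consistent"  # no annotator flagged any span in the summary
    labels = [top_level(a["label"]) for a in annotations]
    pick = max if strategy == "worst" else min
    return pick(labels, key=lambda l: SEVERITY[l])
```

Under worst-pooling, a single Unwanted span makes the whole summary Unwanted; best-pooling is the lenient counterpart.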