
FaithBench: a human-annotated benchmark on challenging summarization hallucinations of modern LLMs

FaithBench is a benchmark for summarization hallucinations, featuring human-annotated hallucinations in summaries generated by 10 modern LLMs across 8 different model families. The ten LLMs are:

  • GPT-4o
  • GPT-3.5-turbo
  • Llama-3.1-70B
  • Gemini-1.5-Flash
  • Llama-3.1-8B
  • Claude-3.5-Sonnet
  • Qwen-2.5-7B
  • Phi-3-mini-4k
  • Command-R
  • Mistral-7B

As of November 2024, FaithBench contains 750 samples.

Annotations

Human annotations are in assign/batch_5_src_no_sports/results/batch_{1..16}_annotation.json.

Each annotation is a span in the summary text that is considered or suspected to be a hallucination, and is labeled with one or more of the following categories:

  • Unwanted, which includes the following subcategories:
    • Intrinsic
    • Extrinsic
  • Benign
  • Questionable

If an annotation is related to a span in the source text, the source span is also included in the annotation. An annotation may also include a note explaining why the human annotator believes or suspects that the summary span is a hallucination.

The format of the annotation JSON file is as follows:

[
  {
    "sample_id": 0,
    "source": "Poseidon (film) . Poseidon grossed $ 181,674,817 at the worldwide box office on a budget of $ 160 million .",
    "summary": " The film \"Poseidon\" grossed $181,674,817 at the worldwide box office, with a production budget of $160 million.",
    "annotations": [
      {
        "annot_id": 1,
        "sample_id": 0,
        "annotator": "a3ac21668e6249b7978617da547f2708",
        "label": [
          "Unwanted",
          "Unwanted.Instrinsic"
        ],
        "note": "\"budget\" (source) vs. \"production budget\" (summary)\nThe budget for a movie may also include non-production budget such as distribution, advertising. ",
        "annotator_name": "XXXX",
        "summary_span": "production",
        "summary_start": 78,
        "summary_end": 88
      },
      {
        "annot_id": 60,
        "sample_id": 0,
        "annotator": "69a785fa7f454e7da5eef3c608b2133a",
        "label": [
          "Unwanted",
          "Unwanted.Instrinsic"
        ],
        "note": "\"budget\" (source) vs. \"production budget\" (summary) The budget for a movie may also include non-production budget such as distribution, advertising. ",
        "annotator_name": "XXXX",
        "summary_span": "production",
        "summary_start": 78,
        "summary_end": 88
      }
    ],
    "meta_model": "mistralai/Mistral-7B-Instruct-v0.3",
    "meta_hhemv1": 0.9995,
    "meta_hhem-2.1": 0.52694,
    "meta_hhem-2.1-english": 0.98313,
    "meta_trueteacher": 1,
    "meta_true_nli": 1,
    "meta_gpt-3.5-turbo": 1,
    "meta_gpt-4-turbo": 1,
    "meta_gpt-4o": 1,
    "meta_sample_id": 15
  },
  ...
]
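
To work with these files programmatically, the snippet below is a minimal sketch that loads one annotation file and recovers each annotated span from its character offsets. The path follows the nomenclature above and the field names match the example, but treat it as illustrative rather than an official loader.

import json

# One of the 16 annotation files; adjust the batch number as needed.
path = "assign/batch_5_src_no_sports/results/batch_1_annotation.json"

with open(path, encoding="utf-8") as f:
    samples = json.load(f)

for sample in samples:
    for annot in sample["annotations"]:
        # Recover the annotated span from its character offsets into the summary.
        span = sample["summary"][annot["summary_start"]:annot["summary_end"]]
        print(sample["sample_id"], annot["label"], repr(span))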

How FaithBench was created

  1. Identify samples worth human annotation: Our data collection starts from Vectara's Hallucination Leaderboard, where summaries have been generated by dozens of LLMs from over one thousand news articles. We then used 4 hallucination detectors (Vectara's HHEM-1 and HHEM-2.1, and Google's TrueTeacher and TrueNLI) to predict hallucination scores for the summaries, and selected the samples on which the 4 detectors disagreed the most for human annotation (see the sketch after this list). Our rationale is that samples on which the detectors unanimously agree are easy cases and are not worth human annotation.
  2. Distribute human annotation tasks: Samples are then distributed to human annotators via Mercury, a platform that we developed for this purpose. Each sample is annotated by 2-3 annotators.
  3. Collect human annotations and analyze them: Mercury stores the annotations as SQLite files. We then fetch them, export them to JSONL format, and finally analyze them.
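
As a rough illustration of the selection in step 1, one simple way to quantify disagreement is the spread of the four detectors' scores per sample, as sketched below. The field names come from the JSON example above; the actual selection procedure may differ.

import json
from statistics import pstdev

DETECTOR_FIELDS = ["meta_hhemv1", "meta_hhem-2.1", "meta_trueteacher", "meta_true_nli"]

def disagreement(sample):
    # Standard deviation of the four detector scores; higher means more disagreement.
    return pstdev(float(sample[field]) for field in DETECTOR_FIELDS)

with open("assign/batch_5_src_no_sports/results/batch_1_annotation.json", encoding="utf-8") as f:
    samples = json.load(f)

# Samples the detectors disagree on the most come first.
hardest = sorted(samples, key=disagreement, reverse=True)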

File hierarchy

In the order in which the code is executed:

  • backup_data_with_detector_results: the summaries generated by 10 selected LLMs and the hallucination scores predicted by 4 hallucination detectors
  • data_collection_scripts: scripts for collecting predictions from various detectors and selecting hard samples for annotation
  • assign: data and scripts for human annotation (generation of annotation tasks, distribution of tasks, collection of annotations, and analysis of annotations)
    • File examples_to_annotate.csv: The samples selected for human annotation
    • Folder pilot: data and scripts for pilot annotation
    • Folder batch_5_src_no_sports: the main annotation batches
      • results: the annotation results, both SQLite database files and exported JSONL files. Filename nomenclature: batch_{1..16}_annotation.json and batch_{1..16}.sqlite.
      • {ingest, fetch_db, dump_db}.sh: scripts to ingest (including embedding and vector-database setup), fetch, and dump the SQLite database files
    • Script create_batch.py: Slices the data into batches of 50 samples each, excluding sports-related samples
    • Script generate_faithbench.ipynb: Notebook for generating FaithBench.csv from all annotations without filtering. Maps the labels on text spans to labels on samples using "worst-pooling" and "best-pooling" strategies (see the pooling sketch after this list)
    • Script dataset_stats.ipynb: Computes basic statistics of the dataset
  • eval: data files and scripts for analysis, focusing on sentence-level and sample-level evaluations
    • Folder sent_level_results: data files for sentence-level evaluation
    • Script eval_{detectors, detectors_sent_level, llm}.ipynb: Evaluates the performance of hallucination detectors (at the sample and sentence levels) and of LLMs
    • Script interannotator_agreement.ipynb: Calculates inter-annotator agreement on the annotations
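
The pooling mentioned for generate_faithbench.ipynb can be sketched as follows: worst-pooling labels a sample with the most severe label found on any of its annotated spans, while best-pooling uses the least severe one. The severity ordering and the "Consistent" placeholder for unannotated samples below are assumptions for illustration, not the notebook's exact code.

# Assumed severity ordering, from most to least severe.
SEVERITY = ["Unwanted", "Questionable", "Benign", "Consistent"]

def pool(sample, strategy="worst"):
    # Keep only the top-level category of each span label, e.g. "Unwanted.Intrinsic" -> "Unwanted".
    labels = {lab.split(".")[0] for annot in sample["annotations"] for lab in annot["label"]}
    if not labels:
        return "Consistent"  # no annotated spans at all
    ranked = sorted(labels, key=SEVERITY.index)
    return ranked[0] if strategy == "worst" else ranked[-1]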
