
LLM Annotations

Zdenek Kasner edited this page Sep 18, 2025 · 12 revisions

With factgenie, you can query LLMs to annotate spans in the generated outputs:

LLM eval scheme

✨ LLM APIs

The LLMs need to run externally: factgenie will query them through an API.

We use the LiteLLM library for unified access to multiple providers.

Currently, factgenie supports the following APIs:

Provider        | Type        | Variables / parameters required
----------------|-------------|--------------------------------
OpenAI          | Proprietary | OPENAI_API_KEY
Anthropic       | Proprietary | ANTHROPIC_API_KEY
Gemini          | Proprietary | GEMINI_API_KEY
Google VertexAI | Proprietary | VERTEXAI_LOCATION, VERTEXAI_PROJECT, VERTEXAI_JSON_FULL_PATH
Ollama          | Local       | api_url parameter
VLLM            | Local       | api_url parameter

Caution

Use only the most capable models for satisfactory results. Smaller, less capable models often fail to generate outputs in a valid format, return empty outputs, or produce invalid annotations.

Tip

In principle, factgenie can operate with any API as long as the response is in JSON format: see factgenie/metrics.py. If you wish to contribute here, please see the Contributing guidelines.

🚦 Setting up an LLM evaluation campaign

You can easily set up a new LLM eval (i.e., a campaign using LLMs for annotations) through the web interface:

  1. Go to /llm_eval and click on New LLM campaign.
  2. Insert a unique campaign ID.
  3. Configure the LLM evaluator.
  4. Select the datasets and splits you want to annotate.

Let us now look at steps 3 and 4 in more detail.

🛠️ Configuring the LLM evaluator

The fields you need to configure are the following:

  • Annotation span categories: Names, colors, and descriptions of the categories you want to annotate.

    • The colors will be later used for the highlights in the web interface.
    • You need to specify the details about the categories in the prompt (see below).
  • Prompting strategy: How to prompt the model and parse the output.

    • By default, we use structured output with a fixed JSON schema to get annotations (see below).
    • Use parse_raw for models that do not support structured output or if you need to preserve the model's reasoning trace.
  • Annotation JSON: Configure which fields should be included in the JSON schema for annotations.

    • text: The actual text span to be annotated (mandatory, always enabled).
    • annotation_type: Index to the annotation category (mandatory, always enabled).
    • reason: Include explanation of why this text should be annotated (enabled by default).
    • occurence_index: Include a 0-based index to disambiguate between multiple occurrences of the same text (disabled by default).
    • Important: Your prompt template must clearly describe the purpose and format of each enabled field.
  • Prompt template: The prompt for the model.

Important

See below for the instructions for setting up the prompt.

  • System message: The text describing the role of the model.
  • API provider: The LLM evaluator you will be calling through an API.
  • API URL: For local providers, this is the API URL, e.g. http://my-server.com:11434/api/generate. The parameter is ignored for proprietary APIs.
  • Model: The identifier of the model you are querying.
  • Model arguments: Arguments for the model API, e.g., temperature, top-k, seed, etc.
  • Extra arguments: Custom arguments for the metric class. We currently support:
    • stopping_sequence (str): The model output will be finished once this sequence is encountered.
    • remove_suffix (str): This suffix will be stripped from the model output.
    • with_reason (bool): Include reason field in annotations (default: true). Can also be set via Annotation JSON checkboxes.
    • with_occurence_index (bool): Include occurrence index field in annotations (default: false). Can also be set via Annotation JSON checkboxes.
  • Allow overlapping annotations: Whether the model should be allowed to produce overlapping annotations.

The pre-defined YAML configurations for LLM campaigns are stored in factgenie/config/llm-eval. If you wish, you can also edit these files manually. You can also save the configuration in a YAML file through the web interface.

After creating the campaign, all the configuration parameters will be saved in the file factgenie/campaigns/<llm-eval-id>/metadata.json (and also alongside each generated example).

💬 Configuring the prompt

It is important to set up the prompt for the model correctly so that you get accurate results from the LLM evaluator.

To help you with that, you can use the following two buttons:

  • ✨ Pre-fill prompt template
    • This button will insert a basic prompt template with your custom error categories into the prompt template textbox.
    • You can then modify the prompt template in the textbox.
    • You can also modify the basic template itself in factgenie/config/default_prompts.yml.
  • πŸ“ Add example to template
    • This button will open a wizard for adding an example of error annotation.
    • Follow the instructions in the wizard for adding an example.
    • The example will be appended to the existing prompt template.
    • It is highly recommended to add at least one example to the prompt.

If you decide to write the prompt manually, carefully follow the instructions below.

For including the input data and the generated output in the prompt, use the placeholders:

  • {data} for inserting the raw representation of the input data,
  • {data[key]} for accessing specific keys in the data dictionary,
  • {data[key][subkey]} for accessing nested dictionary values (any depth is supported),
  • {text} for inserting the output text.

The placeholders will be replaced with the actual values for each example.
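As a sketch of how such a template might be filled (assuming Python str.format-style substitution; the template and the data below are invented for illustration):

```python
prompt_template = (
    "Data:\n{data}\n\n"
    "Summary of {data[match][home_team]} vs {data[match][away_team]}:\n{text}"
)

example = {
    "data": {"match": {"home_team": "Arsenal", "away_team": "Chelsea"}},
    "text": "Arsenal beat Chelsea 2-0.",
}

# str.format resolves {data[key][subkey]} lookups to any nesting depth
prompt = prompt_template.format(data=example["data"], text=example["text"])
```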

We are using Pydantic to receive structured output from the model. The exact format depends on your Annotation JSON configuration:

Full annotation model (with reason and occurrence index):

class SpanAnnotationOccurenceIndex(BaseModel):
    reason: str = Field(description="The reason for the annotation.")
    text: str = Field(description="The text which is annotated.")
    annotation_type: int = Field(description="Index to the list of span annotation types.")
    occurence_index: int = Field(description="0-based index for multiple occurrences of the same text.")

Standard annotation model (with reason only, default):

class SpanAnnotation(BaseModel):
    reason: str = Field(description="The reason for the annotation.")
    text: str = Field(description="The text which is annotated.")
    annotation_type: int = Field(description="Index to the list of span annotation types.")

Minimal annotation models are also available without the reason field or with the occurrence index only.

This translates to the following JSON structure (example with all fields enabled):

{
  "annotations": [
    { 
      "reason": "[REASON]",
      "text": "[TEXT_SPAN]",
      "annotation_type": [CATEGORY_INDEX],
      "occurence_index": [OCCURRENCE_INDEX]
    },
    ...
  ]
}

where:

  • REASON is a reasoning trace about the annotation (visible on hover in the web interface, optional).
  • TEXT_SPAN is the actual text snippet from the output (mandatory).
  • CATEGORY_INDEX is a number from the list of annotation categories (mandatory).
  • OCCURRENCE_INDEX is a 0-based index for disambiguating multiple occurrences of the same text (optional).

Note: Only the fields you enable in "Annotation JSON" will be included in the actual schema.
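To illustrate how the Pydantic models map to this JSON structure, here is a hypothetical sketch (assuming Pydantic v2; the SpanAnnotations wrapper class is an assumption, not factgenie's actual class):

```python
from pydantic import BaseModel, Field

class SpanAnnotation(BaseModel):
    reason: str = Field(description="The reason for the annotation.")
    text: str = Field(description="The text which is annotated.")
    annotation_type: int = Field(description="Index to the list of span annotation types.")

# hypothetical wrapper matching the {"annotations": [...]} structure above
class SpanAnnotations(BaseModel):
    annotations: list[SpanAnnotation]

raw = '{"annotations": [{"reason": "Score contradicts the data.", "text": "3-1", "annotation_type": 0}]}'
parsed = SpanAnnotations.model_validate_json(raw)
```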

Important

Even though Pydantic ensures that the response will be in this specific format, you still need to prompt the model to produce JSON outputs in this format (otherwise factgenie may fail to parse model responses).

For instructing the model about the annotation categories, you can include a variant of the following snippet in the prompt (customized to your needs):

The value of "annotation_type" is one of {0, 1, 2, 3} based on the following list:
- 0: Incorrect fact: The fact in the text contradicts the data.
- 1: Not checkable: The fact in the text cannot be checked in the data.
- 2: Misleading: The fact in the text is misleading in the given context.
- 3: Other: The text is problematic for another reason, e.g. grammatically or stylistically incorrect, irrelevant, or repetitive.

If using occurrence index, also include instructions like:

If the same text appears multiple times and you need to annotate a specific occurrence, use "occurence_index" (0-based):
- For the first occurrence: "occurence_index": 0
- For the second occurrence: "occurence_index": 1
- And so on...

Example: If "Paris" appears twice in the text and the second occurrence is incorrect:
{"text": "Paris", "annotation_type": 0, "reason": "Wrong context", "occurence_index": 1}

We perform forward string matching to locate the annotated spans in the text:

Basic matching (when occurence_index is not provided):

  • If we match the string in the output, we shift the initial position:
    • to the first character of the currently matched string (if overlap is allowed)
    • to the next character after the currently matched string (if overlap is not allowed)
  • If we do not match the string in the output, we ignore the annotation.

Occurrence-aware matching (when occurence_index is provided):

  • We find all occurrences of the text span in the output.
  • We select the occurrence at the specified 0-based index (e.g., "occurence_index": 1 selects the second occurrence).
  • If the index is invalid (e.g., asking for the 3rd occurrence when only 2 exist), we fall back to the first occurrence.
  • This is particularly useful when the same text appears multiple times and you need to annotate a specific occurrence.

Caution

You should ask the model to order the annotations sequentially for this algorithm to work properly.

📚 Configuring the data

In the next step, you can select the datasets and splits you want to annotate.

Note that to make the selection process easier, we always select the Cartesian product of the selected datasets, splits, and model outputs (existing combinations only).

You can then filter the selected combinations in the box below.

Data selection

After the campaign is created, the selected examples will be listed in factgenie/annotations/<llm-eval-id>/db.csv. You can edit this file before starting the campaign if you wish to customize the data selection, e.g. down to specific examples.
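If you prefer to script the edit, a sketch like the following could filter db.csv (the column names "split" and "example_idx" are assumptions — inspect your file's header for the real ones; filter_db is an invented helper):

```python
import csv

def filter_db(in_path: str, out_path: str, split: str, max_idx: int) -> int:
    """Keep only rows of the given split with example_idx below max_idx.

    Column names ("split", "example_idx") are assumptions; check your db.csv.
    Returns the number of rows kept.
    """
    with open(in_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = [r for r in reader
                if r["split"] == split and int(r["example_idx"]) < max_idx]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```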

🤖 Running an LLM eval

After the LLM evaluation campaign is created, it will appear in the list on the /llm_eval page:

LLM eval

Now you need to run the evaluation by clicking the "play" button. The annotated examples will be marked as finished:

LLM eval detail

You can view the annotations from the model as soon as they are received.

💻 Command line interface

Alternatively, you can run an evaluation campaign from the command line, see the page on command line interface.
