
GAICo: GenAI Results Comparator


GenAI Results Comparator (GAICo) is a Python library for comparing, analyzing, and visualizing outputs from Large Language Models (LLMs). It offers an extensible range of metrics, including standard text similarity scores, specialized metrics for structured data like planning sequences and time-series, and multimedia metrics for image and audio.


Video Demo

Click to watch the GAICo demo on YouTube.

News

This section summarizes the major releases of the GAICo library, highlighting key features and providing quick start examples. For more details, please refer to the news pages.

Release Name | Date | Summary of Changes | More Info
v0.3.0 | August 2025 | Added new multimedia metrics (image and audio) and enhancements for the Experiment class. | Details for v0.3.0
v0.2.0 | July 2025 | Added specialized text metrics: time-series & automated planning. | Details for v0.2.0
v0.1.5 | June 2025 | Initial release: generic text metrics, Experiment class, & visualizations. | Details for v0.1.5

Quick Start

GAICo makes it easy to evaluate and compare LLM outputs. The following Python (Jupyter) notebooks showcase different use cases:

  • quickstart.ipynb: A rapid hands-on introduction to the Experiment class. Open In Colab

  • example-1.ipynb: A fine-grained example comparing multiple model outputs with a single metric. Open In Colab

  • example-2.ipynb: A fine-grained example evaluating a single model output across all available metrics. Open In Colab

Tip

For more detailed, runnable examples, check out all of our Jupyter Notebooks in the examples/ folder. You can run them directly in Google Colab!

Streamlined Workflow with Experiment

For a more integrated approach to comparing multiple models, applying thresholds, generating plots, and creating CSV reports, the Experiment class offers a convenient abstraction.

Quick Example

This example demonstrates comparing multiple LLM responses against a reference answer using specified metrics, generating a plot, and outputting a CSV report.

from gaico import Experiment

# Sample data from https://arxiv.org/abs/2504.07995
llm_responses = {
    "Google": "Title: Jimmy Kimmel Reacts to Donald Trump Winning the Presidential ... Snippet: Nov 6, 2024 ...",
    "Mixtral 8x7b": "I'm an Al and I don't have the ability to predict the outcome of elections.",
    "SafeChat": "Sorry, I am designed not to answer such a question.",
}
reference_answer = "Sorry, I am unable to answer such a question as it is not appropriate."
# Alternatively, if reference_answer is None, the response from the first model ("Google") will be used:
# reference_answer = None

# 1. Initialize Experiment
exp = Experiment(
    llm_responses=llm_responses,
    reference_answer=reference_answer
)

# 2. Compare models using specific metrics
#   This will calculate scores for 'Jaccard' and 'ROUGE',
#   generate a plot (e.g., radar plot for multiple metrics/models),
#   and save a CSV report.
results_df = exp.compare(
    metrics=['Jaccard', 'ROUGE'],  # Specify metrics, or None for all defaults
    plot=True,
    output_csv_path="experiment_report.csv",
    custom_thresholds={"Jaccard": 0.6, "ROUGE_rouge1": 0.35} # Optional: override default thresholds
)

# The returned DataFrame contains the calculated scores
print("Scores DataFrame from compare():")
print(results_df)

# 3. Get a summary of results (e.g., mean scores and pass rates)
summary_df = exp.summarize(metrics=['Jaccard', 'ROUGE'], custom_thresholds={"Jaccard": 0.6, "ROUGE_rouge1": 0.35})
print("\nSummary DataFrame:")
print(summary_df)

This abstraction streamlines common evaluation tasks while still allowing access to the underlying metric classes and dataframes for more advanced or customized use cases. See examples/quickstart.ipynb for more details.

Scope and Dataset Evaluation

The Experiment class is designed for evaluating a set of model responses against a single reference answer at a time, which is ideal for analyzing outputs for a specific prompt or scenario.

Warning

If reference_answer is not provided (i.e., set to None), GAICo will automatically use the response from the first model in the llm_responses dictionary as the reference. A warning message will be printed in your console to indicate this behavior.

Note

To evaluate a full dataset (e.g., list_of_references, list_of_model_A_generations), you have two main options:

  1. Iterate with Experiment: Loop through your dataset and create a new Experiment instance for each reference text and its corresponding model responses.
  2. Use Metric Classes Directly: For more control, use individual metric classes (e.g., JaccardSimilarity().calculate(list_of_gens, list_of_refs)), which support list inputs for batch processing.

See the examples/ directory for notebooks demonstrating both approaches.
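
As a rough sketch of the two options, assuming a small hypothetical dataset (the variable names below are illustrative, and the JaccardSimilarity import path may differ from your install):

from gaico import Experiment
from gaico.metrics import JaccardSimilarity  # import path assumed; metrics may also be exposed at the package top level

# Hypothetical dataset: one reference and a set of model responses per prompt
dataset = [
    {
        "reference": "Paris is the capital of France.",
        "responses": {"ModelA": "The capital of France is Paris.", "ModelB": "It is Lyon."},
    },
    {
        "reference": "Water boils at 100 degrees Celsius at sea level.",
        "responses": {"ModelA": "At sea level, water boils at 100 C.", "ModelB": "Water boils at 90 C."},
    },
]

# Option 1: one Experiment per reference and its corresponding responses
for row in dataset:
    exp = Experiment(llm_responses=row["responses"], reference_answer=row["reference"])
    print(exp.compare(metrics=["Jaccard"]))

# Option 2: a metric class used directly, with list inputs for batch processing
gens = [row["responses"]["ModelA"] for row in dataset]
refs = [row["reference"] for row in dataset]
print(JaccardSimilarity().calculate(gens, refs))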

Example radar chart showing multiple metrics for a single LLM, generated by the examples/example-2.ipynb notebook.

Tip

Want to add your own metric? Check out our guide in the FAQ.

Description

At its core, the library provides a set of metrics for evaluating various types of outputs—from plain text strings to structured data like planning sequences and time-series, and multimedia content such as images and audio. While the Experiment class streamlines evaluation for text-based and structured string outputs, individual metric classes offer direct control for all data types, including binary or array-based multimedia. These metrics produce normalized scores (typically 0 to 1), where 1 indicates a perfect match, enabling robust analysis and visualization of LLM performance.

Class Structure: All metrics are implemented as extensible classes inheriting from BaseMetric. Each metric requires just one method: calculate().

The calculate() method takes two main parameters:

  • generated_texts: A single generated output or an iterable (list, numpy array, etc.) of outputs.
  • reference_texts: A single reference output or an iterable of outputs.
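
As an illustrative sketch of this structure, a minimal word-overlap metric might look like the following (the BaseMetric import path and the handling of single vs. iterable inputs are assumptions; see the FAQ guide for the recommended recipe):

from gaico import BaseMetric  # import path assumed; BaseMetric may live in a submodule

class WordOverlap(BaseMetric):
    """Toy metric: fraction of reference words that also appear in the generated text."""

    def calculate(self, generated_texts, reference_texts):
        # Normalize single strings into one-element lists.
        if isinstance(generated_texts, str):
            generated_texts = [generated_texts]
        if isinstance(reference_texts, str):
            reference_texts = [reference_texts]

        scores = []
        for gen, ref in zip(generated_texts, reference_texts):
            ref_words = set(ref.lower().split())
            gen_words = set(gen.lower().split())
            # Normalized score in [0, 1]; 1 means every reference word is present.
            scores.append(len(ref_words & gen_words) / len(ref_words) if ref_words else 0.0)

        return scores[0] if len(scores) == 1 else scores

Keeping returned scores in the 0-to-1 range keeps a custom metric compatible with the thresholding, summarizing, and plotting features described above.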

Important

Handling Missing References: If reference_texts is None or empty, GAICo will automatically use the first item from generated_texts as the reference for comparison. A warning will be printed to the console.

Note

Batch Processing: When you provide iterables as input, calculate() assumes a one-to-one mapping between generated and reference items. If a single reference is provided for multiple generated items, it will be broadcasted for comparison against each one.
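
For instance, the broadcasting and missing-reference behavior might be exercised like this (a sketch assuming JaccardSimilarity is importable as shown; the warning messages are not reproduced here):

from gaico.metrics import JaccardSimilarity  # import path assumed

generated = [
    "Sorry, I cannot answer that question.",
    "The answer is 42.",
    "I am unable to answer such a question.",
]
reference = "Sorry, I am unable to answer such a question."

metric = JaccardSimilarity()

# One-to-one mapping: three generated items against three references
print(metric.calculate(generated, [reference] * 3))

# A single reference is broadcast against every generated item
print(metric.calculate(generated, reference))

# With reference_texts=None, the first generated item is used as the reference
# (a warning is printed to the console)
print(metric.calculate(generated, None))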

Note

Optional Dependencies: The standard pip install gaico is lightweight. Some metrics with heavy dependencies (like BERTScore or JSDivergence) require optional installation.

Inspiration: The design and evaluation metrics are inspired by Microsoft's article on evaluating LLM-generated content. GAICo currently focuses on reference-based metrics.

GAICo Overview

Overview of the workflow supported by the GAICo library

Table of Contents

  • Features
  • Installation
  • Project Structure
  • Code Style
  • Running Tests
  • Contributing
  • Citation
  • Acknowledgments
  • License
  • Contact

Features

  • Comprehensive Metric Library:
    • Textual Similarity: Jaccard, Cosine, Levenshtein, Sequence Matcher.
    • N-gram Based: BLEU, ROUGE, JS Divergence.
    • Semantic Similarity: BERTScore.
    • Structured Data: Specialized metrics for planning sequences (PlanningLCS, PlanningJaccard) and time-series data (TimeSeriesElementDiff, TimeSeriesDTW).
    • Multimedia: Metrics for image similarity (ImageSSIM, ImageAverageHash, ImageHistogramMatch) and audio quality (AudioSNRNormalized, AudioSpectrogramDistance).
  • Streamlined Evaluation Workflow:
    • A high-level Experiment class to easily compare multiple models, apply thresholds, generate plots, and create CSV reports.
  • Enhanced Reporting:
    • A summarize() method for quick, aggregated overviews of model performance, including mean scores and pass rates.
  • Dynamic Metric Registration:
    • Easily extend the Experiment class by registering your own custom BaseMetric implementations at runtime.
  • Powerful Visualization:
    • Generate bar charts and radar plots to compare model performance using Matplotlib and Seaborn.
  • Efficient & Flexible:
    • Supports batch processing for efficient computation on datasets.
    • Optimized for various input types (lists, NumPy arrays, Pandas Series).
    • Easily extensible architecture for adding new custom metrics.
  • Robust and Reliable:
    • Includes a comprehensive test suite using Pytest.

Installation

Important

We strongly recommend using a Python virtual environment to manage dependencies and avoid conflicts with other packages.

GAICo can be installed using pip.

  • Create and activate a virtual environment (e.g., named gaico-env):

      # For Python 3.10+
      python3 -m venv gaico-env
      source gaico-env/bin/activate  # On macOS/Linux
      # gaico-env\Scripts\activate   # On Windows
  • Install GAICo: Once your virtual environment is active, install GAICo using pip:

      pip install gaico

This installs the core GAICo library.

Using GAICo with Jupyter Notebooks/Lab

If you plan to use GAICo within Jupyter Notebooks or JupyterLab (recommended for exploring examples and interactive analysis), install them into the same activated virtual environment:

# (Ensure your 'gaico-env' is active)
pip install notebook  # For Jupyter Notebook
# OR
# pip install jupyterlab # For JupyterLab

Then, launch Jupyter from the same terminal where your virtual environment is active:

# (Ensure your 'gaico-env' is active)
jupyter notebook
# OR
# jupyter lab

New notebooks created in this session should automatically use the gaico-env Python environment. For troubleshooting kernel issues, please see our FAQ document.

Optional Installations

The default pip install gaico is lightweight. Some metrics require extra dependencies, which you can install as needed.

  • To include Audio metrics (requires SciPy and SoundFile):
    pip install 'gaico[audio]'
  • To include the BERTScore metric (which has larger dependencies like PyTorch):
    pip install 'gaico[bertscore]'
  • To include the CosineSimilarity metric (requires scikit-learn):
    pip install 'gaico[cosine]'
  • To include the JSDivergence metric (requires SciPy and NLTK):
    pip install 'gaico[jsd]'
  • To install with all optional features:
    pip install 'gaico[audio,bertscore,cosine,jsd]'

Tip

The dev extra, used for development installs, also includes all optional features.

Installation Size Comparison

The following table gives an estimated overview of the relative disk space impact of different installation options; actual sizes vary with your operating system, Python version, and existing packages.

Note: Core dependencies include levenshtein, matplotlib, numpy, pandas, rouge-score, and seaborn.

Installation Command | Dependencies | Estimated Total Size Impact
pip install gaico | Core | 215 MB
pip install 'gaico[audio]' | Core + scipy, soundfile | 330 MB
pip install 'gaico[bertscore]' | Core + bert-score (includes torch, transformers, etc.) | 800 MB
pip install 'gaico[cosine]' | Core + scikit-learn | 360 MB
pip install 'gaico[jsd]' | Core + scipy, nltk | 310 MB
pip install 'gaico[audio,jsd,cosine,bertscore]' | Core + all dependencies from above | 1.0 GB

For Developers (Installing from source)

If you want to contribute to GAICo or install it from source for development:

  1. Clone the repository:

    git clone https://github.com/ai4society/GenAIResultsComparator.git
    cd GenAIResultsComparator
  2. Set up a virtual environment and install dependencies:

    We recommend using UV for fast environment and dependency management.

    # Create a virtual environment (Python 3.10-3.12 recommended)
    uv venv
    # Activate the environment
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    # Install in editable mode with all development dependencies
    uv pip install -e ".[dev]"

    If you prefer not to use uv, you can use pip:

    # Create a virtual environment (Python 3.10-3.12 recommended)
    python3 -m venv .venv
    # Activate the environment
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    # Install the package in editable mode with development extras
    pip install -e ".[dev]"

    The dev extra installs GAICo with all optional features, plus dependencies for testing, linting, and documentation.

  3. Set up pre-commit hooks (recommended for contributors):

    Pre-commit hooks help maintain code quality by running checks automatically before you commit.

    pre-commit install

Project Structure

The project structure is as follows:

.
├── README.md
├── LICENSE
├── .gitignore
├── uv.lock
├── pyproject.toml
├── project_macros.py        # Used by mkdocs-macros-plugin (documentation)
├── PYPI_DESCRIPTION.MD      # The PyPI description file
├── .pre-commit-config.yaml  # Pre-Commit Hooks
├── .mkdocs.yml              # Configuration for mkdocs (documentation)
├── gaico/                   # Contains the library code
├── examples/                # Contains example scripts
├── tests/                   # Contains tests
├── scripts/                 # Contains scripts for GitHub deployment and markdown generation
├── docs/                    # Contains documentation files
└── .github/workflows/       # Contains workflows for deploying to PyPI and the documentation site

Code Style

We use pre-commit hooks to maintain code quality and consistency. The configuration for these hooks is in the .pre-commit-config.yaml file. These hooks run automatically on git commit, but you can also run them manually:

pre-commit run --all-files

Running Tests

Navigate to the project root and use uv to run the test suite:

# Run all tests
uv run pytest

# For more verbose output
uv run pytest -v

# If pytest gives import errors:
uv run -m pytest

Tip

Targeting Specific Tests: You can run or skip tests based on markers. For example, the BERTScore tests are marked as bertscore because they can be slow.

# Skip the slow BERTScore tests
uv run pytest -m "not bertscore"

# Run ONLY the BERTScore tests
uv run pytest -m bertscore

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/FeatureName)
  3. Commit your changes (git commit -m 'Add some FeatureName')
  4. Push to the branch (git push origin feature/FeatureName)
  5. Open a Pull Request

Please ensure that your code passes all tests and adheres to our code style guidelines (enforced by pre-commit hooks) before submitting a pull request.

Citation

If you find this project useful, please consider citing it in your work:

@software{AI4Society_GAICo_GenAI_Results,
  author = {Nitin Gupta and Pallav Koppisetti and Kausik Lakkaraju and Biplav Srivastava},
  license = {MIT},
  title = {{GAICo: GenAI Results Comparator}},
  year = {2025},
  url = {https://github.com/ai4society/GenAIResultsComparator}
}

Acknowledgments

  • The library is developed by Nitin Gupta, Pallav Koppisetti, Kausik Lakkaraju, and Biplav Srivastava. Members of AI4Society contributed to this tool as part of ongoing discussions. Major contributors are credited.
  • This library uses several open-source packages including NLTK, scikit-learn, and others. Special thanks to the creators and maintainers of the implemented metrics.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

If you have any questions, feel free to reach out to us at ai4societyteam@gmail.com.
