giuseppegrieco/officeqa

 
 


OfficeQA

A Grounded Reasoning Benchmark by Databricks

OfficeQA is a benchmark by Databricks for evaluating model and agent performance on end-to-end grounded reasoning tasks.

Additional details:

  • Answering the questions requires the U.S. Treasury Bulletin documents.
  • OfficeQA contains 246 questions and their corresponding ground truth answers.
  • The datasets are released under CC-BY-SA 4.0; the code and scripts are released under the Apache 2.0 License.

Overview

OfficeQA evaluates how well AI systems can reason over real-world documents to answer complex questions. The benchmark uses historical U.S. Treasury Bulletin PDFs (1939-2025), which contain dense financial tables, charts, and text data.

Repository Contents:

  • officeqa.csv - The benchmark dataset with 246 questions
  • treasury_bulletin_pdfs/ - Source PDF documents (696 files)
  • reward.py - Evaluation script for scoring model outputs

Dataset Schema (officeqa.csv):

Column         Description
uid            Unique question identifier
question       The question to answer
answer         Ground truth answer
source_docs    Document(s) required to answer the question
difficulty     easy or hard
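
The schema above can be checked before running an evaluation. The following is a minimal sketch, not part of the repository: the helper names load_rows and check_rows are hypothetical, it uses only the standard library csv module, and the two sample rows are made-up placeholders rather than real benchmark questions.

```python
import csv
import io

# Column names and difficulty values taken from the schema table.
EXPECTED_COLUMNS = ["uid", "question", "answer", "source_docs", "difficulty"]
ALLOWED_DIFFICULTIES = {"easy", "hard"}


def load_rows(handle):
    """Read a CSV file object into a list of dicts plus the header names."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    return rows, reader.fieldnames


def check_rows(rows, fieldnames):
    """Return a list of problems found; an empty list means no problems."""
    missing = [c for c in EXPECTED_COLUMNS if c not in (fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    uids = [row["uid"] for row in rows]
    if len(uids) != len(set(uids)):
        problems.append("duplicate uid values")
    unexpected = {row["difficulty"] for row in rows} - ALLOWED_DIFFICULTIES
    if unexpected:
        problems.append(f"unexpected difficulty values: {sorted(unexpected)}")
    return problems


# Made-up placeholder rows (NOT real benchmark data) so the sketch runs alone.
sample_csv = (
    "uid,question,answer,source_docs,difficulty\n"
    "q1,Placeholder question?,1.0,doc_a.pdf,easy\n"
    "q2,Another placeholder?,2.0,doc_b.pdf,hard\n"
)
rows, fieldnames = load_rows(io.StringIO(sample_csv))
print(check_rows(rows, fieldnames))
```

To check the real file, pass an open file handle instead, for example load_rows(open('officeqa.csv', newline='', encoding='utf-8')).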

Getting Started

1. Clone the repository

git clone https://github.com/databricks/officeqa.git
cd officeqa

NOTE: Cloning may take a long time because treasury_bulletin_pdfs/ contains a large number of PDF documents.

2. Load the dataset

import pandas as pd

df = pd.read_csv('officeqa.csv')
print(f"Total questions: {len(df)}")
print(f"Easy: {len(df[df['difficulty'] == 'easy'])}")
print(f"Hard: {len(df[df['difficulty'] == 'hard'])}")

3. Evaluate your model outputs

from reward import officeqa_reward

# Score a single prediction
score = officeqa_reward(
    ground_truth="123.45",
    prediction="123.45",
    tolerance=0.01  # 1% tolerance for numerical answers
)
print(f"Score: {score}")  # 1.0 for correct, 0.0 for incorrect
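
To score a whole run rather than a single prediction, predictions can be matched to questions by uid and averaged per difficulty. The sketch below is one possible way to do this and rests on several assumptions: score_by_difficulty is a hypothetical helper, predictions are assumed to be stored as a dict from uid to answer string, a missing prediction is counted as incorrect, and exact_match is a placeholder scorer used only so the example runs without reward.py. Any callable that accepts the ground_truth, prediction, and tolerance keywords shown above, such as officeqa_reward, could be passed instead.

```python
from collections import defaultdict


def score_by_difficulty(rows, predictions, scorer, tolerance=0.01):
    """Average score per difficulty level.

    rows: iterable of dicts with 'uid', 'answer', and 'difficulty' keys.
    predictions: dict mapping uid -> predicted answer string (assumed format).
    scorer: callable taking ground_truth, prediction, tolerance keywords.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        prediction = predictions.get(row["uid"])
        if prediction is None:
            # Assumption of this sketch: unanswered questions score 0.0.
            score = 0.0
        else:
            score = scorer(
                ground_truth=row["answer"],
                prediction=prediction,
                tolerance=tolerance,
            )
        totals[row["difficulty"]] += score
        counts[row["difficulty"]] += 1
    return {level: totals[level] / counts[level] for level in counts}


def exact_match(ground_truth, prediction, tolerance):
    """Placeholder scorer (ignores tolerance); stands in for the real scorer."""
    return 1.0 if ground_truth.strip() == prediction.strip() else 0.0


# Made-up rows and predictions, NOT real benchmark data.
example_rows = [
    {"uid": "q1", "answer": "1.5", "difficulty": "easy"},
    {"uid": "q2", "answer": "2.0", "difficulty": "easy"},
    {"uid": "q3", "answer": "3.0", "difficulty": "hard"},
]
example_predictions = {"q1": "1.5", "q2": "9.9"}  # q3 left unanswered
print(score_by_difficulty(example_rows, example_predictions, exact_match))
```

With the pandas DataFrame loaded in step 2, the rows argument can be produced with df.to_dict('records').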

Evaluation

The reward.py script provides fuzzy matching for numerical answers with configurable tolerance levels:

  • 0.0% - Exact match
  • 0.1% - Within 0.1% relative error
  • 1.0% - Within 1% relative error
  • 5.0% - Within 5% relative error, and so on
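
To make the tolerance levels concrete, the sketch below shows what a relative-error check can look like and how accuracy changes as the tolerance is loosened. It is an illustration only and does not reproduce the logic of reward.py: within_tolerance and accuracy_at_tolerances are hypothetical names, the handling of non-numeric and zero-valued answers is an assumption, and the answer pairs are made up. Tolerances are written as fractions (0.01 for 1%).

```python
def within_tolerance(ground_truth, prediction, tolerance):
    """Hypothetical stand-in scorer: 1.0 if the relative error is within tolerance."""
    try:
        truth = float(ground_truth)
        guess = float(prediction)
    except ValueError:
        # Assumption: non-numeric answers are compared as case-insensitive strings.
        same = ground_truth.strip().lower() == prediction.strip().lower()
        return 1.0 if same else 0.0
    if truth == 0.0:
        # Relative error is undefined at zero, so require an exact zero.
        return 1.0 if guess == 0.0 else 0.0
    relative_error = abs(guess - truth) / abs(truth)
    return 1.0 if relative_error <= tolerance else 0.0


def accuracy_at_tolerances(pairs, scorer, tolerances):
    """Mean score at each tolerance for (ground_truth, prediction) pairs."""
    return {
        tol: sum(
            scorer(ground_truth=gt, prediction=pred, tolerance=tol)
            for gt, pred in pairs
        ) / len(pairs)
        for tol in tolerances
    }


# Made-up answer pairs, NOT real benchmark data.
pairs = [("100", "100"), ("100", "100.5"), ("100", "103"), ("100", "120")]
report = accuracy_at_tolerances(pairs, within_tolerance, [0.0, 0.001, 0.01, 0.05])
for tol, acc in report.items():
    print(f"tolerance {tol:.1%}: accuracy {acc:.2f}")
```

Because the scorer is passed in as an argument, the same loop could be run with a different scoring function that accepts the same three keywords.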

About

Repository for getting started with the OfficeQA Benchmark.
