
CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning

Evaluation code accompanying the paper

CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning
ICLR 2025
Paper | Poster | Slides

TL;DR: We introduce CURIE (Scientific Long Context Understanding, Reasoning and Information Extraction), a benchmark with 10 tasks from 6 science domains specifically designed to test the ability of LLMs to assist scientists in realistic workflows.

The CURIE benchmark encompasses 10 tasks, with a total of 580 input and solution pairs based on 429 research documents across six diverse scientific disciplines: materials science, theoretical condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins, covering both experimental and theoretical aspects of scientific research. The average length of the input queries in CURIE is about 15k words, and the ground truth responses contain on average 954 words.

Figure: (a) the 10 CURIE tasks, with 580 input and solution pairs based on 429 research documents across six scientific disciplines; (b) distribution of input query lengths (about 15k words on average); (c) distribution of ground truth response lengths (954 words on average).

πŸ—„οΈ Data

Our data is organized into eight domain-specific subfolders: "biogr", "dft", "pdb", "geo", "mpve", "qecc_65", "hfd", and "hfe". Each subfolder contains two further subfolders: "ground_truth" and "inputs". Within these, each data instance is stored in a JSON file named record_id.json, where record_id is a unique identifier. The "biogr" domain also includes image inputs as record_id.png files alongside the corresponding JSON.

data
├── domain
│   ├── inputs
│   │   └── record_id.json
│   └── ground_truth
│       └── record_id.json
└── difficulty_levels.json

Ground truth data varies in structure and content across domains, but all files consistently include a record_id field matching the filename. Input files have a uniform structure across all domains, containing both a record_id field and a text field representing the input text to LLMs.
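As a concrete illustration of this layout, here is a minimal sketch (not part of the released code) that pairs each input file in one domain with its ground-truth file; the choice of "dft" as the domain is just an example.

```python
import json
from pathlib import Path

root_path = Path("data")   # path to the data folder
domain = "dft"             # one of: biogr, dft, pdb, geo, mpve, qecc_65, hfd, hfe

for input_path in sorted((root_path / domain / "inputs").glob("*.json")):
    record_id = input_path.stem
    example = json.loads(input_path.read_text())       # has "record_id" and "text" fields
    gt_path = root_path / domain / "ground_truth" / f"{record_id}.json"
    ground_truth = json.loads(gt_path.read_text())     # structure varies by domain
    print(record_id, len(example["text"].split()), "input words")
```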

For the "biogr" (geo-referencing) task, for 114 of the 138 examples, we release additional data including the PDF papers that each image was taken from along with other metadata in this Github repo: https://github.com/google-research/ecology-georeferencing

🧪 Running inference

An example Colab notebook that runs inference by iterating over all examples and prompts for all tasks is provided at code/curie_inference.ipynb. To execute it (a rough sketch of the loop appears below):

  • Add your API key for the model.
  • Connect to the default runtime ("Python 3 Google Compute Engine backend").
  • In the "params" cell, configure the following:
      root_path: Path to the data folder.
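The sketch below gives the gist of the loop the notebook performs; query_llm and the output layout are illustrative assumptions, not the notebook's exact API.

```python
# Illustrative only: iterate over every domain's inputs and store one model
# response per record. Replace query_llm with a real model call.
import json
from pathlib import Path

root_path = Path("data")                     # the root_path value from the "params" cell
domains = ["biogr", "dft", "pdb", "geo", "mpve", "qecc_65", "hfd", "hfe"]

def query_llm(text: str) -> str:
    # Stub: call the model of your choice here, using your API key.
    return "MODEL RESPONSE GOES HERE"

for domain in domains:
    out_dir = Path("outputs") / domain       # illustrative output location
    out_dir.mkdir(parents=True, exist_ok=True)
    for input_path in sorted((root_path / domain / "inputs").glob("*.json")):
        record = json.loads(input_path.read_text())
        response = query_llm(record["text"])             # task prompt + paper text
        out_path = out_dir / f"{record['record_id']}.json"
        out_path.write_text(json.dumps(
            {"record_id": record["record_id"], "response": response}))
```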

🧪 Running eval

Our evaluation Colab notebook is provided at code/curie_run_eval.ipynb. To execute it:

  • Connect to the default runtime ("Python 3 Google Compute Engine backend").
  • In the "params" cell, configure the following (an illustrative example appears below):
      root_path: Path to the data folder.
      domain: The target domain (e.g., "biogr", "dft").
      llm: The Large Language Model to evaluate.
      prompt: The prompt used for the LLM.
      record_id: The ID of the record to evaluate.
  • Run the Colab. Evaluation metrics will be printed at the end of the notebook.

Note: Evaluating the "dft" and "mpve" tasks using the LLMSim score requires querying LLMs and therefore requires setting up a Google API key.
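For reference, the "params" cell might be filled in roughly as below; the variable names mirror the list above, but the concrete values (the domain, model name, prompt name, and record ID) are placeholders rather than values taken from the notebook.

```python
# Illustrative "params" values for code/curie_run_eval.ipynb (placeholders only).
root_path = "data"              # path to the data folder
domain = "dft"                  # target domain, e.g. "biogr", "dft", ...
llm = "my_model_name"           # the LLM whose responses are being evaluated
prompt = "my_prompt_name"       # the prompt that was used for the LLM
record_id = "my_record_id"      # ID of the record to evaluate
```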

📊 Generating tables and plots

To generate the tables and plots in the paper, use the notebook code/curie_generate_tables_figures.ipynb.

πŸ“ TODOs

  • Release responses by baselines to fully reproduce the reported numbers.
  • Add folder with data.
  • Update evals to include all metrics.
  • Example Colab to run inference.
  • Colab to run evaluation.
  • Colab to generate all plots and tables.

βœ‰οΈ Contact

This repository was created and is maintained by Subhashini. Questions and discussions are welcome via GitHub issues.

πŸ™ Acknowledgements

We are grateful to the many domain experts who have contributed to the creation of the benchmark and evaluations.

📄 License

Code in this GitHub repository is licensed under the Apache 2.0 License.

🎓 Citing CURIE

@inproceedings{cui2025curie,
  title={CURIE: Evaluating LLMs on Multitask Scientific Long-Context Understanding and Reasoning},
  author={Cui, Hao and Shamsi, Zahra and Cheon, Gowoon and Ma, Xuejian and Li, Shutong and Tikhanovskaya, Maria and Norgaard, Peter Christian and Mudur, Nayantara and Plomecka, Martyna Beata and Raccuglia, Paul and others},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}

This is not an officially supported Google product.
