fkeck/gpt_interactions

Code and data for the manuscript

This repository provides R code and data to reproduce results and figures from the manuscript:

Keck, F. et al. Extracting massive ecological data on state and interactions of species using large language models (2025).

System requirements

The code was developed and tested on Linux, and some of the tools used are specific to that platform. The R environment on which the code was tested is fully described in /session_info.txt.

Installation

Install R packages from CRAN and GitHub:

install.packages(c("tidyverse", "openalexR", "httr2", "dplyr",
"readr", "tidyr", "curl", "jsonlite", "stringdist", "cli",
"R.utils", "tidygraph", "ggraph", "patchwork", "tidyheatmaps",
"RColorBrewer", "pheatmap", "igraph", "htmltools",
"xml2", "magrittr", "taxize", "pluralize", "stringr", "wikitaxa",
"rvest"))

remotes::install_github("fkeck/flexitarian")

Using the binary packages provided by CRAN, installation takes a few minutes on a standard computer. The package versions used to generate the results in the manuscript are listed in /session_info.txt.

Python 3.10+ and the libraries spaCy and TaxoNERD, with the model en_core_eco_biobert_weak, are needed to perform named entity recognition (NER).

The command-line tools tar and jq are also required.
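The presence of these tools can be confirmed with a small shell check (not part of the original pipeline):

```shell
# Report whether each required command-line tool is on the PATH.
for tool in tar jq; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING -- install it before running the pipeline"
  fi
done
```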

Data

Data can be found in the /data directory. The main data file (data/save_R_gdata_2.csv.tar.gz) must be uncompressed before running the analyses. The raw text content of the processed publications can be found on PubMed OA. The scripts used to download, preprocess, and generate the raw and intermediate files are in the /R directory.

Guide to Reproduce the Analyses

The analysis is organized into modular R scripts located in the /R directory. To reproduce the results of the manuscript, we recommend the following workflow:

  1. Prepare the data
    • Unpack the main dataset:

      tar -xvzf data/save_R_gdata_2.csv.tar.gz
    • Ensure that all required R packages (see above) and Python tools (spaCy + TaxoNERD) are installed and available.

  2. Run the scripts in order
    • The numbered R scripts (01_*.R, 02_*.R, etc.) are organized to follow the logical flow of the analysis.
    • You can run each script independently in an R session, but note that some scripts require data generated by previous scripts, so the order is important.
  3. Reproduce the figures
    • Output figures are generated in the /Figures directory during script execution.

Note 1: Due to API usage limits and costs, the scripts do not include calls to the OpenAI API for interaction extraction. However, all input texts and extracted outputs used in this study are provided in the /data folder and the prompts are dynamically generated by the provided scripts.

Note 2: The complete execution can take several hours.

Indexing

This dataset is configured to be indexed by Global Biotic Interactions (GloBI, https://globalbioticinteractions.org).
