fkeck/gpt_interactions

Code and data for the manuscript

This repository provides R code and data to reproduce results and figures from the manuscript:

Keck, F. et al. Extracting massive ecological data on state and interactions of species using large language models (2025).

System requirements

The code was developed and tested on Linux, and some of the tools used are specific to that platform. The R environment on which the code was tested is fully described in /session_info.txt.

Installation

Install R packages from CRAN and GitHub:

install.packages(c("tidyverse", "openalexR", "httr2", "dplyr",
"readr", "tidyr", "curl", "jsonlite", "stringdist", "cli",
"R.utils", "tidygraph", "ggraph", "patchwork", "tidyheatmaps",
"RColorBrewer", "pheatmap", "igraph", "htmltools",
"xml2", "magrittr", "taxize", "pluralize", "stringr", "wikitaxa",
"rvest"))

remotes::install_github("fkeck/flexitarian")

Using the binary packages provided by CRAN, installation takes a few minutes on a standard computer. The package versions used to generate the results in the manuscript are listed in /session_info.txt.

Python 3.10+ and the libraries spaCy and TaxoNERD, with the model en_core_eco_biobert_weak, are needed to perform named entity recognition (NER).

The command-line tools tar and jq are also required.
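The presence of these tools can be confirmed with a small shell check (not part of the original pipeline):

```shell
# Report whether each required command-line tool is on the PATH.
for tool in tar jq; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING -- install it before running the pipeline"
  fi
done
```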

Data

Data can be found in the /data directory. The main data file (data/save_R_gdata_2.csv.tar.gz) must be uncompressed before running the analyses. The raw text content of the processed publications can be found on PubMed OA. The scripts used to download, preprocess, and generate the raw and intermediate files are in the /R directory.

Guide to Reproduce the Analyses

The analysis is organized into modular R scripts located in the /R directory. To reproduce the results of the manuscript, we recommend the following workflow:

  1. Prepare the data
    • Unpack the main dataset:

      tar -xvzf data/save_R_gdata_2.csv.tar.gz
    • Ensure that all required R packages (see above) and Python tools (spaCy + TaxoNERD) are installed and available.

  2. Run the scripts in order
    • The numbered R scripts (01_*.R, 02_*.R, etc.) are organized to follow the logical flow of the analysis.
    • You can run each script independently in an R session, but note that some scripts require data generated by previous scripts, so the order is important.
  3. Reproduce the figures
    • Output figures are generated in the /Figures directory during script execution.

Note 1: Due to API usage limits and costs, the scripts do not include calls to the OpenAI API for interaction extraction. However, all input texts and extracted outputs used in this study are provided in the /data folder and the prompts are dynamically generated by the provided scripts.

Note 2: The complete execution can take several hours.

Indexing

This dataset is configured to be indexed by Global Biotic Interactions (GloBI, https://globalbioticinteractions.org).
