Skip to content

Erythroxylum/herbarium-spectra

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Herbarium-Spectra: Codes for spectral data processing, trait prediction, and taxonomic classification

Replication scripts for the manuscript, Seeing herbaria in a new light: leaf reflectance spectroscopy unlocks predictive trait and classification modeling in plant biodiversity collections published in New Phytologist, 04 July 2025
📖 Access the paper: https://doi.org/10.1111/nph.70357

Authors: Dawson M. White, Jeannine Cavender-Bares, Charles C. Davis, J. Antonio Guzmán Q., Shan Kothari, Jorge M. Robles, José Eduardo Meireles

The following sources provided funding essential to this project: Harvard University Herbaria, NSF BII DBI-2021898, NSF REU-2150058

Use of this code is restricted to non-commercial, academic purposes.
For other use cases, please contact Erythroxylum.


How to Use

To replicate the analyses in this repository:

  1. Clone the repository using Git or by downloading the ZIP file from GitHub.
  2. Create a folder named data/ in the root directory of the repository.
  3. Download the dataset from the Plant Diversity and Ecology collection in Harvard Dataverse:
    https://doi.org/10.7910/DVN/LXPHBC
    The folder titled DMWhiteHUHspec1_processed_spectra_files contains six files, comprising the reflectance spectra, vector-normalized spectra, and continuous wavelet transformed spectra for both the herbarium dataset and for the pressed leaf dataset.
  4. Copy the downloaded spectra filesa into the data/ folder you created.

Repository Structure

  • herbarium-spectra.Rproj - An R project file with directory organization.
  • R/ — Main analysis scripts for trait prediction and taxonomic classification analyses
  • auxiliary/ — Supporting functions and utilities called by scripts in R/
  • plotting/ — Scripts for visualizing results
  • phylogram_pd_TimeTree5.tre — A time-calibrated phylogeny of the 25 study species. This tree is generated in the R/phylogenetic_distance.R script and is called to generate Fig. 5.

Analysis Workflow

Plot spectra

plotting/plotting_functions_spectra.R Plot means, quartiles, coefficient of variation of spectra (Fig. S3)

Data processing

R/Spectrolab_process_data.R

Scripts for combining spectra and metadata files, vector normalization, metadata manipulation

R/spectra_transform.R

Scripts for FWHM band resampling, band trimming, and continuous wavelet transformations.

Trait Estimation

The main script for trait estimation is:

R/leaf_trait_prediction.R

This script takes as input either the Kothari dataset or the herbarium dataset, sets the metadata accordingly, and runs the auxiliary functions to estimate traits and generate output files for analysis.

Data Output

The script generates the following files:

  • pls_*_split.rds Testing and validation sample splits.
  • pls_*_segments.rds Dataset iterations selecting random spectra per individual, based on the accession metadata column.
  • *_opt_comp.csv.txt Optimal number of components calculated using the one-sigma method from the model_tune.R function.
  • *plot.pdf Plots showing ncomp optimization evaluation metrics (PRESS, R^2, RMSEP).
  • pls_*_opt_comp_models.csv Cross-validation models from model_tune.
  • *performance.csv Table containing statistics across accessions (nsamples, R², Bias, RMSE, %RMSE, Intercept, Slope)
  • *obs-pred.csv Data for plotting observed vs. predicted trait values and sample metadata.
  • pls_*_coefficients.csv Model coefficients across segments.
  • pls_*_vip.csv Variable importance in projection (VIP) scores across segments.
    Optional:
  • pls_*_final_models.rds Final models across segments generated by the model_build.R function. Not written by default.

Trait prediction from coefficients

Script to predict LMA from coefficients derived from different band regions or transformed spectra and output performance metrics (for Fig. 3, Table S1, & Table 4).

R/prediction_from_coefficients_LMA.R

Script to predict values for other traits modeled by Kothari et al. 2023 (for Fig. 4 & Table S2)

R/prediction_from_coefficients_other_traits.R

Plotting scripts

  • plotting/plotting_functions_traits.R

Reflectance, coefficients and VIP plots (Fig. 2), Trait observed vs predicted biplots (Fig. 3), violin plots of predicted vs observed trait values (Fig. 4)

Taxonomic classification

The main scripts for trait estimation is:

R/leaf_classification_plsda.R R/leaf_classification_lda.R

These scripts take the herbarium dataset as input, set the metadata accordingly, and run the auxiliary functions to classify at different taxonomic levels using plsda or lda. This can be redirected to classify at the genus or other taxonomic level, or to subset species for direct comparison with the 10 spp. PLS-DA model in Kothari et al. 2023.

Data Output

The scripts generate the following files:

  • classification_split.rds Testing and validation sample splits.
  • classification_segments.rds Dataset iterations selecting random spectra per individual, based on the accession metadata column.
  • ncomp_plsda.txt Printed output of components at max highest accuracy, one-sigma method, and value input for final model.
  • opt_comp_models_plsda.csv.csv Cross-validation models from model_tune.
  • classification_*_models.rds Final models across segments generated by the auxiliary/model_build.R function. Not printed by default
  • performance_*.csv Table containing performance statistics across accessions.
  • coefficients_*.rds Model coefficients across segments.
  • varImp_*.csv Variable importance in projection (VIP) scores across segments.
  • CM_*.rds Data necessary for constructing confusion matrices using plotting/plotting_functions_classification.R.

Plotting scripts

  • plotting/plotting_functions_classification.R

Classification performance (Table 5), Confusion matrices (Fig. 5), and VIP plots (Figs. S4, S5).

Analyses of herbarium factors on classification model success

  • R/phylogenetic_distance.R

Phylogenetic diversity metrics and nearest taxon distances are calculated and table is output to metadata using the phylogenetic distances script.

  • R/prediction_classification_analysis-factors_plotting.R

Scripts for predicting classes based on PLS-DA models generated with R/leaf_classification_plsda.R. The script then contains various plotting functions for the analyses of herbarium factors on classification probabilities and accuracy (Figs. 6 & 7, Table 6, and SI materials).

About

Trait estimation and taxonomic discrimination of herbarium leaves with spectra.

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •  

Languages