Herbarium-Spectra: Codes for spectral data processing, trait prediction, and taxonomic classification
Replication scripts for the manuscript, Seeing herbaria in a new light: leaf reflectance spectroscopy unlocks predictive trait and classification modeling in plant biodiversity collections published in New Phytologist, 04 July 2025
📖 Access the paper: https://doi.org/10.1111/nph.70357
Authors: Dawson M. White, Jeannine Cavender-Bares, Charles C. Davis, J. Antonio Guzmán Q., Shan Kothari, Jorge M. Robles, José Eduardo Meireles
The following sources provided funding essential to this project: Harvard University Herbaria, NSF BII DBI-2021898, NSF REU-2150058
Use of this code is restricted to non-commercial, academic purposes.
For other use cases, please contact Erythroxylum.
To replicate the analyses in this repository:
- Clone the repository using Git or by downloading the ZIP file from GitHub.
- Create a folder named
data/in the root directory of the repository. - Download the dataset from the Plant Diversity and Ecology collection in Harvard Dataverse:
https://doi.org/10.7910/DVN/LXPHBC
The folder titled DMWhiteHUHspec1_processed_spectra_files contains six files, comprising the reflectance spectra, vector-normalized spectra, and continuous wavelet transformed spectra for both the herbarium dataset and for the pressed leaf dataset. - Copy the downloaded spectra filesa into the
data/folder you created.
herbarium-spectra.Rproj- An R project file with directory organization.R/— Main analysis scripts for trait prediction and taxonomic classification analysesauxiliary/— Supporting functions and utilities called by scripts inR/plotting/— Scripts for visualizing resultsphylogram_pd_TimeTree5.tre— A time-calibrated phylogeny of the 25 study species. This tree is generated in theR/phylogenetic_distance.Rscript and is called to generate Fig. 5.
plotting/plotting_functions_spectra.R Plot means, quartiles, coefficient of variation of spectra (Fig. S3)
R/Spectrolab_process_data.R
Scripts for combining spectra and metadata files, vector normalization, metadata manipulation
R/spectra_transform.R
Scripts for FWHM band resampling, band trimming, and continuous wavelet transformations.
The main script for trait estimation is:
R/leaf_trait_prediction.R
This script takes as input either the Kothari dataset or the herbarium dataset, sets the metadata accordingly, and runs the auxiliary functions to estimate traits and generate output files for analysis.
The script generates the following files:
pls_*_split.rdsTesting and validation sample splits.pls_*_segments.rdsDataset iterations selecting random spectra per individual, based on theaccessionmetadata column.*_opt_comp.csv.txtOptimal number of components calculated using the one-sigma method from themodel_tune.Rfunction.*plot.pdfPlots showing ncomp optimization evaluation metrics (PRESS, R^2, RMSEP).pls_*_opt_comp_models.csvCross-validation models frommodel_tune.*performance.csvTable containing statistics across accessions (nsamples, R², Bias, RMSE, %RMSE, Intercept, Slope)*obs-pred.csvData for plotting observed vs. predicted trait values and sample metadata.pls_*_coefficients.csvModel coefficients across segments.pls_*_vip.csvVariable importance in projection (VIP) scores across segments.
Optional:pls_*_final_models.rdsFinal models across segments generated by themodel_build.Rfunction. Not written by default.
Script to predict LMA from coefficients derived from different band regions or transformed spectra and output performance metrics (for Fig. 3, Table S1, & Table 4).
R/prediction_from_coefficients_LMA.R
Script to predict values for other traits modeled by Kothari et al. 2023 (for Fig. 4 & Table S2)
R/prediction_from_coefficients_other_traits.R
plotting/plotting_functions_traits.R
Reflectance, coefficients and VIP plots (Fig. 2), Trait observed vs predicted biplots (Fig. 3), violin plots of predicted vs observed trait values (Fig. 4)
The main scripts for trait estimation is:
R/leaf_classification_plsda.R
R/leaf_classification_lda.R
These scripts take the herbarium dataset as input, set the metadata accordingly, and run the auxiliary functions to classify at different taxonomic levels using plsda or lda. This can be redirected to classify at the genus or other taxonomic level, or to subset species for direct comparison with the 10 spp. PLS-DA model in Kothari et al. 2023.
The scripts generate the following files:
classification_split.rdsTesting and validation sample splits.classification_segments.rdsDataset iterations selecting random spectra per individual, based on theaccessionmetadata column.ncomp_plsda.txtPrinted output of components at max highest accuracy, one-sigma method, and value input for final model.opt_comp_models_plsda.csv.csvCross-validation models frommodel_tune.classification_*_models.rdsFinal models across segments generated by theauxiliary/model_build.Rfunction. Not printed by defaultperformance_*.csvTable containing performance statistics across accessions.coefficients_*.rdsModel coefficients across segments.varImp_*.csvVariable importance in projection (VIP) scores across segments.CM_*.rdsData necessary for constructing confusion matrices usingplotting/plotting_functions_classification.R.
plotting/plotting_functions_classification.R
Classification performance (Table 5), Confusion matrices (Fig. 5), and VIP plots (Figs. S4, S5).
R/phylogenetic_distance.R
Phylogenetic diversity metrics and nearest taxon distances are calculated and table is output to metadata using the phylogenetic distances script.
R/prediction_classification_analysis-factors_plotting.R
Scripts for predicting classes based on PLS-DA models generated with R/leaf_classification_plsda.R.
The script then contains various plotting functions for the analyses of herbarium factors on classification probabilities and accuracy (Figs. 6 & 7, Table 6, and SI materials).