R software for robust inference with LLMs. It accompanies the following working paper by Bisbee and Spirling:
What to Do When Humans Are No Longer the Gold Standard: Large Language Models, State of the Art and Robustness
The abstract is as follows:
In this short paper, we consider the research implications of large language model (LLM) capabilities approaching, perhaps exceeding, those of highly-trained humans. Specifically, we note that frontier LLMs demonstrate near-expert performance for many data annotation tasks, and they are getting better over time. We show what this will mean for inference in downstream tasks: optimistically, it is that estimated treatment effects will become larger, although claimed null effects may be more dubious. We argue that authors should focus more on sensitivity and robustness with respect to future technological change, and we demonstrate how to use local calibration for such problems. We discuss how our findings, combined with the fact that performance is inherently bounded above (at 100%), should affect debates on the importance of using proprietary “State of the Art” versus open-weight, replicable LLMs. We make available fast and free software (futureProofR) for implementing our suggestions.
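To make the abstract's claim about larger estimated treatment effects concrete, here is a minimal base-R sketch. It does not use the futureProofR API and is not the paper's model. It assumes a binary outcome annotated with symmetric error, so each label is correct with probability `accuracy`; the names `expected_effect` and `simulate_effect` and all numeric values are hypothetical. Under that assumption the expected difference in means is `(2 * accuracy - 1)` times the true effect, so it rises toward the true effect as annotation accuracy approaches 100%.

```r
# Illustrative sketch only: not part of futureProofR and not the paper's model.
# Assumption: a binary outcome is annotated with symmetric error, i.e. each
# label is correct with probability `accuracy` regardless of its true value.

# Expected difference in means when labels are flipped with prob. 1 - accuracy.
expected_effect <- function(accuracy, true_effect) {
  (2 * accuracy - 1) * true_effect
}

set.seed(1)
n <- 100000
treat  <- rbinom(n, 1, 0.5)                 # randomized binary treatment
y_true <- rbinom(n, 1, 0.3 + 0.2 * treat)   # true effect on P(Y = 1) is 0.2

# Difference in means computed from noisily annotated outcomes.
simulate_effect <- function(accuracy) {
  flip  <- rbinom(n, 1, 1 - accuracy)       # 1 = annotator mislabels this unit
  y_obs <- ifelse(flip == 1, 1 - y_true, y_true)
  mean(y_obs[treat == 1]) - mean(y_obs[treat == 0])
}

accuracies <- c(0.7, 0.8, 0.9, 1.0)
results <- data.frame(
  accuracy  = accuracies,
  expected  = expected_effect(accuracies, 0.2),
  simulated = sapply(accuracies, simulate_effect)
)
print(results)
```

The `expected` column is the analytic value under the symmetric-error assumption, and the `simulated` column should track it up to sampling noise. Asymmetric or covariate-dependent annotation error would need a different correction.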
Comments are very welcome!