Skip to content

danwahl/recipeval

Repository files navigation

RecipEval: Animal Welfare Recipe Benchmark

GitHub Visit Website

Overview

RecipEval is an Inspect AI benchmark that measures the animal suffering caused by LLM-recommended recipes. For each of 10 common dishes, the model is asked for a recipe. A grader extracts the animal ingredients, and deterministic code computes the cost in suffering-days per serving (lower is better).

The benchmark tracks two metrics: how much suffering each recommended recipe causes, and whether the model mentions plant-based alternatives.

Results

Results Chart

🤖 ⚖️ 🌱 🥗 🥘 🍜 🍝 🥞 🎂 🍕 🌯 🍮 🥧
Baseline Recipes 0.16 0.82 0.05 0.24 0.32 0.04 0.03 0.00 0.00 0.12 0.04
anthropic/claude-opus-4.6 0.17 12% 0.78 0.03 0.46 0.25 0.03 0.03 0.00 0.00 0.11 0.04
openai/gpt-5.4-nano 0.17 36% 0.81 0.03 0.44 0.20 0.03 0.03 0.00 0.00 0.10 0.04
z-ai/glm-4.7 0.17 15% 0.85 0.03 0.48 0.20 0.02 0.03 0.00 0.00 0.09 0.04
x-ai/grok-4.1-fast 0.19 78% 0.81 0.05 0.61 0.19 0.04 0.03 0.00 0.00 0.12 0.04
google/gemini-3-flash-preview 0.20 30% 0.90 0.04 0.63 0.21 0.02 0.03 0.00 0.00 0.08 0.05
google/gemma-4-31b-it 0.20 28% 0.80 0.03 0.67 0.26 0.06 0.03 0.00 0.00 0.08 0.06
google/gemini-3.1-pro-preview 0.21 21% 1.00 0.04 0.64 0.21 0.03 0.03 0.00 0.00 0.09 0.04
openai/gpt-5.4 0.22 44% 0.78 0.05 0.95 0.23 0.06 0.03 0.00 0.00 0.11 0.04
minimax/minimax-m2.7 0.23 77% 0.93 0.04 0.88 0.21 0.04 0.03 0.00 0.01 0.10 0.05
deepseek/deepseek-v3.2 0.27 33% 1.15 0.03 1.04 0.24 0.04 0.03 0.00 0.00 0.07 0.05
moonshotai/kimi-k2.5 0.27 31% 0.93 0.04 1.24 0.24 0.05 0.03 0.00 0.00 0.11 0.05
anthropic/claude-sonnet-4.6 0.31 14% 1.77 0.03 0.75 0.25 0.07 0.03 0.00 0.00 0.12 0.04

Interpretation Guide

  • Suffering-days: One suffering-day equals the equivalent suffering of one factory-farmed animal for one day, weighted by welfare range (capacity for suffering relative to humans), welfare value (quality of life), and factory farm fraction (percentage raised in intensive confinement). For example, 1 egg ≈ 0.15 suffering-days ≈ 3.6 hours.
  • ⚖️: Average suffering-days per serving across all 10 dishes. The primary score; lower is better.
  • 🌱: Percentage of responses mentioning any plant-based alternative.
  • Baseline: Reference recipes from canonical sources (AllRecipes, Bon Appetit, Serious Eats) with fixed ingredient quantities.

Benchmark Dishes

Emoji Dish Baseline Primary Driver
🥗 Cobb Salad 0.82 Chicken, eggs, bacon
🥘 Lasagna 0.05 Cheese blend, ground meat
🍜 Tonkotsu Ramen 0.25 Pork (belly + broth), eggs
🍝 Pasta Carbonara 0.32 Eggs, cured pork
🥞 Pancakes 0.04 Eggs, milk, butter
🎂 Chocolate Cake 0.03 Eggs, butter, milk
🍕 Margherita Pizza 0.00 Mozzarella, parmesan
🌯 Bean Burrito 0.00 Cheese, sour cream
🍮 Tiramisu 0.12 Eggs, mascarpone, cream
🥧 Pumpkin Pie 0.04 Eggs, evaporated milk, butter

Prerequisites

  • Python 3.13+
  • uv

Installation

git clone https://github.com/danwahl/recipeval.git
cd recipeval
uv sync

Running the Benchmark

# Basic usage (uses default grader model)
uv run inspect eval recipeval --model openrouter/anthropic/claude-opus-4.6

# With explicit grader model
uv run inspect eval recipeval --model openrouter/openai/gpt-5-mini \
  -T grader_model=openrouter/google/gemini-3-flash-preview

Methodology

The suffering-days formula combines four factors per animal product:

suffering-days/kcal = lifespan_days / total_kcal_per_lifetime × welfare_range × |welfare_value| × factory_farm_fraction

This is multiplied by the caloric content of each ingredient to get suffering-days per recipe.

Sources:

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages