This repository contains a framework for evaluating the capability of Large Language Models (LLMs) to generate 3D models (specifically OpenSCAD code or direct STL files) from textual descriptions.
- Goal: Assess the reliability and geometric accuracy of various LLMs in producing single, simple mechanical parts based on text prompts, including analysis based on task complexity, execution time, and estimated cost.
- Scope:
- Focus on single-part generation.
- Input is primarily text descriptions defined in task files.
- Evaluation involves automated generation, rendering (where applicable), geometric checks against reference models, and visualization of results via a dashboard.
- Supports standard LLMs (via APIs from providers such as OpenAI, Anthropic, and Google) and specialized models such as Zoo ML's Text-to-CAD (via its CLI).
The evaluation process follows these steps:
- Configuration (`config.yaml`): Defines models to test, API keys (sourced from `.env`), prompt templates, directories, geometry check thresholds, cost parameters (token- or time-based), and evaluation parameters (e.g., number of replicates per task).
- Task Definition (`tasks/*.yaml`): Each task YAML file specifies:
  - `task_id`: Unique identifier.
  - `description`: The text prompt for the LLM.
  - `reference_stl`: Path to the corresponding ground-truth STL model in `reference/`.
  - `manual_operations` (optional): An integer indicating the complexity of the task, used for complexity analysis.
  - `requirements` (optional): Can include `bounding_box` or `topology_requirements`.
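Under these conventions, a task file might look like the following, shown here as the equivalent Python structure (the task ID, description, and dimensions are invented for illustration; consult an actual file in `tasks/` for the authoritative schema):

```python
# Hypothetical task definition mirroring the tasks/*.yaml fields listed above.
task = {
    "task_id": "simple_plate_two_holes",
    "description": "A 40 x 20 x 5 mm rectangular plate with two 4 mm "
                   "diameter through-holes along the centerline.",
    "reference_stl": "reference/simple_plate_two_holes.stl",
    "manual_operations": 3,           # optional: complexity score
    "requirements": {                 # optional
        "bounding_box": [40.0, 20.0, 5.0],
    },
}

# The non-optional fields must always be present:
assert {"task_id", "description", "reference_stl"} <= task.keys()
```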
- Execution (`scripts/run_evaluation.py`): The main script that orchestrates the evaluation:
  - Loads configuration (`config.yaml`) and selected tasks (`tasks/`).
  - Iterates through the specified models and prompts (from `config.yaml`) and the requested number of replicates.
  - For standard LLMs: calls `scripts/generate_scad.py` to interact with LLM APIs, saving generated `.scad` files to `results/{run_id}/scad/`. Captures LLM API call duration and token counts.
  - For `zoo_cli`: calls the `zoo ml text-to-cad export` command directly, saving the output `.stl` to `results/{run_id}/stl/`. Captures command execution duration.
  - Rendering: for successfully generated `.scad` files, calls OpenSCAD via `scripts/render_scad.py` to render them into `.stl` files in `results/{run_id}/stl/`. Captures rendering duration.
  - Geometry Checks: for all successfully generated/rendered `.stl` files, calls `scripts/geometry_check.py` to perform checks against the reference STL.
  - Aggregates raw results (generation status, render status, check results, metrics, timings, token counts) for each attempt into `results/{run_id}/results_{run_id}.json`, and writes a log file to `results/{run_id}/run_{run_id}.log`.
- Post-Processing (`scripts/process_results.py`):
  - Reads the raw `results_{run_id}.json` file from a specified run.
  - Calculates summary statistics (`meta_statistics` grouped by model/prompt, `task_statistics` grouped by task), including average time and estimated cost metrics.
  - Calculates complexity analysis based on `manual_operations` from the task YAMLs.
  - Formats the processed data and statistics into `dashboard/dashboard_data.json`.
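The grouping that produces `meta_statistics` can be pictured as follows (the record field names and values here are hypothetical illustrations; the exact schema lives in `results_{run_id}.json` from a real run):

```python
from collections import defaultdict
from statistics import mean

# Invented per-attempt records standing in for entries of results_{run_id}.json.
attempts = [
    {"model": "gpt-4o-mini", "prompt": "default", "total_time_s": 12.0, "cost": 0.0004},
    {"model": "gpt-4o-mini", "prompt": "default", "total_time_s": 10.0, "cost": 0.0006},
    {"model": "zoo_cli",     "prompt": "default", "total_time_s": 45.0, "cost": 0.1250},
]

# Group attempts by (model, prompt), then average each metric per group.
groups = defaultdict(list)
for a in attempts:
    groups[(a["model"], a["prompt"])].append(a)

meta_statistics = {
    key: {"avg_time_s": mean(a["total_time_s"] for a in grp),
          "avg_cost":   mean(a["cost"] for a in grp)}
    for key, grp in groups.items()
}
print(meta_statistics[("gpt-4o-mini", "default")]["avg_time_s"])  # 11.0
```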
- Dashboard (`dashboard/`):
  - A web-based dashboard (`dashboard.html`, `dashboard.js`, `dashboard.css`) reads `dashboard_data.json`.
  - Displays interactive charts (using Chart.js) and tables visualizing model performance across various metrics, including average time and cost.
```
CadEval/
├── config.yaml               # Main configuration file
├── environment.yml           # Conda environment definition (or requirements.txt)
├── .env                      # API keys and environment variables (add to .gitignore!)
├── .gitignore                # Specifies intentionally untracked files
├── README.md                 # This file
├── tasks/                    # Task definition YAML files
│   └── *.yaml
├── reference/                # Reference/ground-truth STL models
│   └── *.stl
├── scripts/                  # Core Python scripts
│   ├── run_evaluation.py     # Main orchestration script
│   ├── process_results.py    # Post-processing for dashboard
│   ├── geometry_check.py     # Performs geometric comparisons
│   ├── render_scad.py        # Handles OpenSCAD rendering
│   ├── generate_scad.py      # Handles LLM API calls for SCAD generation
│   ├── task_loader.py        # Loads task definitions
│   ├── config_loader.py      # Loads config.yaml
│   └── logger_setup.py       # Configures logging
├── results/                  # Output directory for evaluation runs
│   └── {run_id}/             # Each run gets its own directory
│       ├── results_{run_id}.json  # Raw detailed results: status, metrics, timings, token counts
│       ├── run_{run_id}.log       # Log file for the run
│       ├── scad/             # Generated .scad files (for non-Zoo models)
│       │   └── *.scad
│       └── stl/              # Generated/rendered .stl files
│           └── *.stl
├── dashboard/                # Web dashboard files
│   ├── dashboard.html
│   ├── dashboard.js
│   ├── dashboard.css
│   └── dashboard_data.json   # Processed data consumed by the dashboard (summary statistics for metrics, time, and cost)
└── ...                       # Other files/dirs (.git, .pytest_cache, tests/, etc.)
```
- Clone the Repository:

  ```bash
  git clone <repository-url>
  cd CadEval
  ```

- Create Environment using Conda:

  ```bash
  conda env create -f environment.yml
  conda activate cadeval
  ```

- Install OpenSCAD: Download and install OpenSCAD (version 2021.01 or later recommended) from openscad.org. Ensure it is available on your system's PATH, or update `openscad.executable_path` in `config.yaml` accordingly.

- Install Zoo ML CLI (Optional): If evaluating the `zoo-ml-text-to-cad` model, follow the installation instructions for the Zoo command-line tools.

- Configure API Keys:
  - Create a file named `.env` in the project root directory (if it doesn't exist).
  - Add your API keys like this:

    ```
    OPENAI_API_KEY=your_openai_key
    ANTHROPIC_API_KEY=your_anthropic_key
    GOOGLE_API_KEY=your_google_key
    # Add other keys if needed
    ```

  - Important: Ensure `.env` is listed in your `.gitignore` file to avoid committing secrets.

- Review `config.yaml`:
  - Verify the `openscad.executable_path`.
  - Add/remove/modify LLM models under `llm.models`. Ensure providers match the API clients used in `scripts/generate_scad.py`.
  - Configure cost parameters for each model (e.g., `cost_per_input_token` and `cost_per_output_token` for LLMs; `cost_per_minute` and `free_tier_seconds` for `zoo_cli`). Use accurate pricing data.
  - Adjust prompt templates under `prompts` if desired.
  - Review geometry check thresholds under `geometry_check`.
  - Set the desired `evaluation.num_replicates`.
- Activate Environment:

  ```bash
  conda activate cadeval
  ```

- Run Evaluation: Execute the main orchestration script from the project root directory.
  - Run all tasks, models, and the 'default' prompt defined in `config.yaml`:

    ```bash
    python scripts/run_evaluation.py
    ```

  - Specify tasks, models, prompts, or run ID:

    ```bash
    # Run only specific tasks
    python scripts/run_evaluation.py --tasks task_id_1 task_id_2

    # Run only specific models (names must match config.yaml)
    python scripts/run_evaluation.py --models gpt-4o-mini claude-3-5-sonnet-20240620

    # Run using specific prompt template keys (from config.yaml)
    python scripts/run_evaluation.py --prompts concise default

    # Assign a custom run ID
    python scripts/run_evaluation.py --run-id my_test_run_01

    # Combine options
    python scripts/run_evaluation.py --tasks task_id_1 --models o1-2024-12-17 --prompts default --replicates 1 --run-id specific_test
    ```

  - Output for the run will be created in `results/{run_id}/`.

- Process Results for Dashboard: After an evaluation run completes, process its raw results JSON.

  ```bash
  # Replace {run_id} with the actual ID of the run you want to process
  # Example: python scripts/process_results.py --results-path results/20231027_153000/results_20231027_153000.json
  python scripts/process_results.py --results-path results/{run_id}/results_{run_id}.json
  ```

  - This generates/updates `dashboard/dashboard_data.json`.

- View Dashboard:
  - Open `dashboard/dashboard.html` in your web browser.
  - The dashboard loads data from `dashboard_data.json` and displays interactive charts and summary tables, including charts for average total generation time and estimated cost per model. Refresh the browser page after running `process_results.py` to see updated data.

- Run Unit Tests (Optional):
  - Ensure any development dependencies are installed; if needed, install pytest directly: `pip install pytest`.
  - Run pytest from the project root directory:

    ```bash
    pytest
    ```
For significantly faster execution, especially with many tasks or replicates, you can use the parallel evaluation script `scripts/run_evaluation_parallel.py`. It performs generation (LLM/Zoo), rendering, and geometry checks concurrently using multiple processes or threads.

- Basic Usage: It accepts the same arguments as `run_evaluation.py` (e.g., `--tasks`, `--models`, `--run-id`).

  ```bash
  python scripts/run_evaluation_parallel.py --tasks task_id_1 task_id_2 --models gpt-4o-mini
  ```

- Controlling Concurrency: Control the maximum number of parallel workers per stage with:
  - `--max-workers-llm <N>`: Max threads for LLM API calls (default: 10).
  - `--max-workers-zoo <N>`: Max processes for Zoo CLI calls (default: half the CPU cores).
  - `--max-workers-render <N>`: Max processes for OpenSCAD rendering (default: number of CPU cores).
  - `--max-workers-check <N>`: Max processes for geometry checks (default: number of CPU cores).
  - Example, to reduce CPU load:

    ```bash
    python scripts/run_evaluation_parallel.py --tasks task_id_1 --max-workers-render 4 --max-workers-check 4
    ```

- Output: The parallel script produces the same `results/{run_id}/` structure, including the final `results_{run_id}.json`, which is compatible with `scripts/process_results.py`.
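The staged-concurrency idea can be illustrated with Python's `concurrent.futures` (a pattern sketch only, not the project's actual code; `fake_llm_call` is an invented stand-in for an API request). I/O-bound LLM calls suit a thread pool, while CPU-bound rendering and checking would use a `ProcessPoolExecutor` instead:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_llm_call(task_id: str) -> str:
    # Stand-in for a network-bound LLM API request that returns a .scad path.
    return f"{task_id}.scad"

tasks = ["task_1", "task_2", "task_3"]

# Threads for I/O-bound work; max_workers plays the role of --max-workers-llm.
with ThreadPoolExecutor(max_workers=10) as pool:
    scad_files = list(pool.map(fake_llm_call, tasks))  # map preserves input order

print(scad_files)  # ['task_1.scad', 'task_2.scad', 'task_3.scad']
```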
The `scripts/geometry_check.py` script performs the following evaluations on successfully generated/rendered STL files:

- Watertight: Checks whether the mesh is manifold, using Open3D.
- Single Component: Verifies that the mesh consists of a single connected component (or the number specified in the task requirements).
- Bounding Box Accuracy: Compares the dimensions of the generated model's bounding box (after ICP alignment) to the reference model's bounding box within a tolerance (`geometry_check.bounding_box_tolerance_mm`).
- Volume Accuracy: Compares the volume of the generated model to the reference model's volume within a percentage threshold (`geometry_check.volume_threshold_percent`).
- Geometric Similarity (Chamfer Distance): Calculates the Chamfer distance between generated and reference point clouds after ICP alignment (`geometry_check.chamfer_threshold_mm`). Lower is better.
- Hausdorff Distance (95th & 99th Percentile): Calculates percentile Hausdorff distances after alignment (`geometry_check.hausdorff_threshold_mm` applies to the 95th percentile).
- ICP Fitness: Reports the alignment quality score from the Iterative Closest Point algorithm.
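For intuition, the Chamfer metric above can be sketched in a few lines of NumPy. This is a brute-force illustration of the definition, not the Open3D-based implementation in `geometry_check.py`, and note that some formulations use squared distances instead:

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds a (N,3) and b (M,3)."""
    # Pairwise Euclidean distances, shape (N, M)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Average nearest-neighbour distance in each direction, summed
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

# Two tiny point sets offset by 0.1 mm along x
a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = a + np.array([0.1, 0.0, 0.0])
print(round(chamfer_distance(a, b), 3))  # 0.2
```

A real check would sample thousands of points from each mesh surface after ICP alignment and compare the result against `geometry_check.chamfer_threshold_mm`.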
In addition to geometric checks, the framework tracks and estimates the time and cost associated with each generation attempt:

- Time Tracking:
  - LLM Generation Time (`llm_duration_seconds`): Wall-clock time for the API call to the LLM provider (e.g., OpenAI, Anthropic).
  - Zoo CLI Generation Time (`generation_duration_seconds`): Wall-clock time for executing the `zoo ml text-to-cad export` command.
  - Rendering Time (`render_duration_seconds`): Wall-clock time for OpenSCAD to render a `.scad` file into an `.stl` file.
  - Total Time: The sum of generation time (LLM or Zoo) and rendering time (if applicable). The dashboard displays `Average Total Time (Gen + Render)`.

- Cost Estimation:
  - LLM Cost: Estimated from the token counts (`prompt_tokens`, `completion_tokens`) returned by the API and the pricing (`cost_per_input_token`, `cost_per_output_token`) defined for the model in `config.yaml`.
  - Zoo CLI Cost: Estimated from the execution duration (`generation_duration_seconds`) and the pricing (`cost_per_minute`, `free_tier_seconds`) defined in `config.yaml`. The configurable free-tier duration is subtracted before calculating cost.
  - The dashboard displays `Average Estimated Cost` per successful generation/render cycle.
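The two pricing rules above amount to simple arithmetic. A sketch using the same field names as the `config.yaml` keys (the rates and durations below are invented, not real provider pricing):

```python
def estimate_llm_cost(prompt_tokens: int, completion_tokens: int,
                      cost_per_input_token: float,
                      cost_per_output_token: float) -> float:
    # Token-based pricing for API models
    return (prompt_tokens * cost_per_input_token
            + completion_tokens * cost_per_output_token)

def estimate_zoo_cost(generation_duration_seconds: float,
                      cost_per_minute: float,
                      free_tier_seconds: float) -> float:
    # Duration-based pricing: subtract the free tier first, never going negative
    billable = max(0.0, generation_duration_seconds - free_tier_seconds)
    return (billable / 60.0) * cost_per_minute

# 1,000 prompt + 500 completion tokens at hypothetical per-token rates: ~$0.00045
llm_cost = estimate_llm_cost(1000, 500, 0.15e-6, 0.60e-6)

# 90 s Zoo run with a 30 s free tier at $0.50/min: 60 s billable -> $0.50
zoo_cost = estimate_zoo_cost(90.0, 0.50, 30.0)
```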
While this framework provides valuable insights, it's important to be aware of its current limitations:
- Limited Task Coverage: The current evaluation suite consists of 25 tasks. While designed to cover a range of simple operations, this number is insufficient to represent the full spectrum of CAD modeling challenges. The statistical power may be limited for detecting subtle performance differences, although it is sufficient to show larger differences.
- Potential False Negatives (estimated below ~5%):
- Alignment Failures: The Iterative Closest Point (ICP) alignment step, which precedes Chamfer and Hausdorff distance calculations, can occasionally fail even for visually similar models. This leads to inaccurate distance metrics and subsequent check failures.
- Metric Sensitivity: Sometimes, distance metrics (like Hausdorff) can be overly sensitive to minor outliers or mesh artifacts, resulting in a failed check even if the core geometry is correct.
- Potential False Positives: The current checks might not reliably detect the absence of small features like minor fillets or chamfers if their omission doesn't significantly impact the overall distance metrics or bounding box.
These limitations highlight areas for future refinement in both the task suite and the checking methodology.
Potential areas for future development and improvement of this evaluation framework include:
- Expanded Task Suite: Increase the number and diversity of tasks to cover a wider range of geometric features, complexities, and potential edge cases.
- Prompt Engineering & Optimization: Investigate the impact of different system prompts, few-shot examples, or providing specific OpenSCAD documentation snippets to improve model performance and reliability.
- Multimodal Input: Extend the framework to evaluate model performance based on visual inputs, such as sketches, technical drawings, or existing images.
- Incremental Modification Tasks:
- Adding Operations: Evaluate the ability of LLMs to add new features (holes, extrusions, etc.) to existing valid OpenSCAD code or STL models based on text instructions.
- Editing/Fixing: Assess how well models can correct errors or modify specific features in existing SCAD code based on feedback or new requirements.
- STL-to-STL Evaluation: Evaluate the direct ability of models (like Zoo ML) to generate an STL file that closely matches a reference STL, potentially using different metrics or comparison techniques.
- Optimize & Track Reasoning Tokens: Initial tests suggest Anthropic's `thinking.budget_tokens` setting significantly impacts performance (e.g., 2500 may outperform 10000), warranting further optimization. Additionally, incorporate tracking of the reasoning/thinking tokens used by Anthropic, OpenAI (o-series), and potentially Google models to better understand the cost/benefit of reasoning.
- Provider-Specific Rate Limiting: High concurrency (e.g., `--max-workers-llm 10`) can trigger rate limits (HTTP 429 errors) with providers such as Anthropic. Reducing workers (e.g., `--max-workers-llm 3`) helps, but future improvements could include provider-specific worker limits or more sophisticated rate-limiting logic within the LLM clients (e.g., respecting `Retry-After` headers, or token bucket algorithms).
(Refer to git history for older versions of this README.)