⚡ANSR Training on Fully Procedurally Generated Data Inspired by NeSymReS (Biggio et al. 2021)
Symbolic Regression has been approached with many different methods and paradigms. The overwhelming success of transformer-based language models in recent years has motivated researchers to solve Symbolic Regression with large-scale pre-training of data-conditioned "equation generators" at competitive levels. However, like most traditional methods, the majority of these Amortized Neural Symbolic Regression methods rely on SymPy to simplify and compile randomly generated training equations, a choice that inevitably brings tradeoffs and requires workarounds to work efficiently at scale. I show that replacing SymPy with a novel token-based simplification algorithm built on hand-crafted transformation rules enables training on fully procedurally generated, higher-quality synthetic data, and thus develop ⚡ANSR. On various test sets, my method perfectly recovers
Model Comparison. Up to 3 variables. Default Model Configurations (32 threads / beams).
Bootstrapped Median, 5p, 95p and AR-p (Noreen 1989) values (n=1000).
N = 5000 (⚡ v7.0), 1000 (PySR, NeSymReS 100M).
AMD 9950X (16C32T), RTX 4090.
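The caption above reports bootstrapped medians with 5th and 95th percentiles over n=1000 resamples. As a rough illustration of that kind of summary, here is a minimal sketch using only the Python standard library. The function name `bootstrap_median`, the simple rank-based percentile rule, and the toy scores are assumptions made for illustration; this is not the evaluation code of this repository, and the AR-p test is not covered.

```python
import random
import statistics


def bootstrap_median(values, n_resamples=1000, seed=0):
    """Return (median, 5th percentile, 95th percentile) of bootstrapped medians."""
    rng = random.Random(seed)
    # Resample with replacement and record the median of each resample
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )

    def percentile(p):
        # Simple rank-based percentile on the sorted bootstrap medians
        index = round(p / 100 * (len(medians) - 1))
        return medians[index]

    return statistics.median(medians), percentile(5), percentile(95)


# Toy per-sample scores (hypothetical values, for illustration only)
scores = [0.1, 0.4, 0.35, 0.8, 0.95, 0.5, 0.65, 0.2, 0.7, 0.9]
median, p5, p95 = bootstrap_median(scores)
print(f"median={median:.3f}, 5p={p5:.3f}, 95p={p95:.3f}")
```

The spread between the 5th and 95th percentile gives a sense of how much the reported median could move under resampling of the test samples.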
- 32GB Memory
- CUDA-enabled GPU, 12GB VRAM
- 64GB Storage (subject to change)
- Python $\geq$ 3.11
- pip $\geq$ 21.3 with PEP 660 (see https://pip.pypa.io/en/stable/news/#v21-3)
- (Ubuntu 22.04.3 LTS)
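The software requirements above can be checked before installing. The following is a minimal sketch using only the standard library; the helper names `parse_version`, `meets_minimum` and `check_environment` are hypothetical and not part of flash-ansr, and the check covers only the Python and pip versions, not memory, GPU or storage.

```python
import sys
from importlib import metadata
from itertools import takewhile


def parse_version(text):
    """Turn a version string such as '21.3.1' into a tuple of ints, (21, 3, 1)."""
    parts = []
    for piece in text.split("."):
        # Keep only the leading digits, so '0rc1' contributes 0
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(found, minimum):
    """Compare two version tuples component by component."""
    return tuple(found) >= tuple(minimum)


def check_environment():
    """Report whether Python >= 3.11 and pip >= 21.3 are available."""
    report = {"python": meets_minimum(sys.version_info[:2], (3, 11))}
    try:
        pip_version = parse_version(metadata.version("pip"))
        report["pip"] = meets_minimum(pip_version, (21, 3))
    except metadata.PackageNotFoundError:
        # pip is not installed in this environment
        report["pip"] = False
    return report


print(check_environment())
```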
git clone https://github.com/psaegert/flash-ansr
cd flash-ansr

Create a virtual environment (optional):
conda:
conda create -n ansr python=3.11 ipykernel ipywidgets
conda activate ansr

Then, install the package via
pip install -e .
pip install -e ./nsrops

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Import flash_ansr
from flash_ansr import FlashANSR, GenerationConfig, install_model, get_path
# Specify the model
# Here: https://huggingface.co/psaegert/flash-ansr-v7.0
MODEL = "psaegert/flash-ansr-v7.0"
# Download the latest snapshot of the model
# By default, the model is downloaded to the directory `./models/` in the package root
install_model(MODEL)
# Load the model
ansr = FlashANSR.load(
    directory=get_path('models', MODEL),
    generation_config=GenerationConfig(method='beam_search', beam_width=32),  # optional
    n_restarts=32,  # optional
).to(device)
# Define data
X = ...
y = ...
# Fit the model to the data
ansr.fit(X, y, verbose=True)
# Show the best expression
print(ansr.get_expression())
# Predict with the best expression
y_pred = ansr.predict(X)

Use, copy or modify a config in ./configs:
./configs
├── my_config
│   ├── dataset_train.yaml         # Link to skeleton pool and padding for training
│   ├── dataset_val.yaml           # Link to skeleton pool and padding for validation
│   ├── evaluation.yaml            # Evaluation settings
│   ├── expression_space.yaml      # Operators and variables
│   ├── nsr.yaml                   # Model settings and link to expression space
│   ├── skeleton_pool_train.yaml   # Sampling and holdout settings for training
│   ├── skeleton_pool_val.yaml     # Sampling and holdout settings for validation
│   └── train.yaml                 # Data and schedule for training
Run the training and evaluation pipeline with
./scripts/run.sh my_config

For more information see below.
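Before running the pipeline on a copied or modified config, it can help to confirm that the config directory is complete. The sketch below checks for the eight files listed in the config tree above; the helper `missing_config_files` is hypothetical and not part of flash-ansr, and it assumes every config directory needs all eight files.

```python
from pathlib import Path

# File names taken from the config tree above
EXPECTED_CONFIG_FILES = (
    "dataset_train.yaml",
    "dataset_val.yaml",
    "evaluation.yaml",
    "expression_space.yaml",
    "nsr.yaml",
    "skeleton_pool_train.yaml",
    "skeleton_pool_val.yaml",
    "train.yaml",
)


def missing_config_files(config_dir):
    """Return the expected config files that are absent from config_dir."""
    config_dir = Path(config_dir)
    return [
        name for name in EXPECTED_CONFIG_FILES
        if not (config_dir / name).is_file()
    ]


missing = missing_config_files("./configs/my_config")
print("missing:", missing or "none")
```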
The test data is structured as follows:
./data/ansr-data/test_set
├── feynman
│   └── FeynmanEquations.csv
├── nguyen
│   └── nguyen.csv
└── soose_nc
    └── nc.csv

The test data can be cloned from the Hugging Face data repository:
git clone https://huggingface.co/psaegert/ansr-data data/ansr-data

External datasets must be imported into the ANSR format:
flash_ansr import-data -i "{{ROOT}}/data/ansr-data/test_set/soose_nc/nc.csv" -p "soose" -e "{{ROOT}}/configs/test_set_base/expression_space.yaml" -b "{{ROOT}}/configs/test_set_base/skeleton_pool.yaml" -o "{{ROOT}}/data/ansr-data/test_set/soose_nc/skeleton_pool" -v
flash_ansr import-data -i "{{ROOT}}/data/ansr-data/test_set/feynman/FeynmanEquations.csv" -p "feynman" -e "{{ROOT}}/configs/test_set_base/expression_space.yaml" -b "{{ROOT}}/configs/test_set_base/skeleton_pool.yaml" -o "{{ROOT}}/data/ansr-data/test_set/feynman/skeleton_pool" -v
flash_ansr import-data -i "{{ROOT}}/data/ansr-data/test_set/nguyen/nguyen.csv" -p "nguyen" -e "{{ROOT}}/configs/test_set_base/expression_space.yaml" -b "{{ROOT}}/configs/test_set_base/skeleton_pool.yaml" -o "{{ROOT}}/data/ansr-data/test_set/nguyen/skeleton_pool" -v

with
- `-i`: the input file
- `-p`: the name of the parser implemented in ./src/flash_ansr/compat/convert_data.py
- `-e`: the expression space
- `-b`: the config of a base skeleton pool to add the data to
- `-o`: the output directory for the resulting skeleton pool
- `-v`: verbose output

This will create and save a skeleton pool with the parsed imported skeletons in the specified directory:
./data/ansr-data/test_set/<test_set>
└── skeleton_pool
    ├── expression_space.yaml
    ├── skeleton_pool.yaml
    └── skeletons.pkl

Validation data is generated by randomly sampling according to the settings in the skeleton pool config:
flash_ansr generate-skeleton-pool -c {{ROOT}}/configs/${CONFIG}/skeleton_pool_val.yaml -o {{ROOT}}/data/ansr-data/${CONFIG}/skeleton_pool_val -s 5000 -v

with

- `-c`: the skeleton pool config
- `-o`: the output directory to save the skeleton pool
- `-s`: the number of unique skeletons to sample
- `-v`: verbose output

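The commands in this document contain the placeholders `{{ROOT}}` and `${CONFIG}`. The source does not state how `{{ROOT}}` is resolved, so the following sketch only shows one way to expand both by hand, for example to inspect the resulting paths. It assumes that `{{ROOT}}` stands for the repository root and `${CONFIG}` for the name of a config directory; `expand_placeholders` and the example root path are hypothetical.

```python
import shlex


def expand_placeholders(command, root, config):
    """Replace {{ROOT}} and ${CONFIG} in a command string and split it into argv."""
    expanded = command.replace("{{ROOT}}", root).replace("${CONFIG}", config)
    return shlex.split(expanded)


# Command template copied from the skeleton pool generation step above
template = (
    "flash_ansr generate-skeleton-pool "
    "-c {{ROOT}}/configs/${CONFIG}/skeleton_pool_val.yaml "
    "-o {{ROOT}}/data/ansr-data/${CONFIG}/skeleton_pool_val -s 5000 -v"
)

# Hypothetical root path and config name, for illustration only
argv = expand_placeholders(template, root="/path/to/flash-ansr", config="my_config")
print(argv)
```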
flash_ansr train -c {{ROOT}}/configs/${CONFIG}/train.yaml -o {{ROOT}}/models/ansr-models/${CONFIG} -v -ci 100000 -vi 10000

with

- `-c`: the training config
- `-o`: the output directory to save the model and checkpoints
- `-v`: verbose output
- `-ci`: the interval to save checkpoints
- `-vi`: the interval for validation

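To make the two intervals concrete, here is a small sketch of how often each event would occur with the values used above. It assumes that both intervals are counted in training steps, which the source does not state; `due_events` and the total number of steps are hypothetical and do not reflect the actual training loop.

```python
def due_events(step, checkpoint_interval=100_000, validation_interval=10_000):
    """Return the events that fall on a given training step."""
    events = []
    if step > 0 and step % validation_interval == 0:
        events.append("validate")
    if step > 0 and step % checkpoint_interval == 0:
        events.append("checkpoint")
    return events


# Hypothetical training length, for illustration only
total_steps = 300_000
validations = sum("validate" in due_events(s) for s in range(1, total_steps + 1))
checkpoints = sum("checkpoint" in due_events(s) for s in range(1, total_steps + 1))
print(f"{validations} validation runs, {checkpoints} checkpoints")
```

Under these assumptions, validation runs ten times as often as checkpointing.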
flash_ansr evaluate -c {{ROOT}}/configs/${CONFIG}/evaluation.yaml -m "{{ROOT}}/models/ansr-models/${MODEL}" -d "{{ROOT}}/data/ansr-data/test_set/soose_nc/dataset.yaml" -n 5000 -o {{ROOT}}/results/evaluation/${CONFIG}/soose_nc.pickle -v
flash_ansr evaluate -c {{ROOT}}/configs/${CONFIG}/evaluation.yaml -m "{{ROOT}}/models/ansr-models/${MODEL}" -d "{{ROOT}}/data/ansr-data/test_set/feynman/dataset.yaml" -n 5000 -o {{ROOT}}/results/evaluation/${CONFIG}/feynman.pickle -v
flash_ansr evaluate -c {{ROOT}}/configs/${CONFIG}/evaluation.yaml -m "{{ROOT}}/models/ansr-models/${MODEL}" -d "{{ROOT}}/data/ansr-data/test_set/nguyen/dataset.yaml" -n 5000 -o {{ROOT}}/results/evaluation/${CONFIG}/nguyen.pickle -v
flash_ansr evaluate -c {{ROOT}}/configs/${CONFIG}/evaluation.yaml -m "{{ROOT}}/models/ansr-models/${MODEL}" -d "{{ROOT}}/configs/${CONFIG}/dataset_val.yaml" -n 5000 -o {{ROOT}}/results/evaluation/${CONFIG}/val.pickle -v
flash_ansr evaluate -c {{ROOT}}/configs/${CONFIG}/evaluation.yaml -m "{{ROOT}}/models/ansr-models/${MODEL}" -d "{{ROOT}}/data/ansr-data/test_set/pool_15/dataset.yaml" -n 5000 -o {{ROOT}}/results/evaluation/${CONFIG}/pool_15.pickle -v
flash_ansr evaluate -c {{ROOT}}/configs/${CONFIG}/evaluation.yaml -m "{{ROOT}}/models/ansr-models/${MODEL}" -d "{{ROOT}}/configs/${CONFIG}/dataset_train.yaml" -n 5000 -o {{ROOT}}/results/evaluation/${CONFIG}/train.pickle -v

with

- `-c`: the evaluation config
- `-m`: the model to evaluate
- `-d`: the dataset to evaluate on
- `-n`: the number of samples to evaluate
- `-o`: the output file for results
- `-v`: verbose output

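The six evaluation commands above differ only in the dataset and the output file. A sketch that generates them in a loop could look as follows. The helper `build_evaluate_commands` is hypothetical, the flags and paths are copied from the commands above, and `{{ROOT}}` is kept literal because the source does not state how it is resolved.

```python
import shlex

# Dataset names taken from the commands above
TEST_SETS = ("soose_nc", "feynman", "nguyen", "pool_15")
CONFIG_SPLITS = ("val", "train")


def build_evaluate_commands(config, model, n_samples=5000):
    """Build one `flash_ansr evaluate` argument list per dataset."""
    root = "{{ROOT}}"  # placeholder kept as written in the commands above
    datasets = {
        name: f"{root}/data/ansr-data/test_set/{name}/dataset.yaml"
        for name in TEST_SETS
    }
    datasets.update({
        split: f"{root}/configs/{config}/dataset_{split}.yaml"
        for split in CONFIG_SPLITS
    })
    commands = []
    for name, dataset in datasets.items():
        commands.append([
            "flash_ansr", "evaluate",
            "-c", f"{root}/configs/{config}/evaluation.yaml",
            "-m", f"{root}/models/ansr-models/{model}",
            "-d", dataset,
            "-n", str(n_samples),
            "-o", f"{root}/results/evaluation/{config}/{name}.pickle",
            "-v",
        ])
    return commands


# Hypothetical config and model names, for illustration only
commands = build_evaluate_commands(config="my_config", model="my_model")
for command in commands:
    print(shlex.join(command))
```

Building argument lists rather than shell strings keeps quoting explicit if the commands are later passed to `subprocess.run`.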
- Clone NeuralSymbolicRegressionThatScales to a directory of your choice.
- Download the 100M model as described here
- Move the 100M model into flash-ansr/models/nesymres/
- Create a Python 3.10 (!) environment and install flash-ansr as in the previous steps.
- Install NeSymReS in the same environment:
cd NeuralSymbolicRegressionThatScales
pip install -e src/
pip install lightning

- Navigate back to this repository and run the evaluation
cd flash-ansr
./scripts/evaluate_nesymres <test_set>

- Install PySR in the same environment as flash-ansr.
- Run the evaluation
./scripts/evaluate_pysr <test_set>

To set up the development environment, run the following commands:
pip install -e .[dev]
pip install -e ./nsrops
pre-commit install

Test the package with ./scripts/pytest.sh. Run pylint with ./scripts/pylint.sh.
@software{flash-ansr2024,
author = {Paul Saegert},
title = {Flash Amortized Neural Symbolic Regression},
year = 2024,
publisher = {GitHub},
version = {0.3.0},
url = {https://github.com/psaegert/flash-ansr}
}