This repository contains the official implementation of AlphaSAGE (Structure-Aware Alpha Mining via Generative Flow Networks for Robust Exploration), a novel framework for automated mining of predictive signals (alphas) in quantitative finance.
The automated mining of predictive signals, or alphas, is a central challenge in quantitative finance. While Reinforcement Learning (RL) has emerged as a promising paradigm for generating formulaic alphas, existing frameworks are fundamentally hampered by three interconnected issues:
- Reward Sparsity: Meaningful feedback is only available upon completion of a full formula, leading to inefficient and unstable exploration
- Inadequate Representation: Sequential representations fail to capture the mathematical structure that determines an alpha's behavior
- Mode Collapse: Standard RL objectives drive policies towards single optimal modes, contradicting the need for diverse, non-correlated alpha portfolios
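To make the problem concrete, a formulaic alpha is a symbolic expression over raw market fields (price, volume, etc.) whose cross-sectional values serve as a trading signal. A hypothetical example in pandas, purely for illustration (the names `close` and `volume` and the operator choice are assumptions, not this repository's API):

```python
import pandas as pd

# Hypothetical formulaic alpha: 10-day price/volume correlation,
# ranked cross-sectionally. `close` and `volume` are DataFrames
# indexed by date with one column per instrument.
def example_alpha(close: pd.DataFrame, volume: pd.DataFrame) -> pd.DataFrame:
    corr = close.rolling(10).corr(volume)  # time-series operator
    return corr.rank(axis=1, pct=True)     # cross-sectional operator
```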
AlphaSAGE addresses these challenges through three cornerstone innovations:
- Structure-Aware Encoding: a Relational Graph Convolutional Network (RGCN) based encoder captures the inherent mathematical structure of alpha expressions (a minimal sketch follows this list)
  - Preserves semantic relationships between operators and operands
  - Enables a better understanding of formula behavior and properties
- GFlowNets-Based Exploration: replaces traditional RL with GFlowNets for diverse exploration
  - Naturally supports multi-modal sampling for generating diverse alpha portfolios
  - Avoids the mode collapse inherent in standard policy gradient methods
- Dense Reward Signals: provides rich, intermediate feedback throughout the generation process
  - Combines multiple evaluation criteria for comprehensive alpha assessment
  - Enables more stable and efficient learning than sparse terminal rewards
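As a rough illustration of the first innovation, here is a minimal structure-aware encoder sketch built on PyTorch Geometric's `RGCNConv`. The node features, relation types, and dimensions are illustrative assumptions, not the repository's actual interface:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import RGCNConv, global_mean_pool

class ExpressionEncoder(nn.Module):
    """Encodes an alpha expression tree as a relational graph.

    Assumed (illustrative) setup: nodes are operator/operand tokens and
    edge types distinguish argument positions (parent -> 1st child vs.
    parent -> 2nd child), so non-commutative operators such as Sub(a, b)
    keep their semantics.
    """

    def __init__(self, num_tokens: int, num_relations: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(num_tokens, dim)
        self.conv1 = RGCNConv(dim, dim, num_relations)
        self.conv2 = RGCNConv(dim, dim, num_relations)

    def forward(self, token_ids, edge_index, edge_type, batch):
        x = self.embed(token_ids)
        x = torch.relu(self.conv1(x, edge_index, edge_type))
        x = torch.relu(self.conv2(x, edge_index, edge_type))
        return global_mean_pool(x, batch)  # one embedding per expression
```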
An overview of AlphaSAGE is shown in the following figure:
Empirical results demonstrate that AlphaSAGE significantly outperforms existing baselines in:
- Diversity: Mining more diverse alpha portfolios
- Novelty: Generating novel alpha expressions
- Predictive Power: Achieving higher predictive performance
- Robustness: Maintaining performance across different market conditions
The backtest results of AlphaSAGE are shown in the following figure:
Detailed results and analysis can be found in our paper.
We use PDM to manage the dependencies. To install PDM, please refer to the official documentation.
```bash
git clone https://github.com/BerkinChen/AlphaSAGE.git
cd AlphaSAGE
pdm install
```

We use data from Qlib to train the model. Please refer to the official documentation to download the data.
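After downloading the Qlib data bundle, the scripts need to be able to locate it. As a hedged example, Qlib is conventionally initialized in Python like this (the provider path is the usual default location, not necessarily what this repository's scripts expect):

```python
import qlib
from qlib.constant import REG_CN

# Point provider_uri at wherever you downloaded the Qlib data bundle;
# "~/.qlib/qlib_data/cn_data" is the conventional default for CN data.
qlib.init(provider_uri="~/.qlib/qlib_data/cn_data", region=REG_CN)
```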
AlphaSAGE operates in two main stages:
Generate a diverse pool of alpha expressions using our structure-aware GFlowNets framework:
```bash
python train_gfn.py \
--seed 0 \
--instrument csi300 \
--pool_capacity 50 \
--log_freq 500 \
--update_freq 64 \
--n_episodes 10000 \
--encoder_type gnn \
--entropy_coef 0.01 \
--entropy_temperature 1.0 \
--mask_dropout_prob 1.0 \
--ssl_weight 1.0 \
--nov_weight 0.3 \
--weight_decay_type linear \
--final_weight_ratio 0.0
```

Key Parameters:
- `--encoder_type gnn`: Uses our structure-aware RGCN encoder
- `--pool_capacity 50`: Maximum number of alphas to maintain in the pool
- `--entropy_coef 0.01`: Controls the exploration vs. exploitation balance
- `--ssl_weight 1.0`: Self-supervised learning weight for structure awareness
- `--nov_weight 0.3`: Novelty reward weight for diversity
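For intuition on why GFlowNets avoid mode collapse: the generator is trained so that the probability of sampling a complete expression is proportional to its reward, rather than concentrated on the single best one. Below is a minimal sketch of the trajectory balance objective commonly used to train GFlowNets; the tensor names are assumptions for illustration, not this repository's training loop:

```python
import torch

def trajectory_balance_loss(log_Z: torch.Tensor,
                            log_pf: torch.Tensor,
                            log_pb: torch.Tensor,
                            log_reward: torch.Tensor) -> torch.Tensor:
    """Trajectory balance: (log Z + sum log P_F - log R(x) - sum log P_B)^2.

    log_Z:      learned scalar, log-partition function
    log_pf:     (T,) forward log-probs of actions along one trajectory
    log_pb:     (T,) backward log-probs along the same trajectory
    log_reward: scalar log R(x) of the completed expression x

    Minimizing this drives P(x) toward being proportional to R(x), so
    distinct high-reward expressions all keep sampling mass instead of
    a single mode dominating.
    """
    return (log_Z + log_pf.sum() - log_reward - log_pb.sum()) ** 2
```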
Following AlphaForge, we use adaptive combination to create the final alpha portfolio:
```bash
python run_adaptive_combination.py \
--expressions_file results_dir \
--instruments csi300 \
--threshold_ric 0.015 \
--threshold_ricir 0.15 \
--chunk_size 400 \
--window inf \
--n_factors 20 \
--cuda 2 \
--train_end_year 2020 \
--seed 0
```
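Here, `--threshold_ric` and `--threshold_ricir` screen candidate alphas by Rank IC and its information ratio before combination. A hedged sketch of how these statistics are conventionally computed (the data layout and names are assumptions, not the script's internals):

```python
import pandas as pd

def rank_ic_stats(factor: pd.DataFrame, fwd_returns: pd.DataFrame):
    """Rank IC statistics for one alpha.

    factor/fwd_returns: rows are dates, columns are instruments; the
    daily Rank IC is the cross-sectional Spearman correlation between
    factor values and next-period returns.
    """
    daily_ic = factor.corrwith(fwd_returns, axis=1, method="spearman")
    ric = daily_ic.mean()         # Rank IC
    ricir = ric / daily_ic.std()  # Rank ICIR (information ratio)
    return ric, ricir

# With the flags above, an alpha is kept if ric > 0.015 and ricir > 0.15.
```

The surviving alphas are then combined into a composite signal; following AlphaForge, the combination weights are refit adaptively over time rather than fixed once.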
For comparison with AlphaGen and AlphaQCM, run the following commands:

```bash
# Train AlphaQCM
python train_qcm.py \
--instruments csi300 \
--pool 20 \
--seed 0

# Evaluate AlphaQCM results
python run_adaptive_combination.py \
--expressions_file results_dir \
--instruments csi300 \
--cuda 2 \
--train_end_year 2020 \
--seed 0 \
--use_weights True
```

```bash
# Train AlphaGen with PPO
python train_ppo.py \
--instruments csi300 \
--pool 20 \
--seed 0

# Evaluate AlphaGen results
python run_adaptive_combination.py \
--expressions_file results_dir \
--instruments csi300 \
--cuda 2 \
--train_end_year 2020 \
--seed 0 \
--use_weights True
```

For AlphaForge and other ML baselines, please refer to the AlphaForge documentation.
This project is licensed under the MIT License - see the LICENSE file for details.
We welcome contributions! Please feel free to submit a Pull Request.