This project implements a Reflective Genetic Programming (GP) agent for autonomous machine learning within the ML-Master framework (https://github.com/sjtu-sai-agents/ML-Master).
Inspired by ReEvo (https://github.com/ai4co/reevo), we define the genetic operators as follows:
- Crossover: Short-term memory consolidation from current population
- Mutation: Long-term memory recall via global best solutions
The GP agent outperforms the MCTS baseline on three MLE-bench tasks, showing broader exploration and greater resistance to premature convergence.
- Population-based evolution with intelligent LLM-driven operators
- Crossover operator combines strengths of two high-performing parents (short-term memory)
- Mutation operator injects insights from global best solution (long-term memory)
- Elitism strategy preserves best individuals across generations
Inspired by HSEvo, we track population diversity throughout evolution:
- SWDI (Shannon-Wiener Diversity Index): Measures instantaneous population diversity using hierarchical clustering
- CDI (Cumulative Diversity Index): Evaluates overall exploration via Minimum Spanning Tree analysis
- Semantic embeddings from fine-tuned CodeT5 for meaningful code similarity assessment
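The two indices above can be sketched in a few lines. This is a minimal, dependency-free illustration that assumes each solution has already been embedded as a vector. It uses one common formulation: SWDI as the Shannon entropy of cluster proportions, and CDI as the Shannon entropy of normalized Minimum Spanning Tree edge lengths. The function names, the single-linkage clustering with a fixed cosine-distance threshold, and the default threshold value are assumptions for illustration, not the code in utils/diversity_utils.py.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity, clamped at 0 to absorb floating-point noise."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return max(0.0, 1.0 - dot / (norm_u * norm_v))


def shannon_entropy(weights):
    """Entropy of the distribution obtained by normalizing non-negative weights."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    return abs(sum(p * math.log(p) for p in probs))


def swdi(embeddings, threshold=0.1):
    """Entropy of cluster sizes. Clusters are single-linkage components:
    any two points closer than `threshold` (an assumed cut) are merged."""
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(embeddings)), 2):
        if cosine_distance(embeddings[i], embeddings[j]) <= threshold:
            parent[find(i)] = find(j)

    sizes = {}
    for i in range(len(embeddings)):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return shannon_entropy(list(sizes.values()))


def cdi(embeddings):
    """Entropy of normalized MST edge lengths, with the MST built by Prim's algorithm."""
    n = len(embeddings)
    if n < 2:
        return 0.0
    # best[j] = shortest known distance from node j to the growing tree
    best = {j: cosine_distance(embeddings[0], embeddings[j]) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=best.get)
        edges.append(best.pop(j))
        for k in best:
            best[k] = min(best[k], cosine_distance(embeddings[j], embeddings[k]))
    return shannon_entropy(edges)
```

Under this sketch, a population that collapses onto one cluster has an SWDI of zero, while a population split evenly across k clusters has an SWDI of ln k.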
Our GP agent uses LLMs to perform semantic evolution on Python code, inspired by the ReEvo framework for Automatic Heuristic Design:
Crossover (short-term memory):
- Selects two parents from the current population via tournament selection
- LLM analyzes why Parent A outperforms Parent B
- Generates offspring combining strengths of both parents
- Exploits immediate, local context of search frontier
Mutation (long-term memory):
- Pairs the current individual with the global best solution
- LLM incorporates insights from historical breakthrough
- Prevents population from forgetting globally successful patterns
- Acts as elitism strategy preserving elite knowledge
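The two operators above can be sketched as prompt builders around a tournament selection step. This is a hedged illustration only: the `Individual` class, the function names, the prompt wording, and the assumption that a higher score is better are all hypothetical and do not reproduce agent/gp_agent.py. The LLM call itself is left out; each builder returns the text that would be sent to the code model.

```python
import random
from dataclasses import dataclass


@dataclass
class Individual:
    code: str      # Python source of a candidate solution
    score: float   # validation metric; higher is assumed to be better


def tournament_select(population, k=3, rng=random):
    """Sample k individuals without replacement and return the best one."""
    contenders = rng.sample(population, min(k, len(population)))
    return max(contenders, key=lambda ind: ind.score)


def build_crossover_prompt(parent_a, parent_b):
    """Short-term memory: reflect on two parents from the current population."""
    better, worse = sorted((parent_a, parent_b), key=lambda ind: ind.score, reverse=True)
    return (
        "Two solutions from the current population are shown below.\n"
        f"Better solution (score {better.score:.4f}):\n{better.code}\n\n"
        f"Worse solution (score {worse.score:.4f}):\n{worse.code}\n\n"
        "Explain why the better solution scores higher, then write a new "
        "solution that combines the strengths of both."
    )


def build_mutation_prompt(current, global_best):
    """Long-term memory: revise an individual using the best solution seen so far."""
    return (
        f"Current solution (score {current.score:.4f}):\n{current.code}\n\n"
        f"Best solution found so far (score {global_best.score:.4f}):\n{global_best.code}\n\n"
        "Identify what makes the best solution effective and rewrite the "
        "current solution to incorporate those ideas."
    )


def select_survivors(population, offspring, n_elite=1):
    """Elitism: keep the top n_elite parents, fill the rest with the best offspring."""
    elites = sorted(population, key=lambda ind: ind.score, reverse=True)[:n_elite]
    rest = sorted(offspring, key=lambda ind: ind.score, reverse=True)
    return elites + rest[: len(population) - n_elite]
```

Presenting the better parent first gives the model an explicit contrast to reason about, which is the reflective step described in the crossover bullets.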
First, install the MLE-Bench environment following the official instructions.
git clone https://github.com/yourusername/ML-Master-GP.git
cd ML-Master-GP
conda create -n ml-master-gp python=3.12
conda activate ml-master-gp
# Install MLE-Bench (follow their README)
# Then install additional requirements
pip install -r requirements.txt

For diversity metrics, download the CodeT5 embedding model:
# The model should be placed in ./Salesforce/codet5p-110m-embedding/
# Or download from: https://huggingface.co/Salesforce/codet5p-110m-embedding

Download and prepare the MLE-Bench dataset following their instructions. The dataset is over 2 TB.
Expected structure:
/path/to/mle-bench/<competition-name>/
└── prepared
    ├── private/
    │   └── test.csv
    └── public/
        ├── description.md
        ├── sample_submission.csv
        └── train.csv
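Before launching a long run, it can help to confirm that a competition directory matches the layout above. The following is a small sketch of such a check; the `missing_files` helper and the `EXPECTED_FILES` constant are hypothetical and not part of this repository, and the file list simply mirrors the expected structure shown here.

```python
from pathlib import Path

# Relative paths taken from the expected structure; adjust if a task differs.
EXPECTED_FILES = (
    "prepared/private/test.csv",
    "prepared/public/description.md",
    "prepared/public/sample_submission.csv",
    "prepared/public/train.csv",
)


def missing_files(dataset_dir, competition):
    """Return the expected files that are absent for one competition."""
    root = Path(dataset_dir) / competition
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

For example, `missing_files("/path/to/mle-bench", "nomad2018-predict-transparent-conductors")` returns an empty list when the dataset is fully prepared.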
Set your API credentials in run.sh:
# DeepSeek config (for code generation)
code_model=deepseek-v3
code_temp=0.5
code_base_url="your_base_url"
code_api_key="your_api_key"
# GPT config (for evaluation feedback)
feedback_model=gpt-4o-2024-08-06
feedback_temp=0.5
feedback_base_url="your_base_url"
feedback_api_key="your_api_key"
# Dataset and experiment config
EXP_ID=nomad2018-predict-transparent-conductors
dataset_dir=/path/to/mle-bench

Start the grading server (validates submissions):
bash launch_server.sh

Run the GP agent:
bash run.sh

For MCTS baseline comparison:
python main_mcts.py --exp_id nomad2018-predict-transparent-conductors \
    --dataset_dir /path/to/mle-bench

Results will be saved in:
- ./logs/: Execution logs and diversity metrics
- ./working/: Generated code solutions
During development, we resolved two critical stability issues:
Problem: OSError: [Errno 5] Input/output error from print() statements in multi-threaded code.
Solution: Replaced all print() calls with thread-safe logging module.
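A minimal sketch of this kind of replacement is shown below. The logger name, the format string, and the helper functions are assumptions for illustration rather than the project's actual setup. The standard library `logging` handlers acquire a lock around each emitted record, so concurrent workers do not interleave partial lines.

```python
import io
import logging
import threading


def configure_logging(stream=None, level=logging.INFO):
    """Return a logger whose single handler serializes writes across threads."""
    logger = logging.getLogger("ml_master_gp")  # hypothetical logger name
    logger.setLevel(level)
    logger.handlers.clear()   # avoid duplicate handlers on re-configuration
    logger.propagate = False
    handler = logging.StreamHandler(stream)  # stream=None falls back to sys.stderr
    handler.setFormatter(
        logging.Formatter("%(asctime)s [%(threadName)s] %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def worker(logger, process_id):
    # Where a print() call used to be; the handler locks around each emit.
    logger.info("process %d finished", process_id)


def demo(n_workers=4):
    """Run a few threads against an in-memory stream and return the log lines."""
    buffer = io.StringIO()
    logger = configure_logging(stream=buffer)
    threads = [threading.Thread(target=worker, args=(logger, i)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return buffer.getvalue().splitlines()
```

In a long-running job, a `logging.FileHandler` could be swapped in for the stream handler so that output does not depend on the terminal staying attached.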
Problem: Exception during execution caused interpreter slots to remain permanently occupied, leading to deadlock.
Solution: Implemented try-finally blocks to guarantee resource release:
try:
    # Execution logic
    ...
finally:
    # Force release of the slot, even if execution raised
    with self.lock:
        if self.status_map[process_id] == 1:
            self.status_map[process_id] = 0
            self.current_parallel_run -= 1
    self.cleanup_session(process_id=process_id)

Extract and visualize diversity metrics:
# Extract diversity metrics from logs
python extract_diversity.py --log_dir ./logs/rungpnomad1
# Plot diversity evolution
python extract_and_plot.py --gp_log ./logs/rungpnomad1 \
--mcts_log ./logs/runnomad
# Compare code similarity between runs
python compare_similarity.py --log1 ./logs/run1 --log2 ./logs/run2

ML-Master-GP/
├── agent/
│   ├── gp_agent.py              # Genetic Programming agent
│   └── mcts_agent.py            # MCTS baseline agent
├── backend/
│   ├── backend_openai.py        # OpenAI API backend
│   └── backend_qwen.py          # Qwen API backend
├── search/
│   ├── node.py                  # Solution node representation
│   └── mcts_node.py             # MCTS-specific node
├── utils/
│   ├── diversity_utils.py       # SWDI/CDI computation
│   ├── llm_caller.py            # LLM interaction utilities
│   └── config_mcts.yaml         # Configuration file
├── interpreter/
│   └── interpreter_parallel.py  # Multi-threaded code execution
├── Salesforce/
│   └── codet5p-110m-embedding/  # CodeT5 model for embeddings
├── main_mcts.py                 # Entry point for GP agent
├── extract_diversity.py         # Diversity metrics extraction
├── extract_and_plot.py          # Visualization tools
├── grading_server.py            # Submission validation server
└── report.tex                   # Technical report (LaTeX)
- Exploration vs Exploitation: GP's population-based approach explores diverse solutions simultaneously, while MCTS tends toward depth-first local refinement
- Memory Mechanisms: Crossover (short-term) and mutation (long-term) create a balanced cognitive architecture
- Diversity Maintenance: Explicit diversity metrics and injection strategies prevent premature convergence
- Creative Problem-Solving: GP excels at tasks requiring innovative solutions rather than incremental improvements
On the Nomad task (nomad2018-predict-transparent-conductors), both GP and MCTS overfit to the validation metric. We attribute this to the task's simplicity rather than to a flaw in either algorithm. Early stopping can improve final test performance.
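One way to apply early stopping to a search run is a patience rule over the best validation score per generation. The sketch below is an assumption about how this could be done, not a feature of the repository: the function name, the `patience` and `min_delta` parameters, and the convention that higher scores are better are all hypothetical (for a loss such as an error metric, negate the values first).

```python
def early_stop(val_scores, patience=3, min_delta=0.0):
    """Return (stop_index, best_index) for patience-based early stopping.

    The search stops at the first generation that is `patience` steps past
    the last improvement larger than `min_delta`. Higher is assumed better.
    """
    if not val_scores:
        raise ValueError("val_scores must not be empty")
    best_idx = 0
    for i in range(1, len(val_scores)):
        if val_scores[i] > val_scores[best_idx] + min_delta:
            best_idx = i
        elif i - best_idx >= patience:
            return i, best_idx
    return len(val_scores) - 1, best_idx
```

The solution from `best_index` would then be submitted instead of the final one, which avoids selecting a late candidate whose gain on the validation split is within noise.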
This work builds upon and is inspired by several excellent research projects:
- ML-Master: Base framework for AI-for-AI agents with exploration and reasoning
- MLE-Bench: Comprehensive AutoML benchmarking platform
- ReEvo: LLM-driven code evolution for heuristic design
- HSEvo: Diversity metrics and semantic similarity analysis for evolutionary algorithms
- CodeT5: Pre-trained code embedding model
- Evolutionary Computation: Eiben, A.E. and Smith, J.E., 2015. Introduction to evolutionary computing. Springer.
- ReEvo: Ye et al., 2024. "ReEvo: Large Language Models as Hyper-Heuristics with Reflective Evolution"
- HSEvo: Dat et al., 2024. "HSEvo: Elevating Automatic Heuristic Design with Diversity-Driven Harmony Search and Genetic Algorithm Using LLMs"
- ML-Master: Liu et al., 2025. "ML-Master: Towards AI-for-AI via Integration of Exploration and Reasoning"
This project is released for academic research purposes. Please cite our work if you use this code:
@article{xiang2025diversity,
  title={Diversity-Driven ML-Agent with Reflective Genetic Programming},
  author={Xiang, Chuyang},
  year={2025}
}

For questions or issues, please open an issue on GitHub or contact the author.
Author: Chuyang Xiang (524031910627)