This repo is a staging ground for the new MLPerf reinforcement model. Eventually this code will replace the code in this directory.
This task benchmarks on-policy reinforcement learning for the 9x9 version of the board game Go. The model plays games against itself and uses these games to improve its play.
To set up the environment on Ubuntu 16.04 (16 CPUs, one P100, 100 GB disk), you can use these commands. The steps may vary on a different operating system or graphics card.
# Clone repository
git clone https://github.com/tensorflow/minigo
# Note: This will eventually change to:
# git clone http://github.com/mlperf/training
# Install dependencies
apt-get install -y python3 python3-pip rsync git wget pkg-config zip g++ zlib1g-dev unzip
# Create a virtualenv (this step is optional but highly recommended).
pip3 install virtualenv
pip3 install virtualenvwrapper
virtualenv -p /usr/bin/python3 --system-site-packages $HOME/.venvs/minigo
source $HOME/.venvs/minigo/bin/activate
# Install Python dependencies
pip3 install -r requirements.txt
# Install Python Tensorflow for GPU
# (alternatively use "tensorflow>=1.13.1" for CPU Tensorflow)
pip3 install "tensorflow-gpu>=1.13.1"
# Install bazel
wget https://github.com/bazelbuild/bazel/releases/download/0.19.2/bazel-0.19.2-installer-linux-x86_64.sh
chmod +x bazel-0.19.2-installer-linux-x86_64.sh
sudo ./bazel-0.19.2-installer-linux-x86_64.sh
rm bazel-0.19.2-installer-linux-x86_64.sh
# Compile TensorFlow C++ libraries
sudo sh -c "echo /usr/local/cuda/lib64 > /etc/ld.so.conf.d/cuda.conf"
sudo ldconfig
./cc/configure_tensorflow.sh
# Compile the C++ self-play and evaluation binaries
bazel build -c opt --define=tf=1 --define=board_size=9 cc:selfplay cc:eval
# Download required files from Google Cloud Storage
BOARD_SIZE=9 python ml_perf/get_data.py
BASE_DIR=$(pwd)/results/$(date +%Y-%m-%d)
# Run training loop
BOARD_SIZE=9 python ml_perf/reference_implementation.py \
--base_dir=$BASE_DIR \
--flagfile=ml_perf/flags/9/rl_loop.flags
# Once the training loop has finished, run model evaluation to find the
# first trained model that's better than the target
BOARD_SIZE=9 python ml_perf/eval_models.py \
--base_dir=$BASE_DIR \
--flags_dir=ml_perf/flags/9
Unlike other benchmarks, there is no training data to download. All training data comes from games played during benchmarking.
This benchmark is based on the Minigo project, which is inspired by the work done by DeepMind with "Mastering the Game of Go with Deep Neural Networks and Tree Search", "Mastering the Game of Go without Human Knowledge", and "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm".
Minigo is built on top of Brian Lee's MuGo, a pure Python implementation of the first AlphaGo paper.
Note that Minigo is an independent effort from AlphaGo.
This benchmark includes both the environment and training for 9x9 Go. The benchmark has four primary phases, which are repeated in order:
- Selfplay: the current best model plays games with itself as both black and white to produce board positions for training.
- Training: train the neural networks on selfplay data from recent models, updating the network weights from the recent selfplay games.
- Model Evaluation: the current best and the most recently trained model play a series of games. In order to become the new current best, the most recently trained model must win 55% of the games.
- Target Evaluation: if the newly trained model has been promoted to the current best, play a series of games against a target model that was previously trained via this reference benchmark. The termination criterion for the benchmark is to win at least 50% of the games.
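The four phases above can be sketched as a simple control loop. This is a minimal illustration, not the reference implementation: `selfplay`, `train`, and `play_match` are hypothetical callables standing in for the real binaries and scripts, and the sketch assumes "win 55%" means at least 55% of the games.

```python
PROMOTION_PERCENT = 55  # win rate needed to become the new current best
TARGET_PERCENT = 50     # win rate against the target that ends the benchmark


def meets(wins, games, percent):
    """Return True if wins/games is at least the given percentage."""
    return wins * 100 >= percent * games


def run_loop(selfplay, train, play_match, initial_model, target_model,
             max_iterations=100):
    """Repeat the four phases until a promoted model beats the target.

    selfplay, train and play_match are hypothetical callables supplied by
    the caller. Returns (model, iteration) on success, (None, max_iterations)
    if no model reached the target.
    """
    best = initial_model
    for iteration in range(1, max_iterations + 1):
        games = selfplay(best)                       # 1. Selfplay
        candidate = train(best, games)               # 2. Training
        wins, total = play_match(candidate, best)    # 3. Model evaluation
        if not meets(wins, total, PROMOTION_PERCENT):
            continue  # candidate rejected; the current best is unchanged
        best = candidate
        wins, total = play_match(best, target_model)  # 4. Target evaluation
        if meets(wins, total, TARGET_PERCENT):
            return best, iteration
    return None, max_iterations
```

Note that target evaluation is only reached when the candidate has been promoted, so rejected candidates never play the target model.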
This task has a non-trivial network structure, including a search tree. A good overview of the structure can be found here: https://medium.com/applied-data-science/alphago-zero-explained-in-one-diagram-365f5abf67e0.
Network weights are initialized randomly. Initialization and loss are described in "Mastering the Game of Go with Deep Neural Networks and Tree Search".
We use a MomentumOptimizer to train the network.
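As a rough sketch of what a momentum update does, the plain-Python function below applies the standard (non-Nesterov) rule: accumulate a velocity from the gradients, then step the weights along it. The function name, the learning rate, and the momentum value are illustrative assumptions, not the settings used by this benchmark.

```python
def momentum_step(weights, grads, velocity, learning_rate=0.1, momentum=0.9):
    """Apply one momentum update to a flat list of weights.

    velocity <- momentum * velocity + grad
    weight   <- weight - learning_rate * velocity
    The default hyperparameters are illustrative only.
    """
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - learning_rate * v
                   for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy usage: minimize f(w) = w^2, whose gradient is 2w.
weights, velocity = [1.0], [0.0]
for _ in range(2):
    grads = [2.0 * w for w in weights]
    weights, velocity = momentum_step(weights, grads, velocity)
```

The velocity term lets consecutive gradients that point the same way build up larger steps, which is the main difference from plain gradient descent.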
Due to the difficulty of training a highly proficient Go model, our quality metric and termination criteria are based on winning against a model of only intermediate amateur strength.
The quality of a model is measured as the number of games won in a playoff (alternating colors) of 100 games against a previously trained model.
The quality target is to win 50% of the games.
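The quality metric can be sketched as follows. Here `play_game` is a hypothetical callable that plays one game with the candidate as black or white and returns `True` if the candidate wins; the sketch assumes the candidate takes black in even-numbered games, which is one possible way to alternate colors.

```python
def play_playoff(play_game, num_games=100):
    """Play num_games against the target, alternating the candidate's color.

    play_game(candidate_is_black) is a hypothetical callable returning True
    when the candidate wins. Returns the number of games the candidate won.
    """
    wins = 0
    for game_index in range(num_games):
        candidate_is_black = game_index % 2 == 0  # assumed alternation order
        if play_game(candidate_is_black):
            wins += 1
    return wins


def meets_quality_target(wins, num_games=100, target_percent=50):
    """Return True if the candidate won at least target_percent of games."""
    return wins * 100 >= target_percent * num_games
```

For example, a candidate that wins 50 of the 100 playoff games meets the target, while one that wins 49 does not.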
Informally, we have observed that quality should improve roughly linearly with time. We observed roughly 0.5% improvement in quality per hour of runtime. An example of approximately how we've seen quality progress over time:
Approx. Hours to Quality (16 CPU & 1 P100)
1h x%
2h x%
4h x%
8h x%
Note that quality does not necessarily monotonically increase.
Target evaluation only needs to be performed for models which pass model evaluation.