A distributed deep learning approach implemented with Flower and TensorFlow
First, set up the local device that will coordinate training.
Clone the repo and move into it:
$ git clone https://github.com/dfl0/hcddl.git && cd hcddl
Create a Python virtual environment:
$ python -m venv venv
Install dependencies:
$ venv/bin/pip install -r resources/requirements.txt
Finally, make sure Ansible is also installed on your local machine.
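For example, you can confirm that it is available and check which version is installed with:
$ ansible --version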
Next, create an inventory file specifying the pool of devices to participate in the training. Either host aliases or direct IP addresses can be used. For example,
[pool]
pi1
pi2
pi3
pi4
pi5
pi6
pi7
...
Or, alternatively:
[pool]
192.168.2.101
192.168.2.102
192.168.2.103
192.168.2.104
192.168.2.105
192.168.2.106
192.168.2.107
...
The name of the inventory file is arbitrary, but it must include a pool group and follow a format supported by Ansible.
Tip
Use ansible -m ping -i <inventory-file> all to verify that all devices are reachable before setup.
Before training can be deployed, all of the devices must be set up with the necessary Python version, dependencies, and files. The following command will take care of everything:
$ ansible-playbook -i <inventory-file> -u <remote-user> playbooks/setup.yml
This will do a few things:
- Ensure the necessary Python version is installed on all the devices
- Create a new ~/hcddl directory
- Create a virtual environment in ~/hcddl and install all the necessary packages into it
- Copy over the scripts that start up the training roles on the devices
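The playbook handles all of this, so nothing needs to be edited here. Purely as an illustration of the kind of tasks involved (this is a rough sketch, not the repo's actual playbooks/setup.yml, and it assumes Debian-based devices such as those running Raspberry Pi OS), an equivalent playbook might look like:
---
- hosts: pool
  tasks:
    - name: Ensure Python 3 and venv support are installed
      become: true
      ansible.builtin.apt:
        name: [python3, python3-venv, python3-pip]
        state: present

    - name: Create the ~/hcddl working directory
      ansible.builtin.file:
        path: ~/hcddl
        state: directory

    - name: Copy the requirements file and role start-up scripts to the device
      ansible.builtin.copy:
        src: resources/   # hypothetical controller-side path
        dest: ~/hcddl/

    - name: Create a virtual environment in ~/hcddl and install the packages
      ansible.builtin.pip:
        requirements: ~/hcddl/requirements.txt
        virtualenv: ~/hcddl/venv
        virtualenv_command: python3 -m venv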
You will need to create a config.yml file to specify the training parameters and the desired architecture with its associated options. Both flat (single-layer) and hierarchical (two-layer) architectures are supported.
Important
Make sure your device pool has enough devices to support your configuration: one device is needed for each parameter server and each worker (for example, a flat setup with four workers requires five devices).
You can set the number of server rounds, target accuracy, and batch size in config.yml as follows:
---
num_server_rounds: 5
target_accuracy: 0.9
batch_size: 32
...
To use a flat architecture for training, set arch to flat and specify just a parameter server by giving it a type and number of workers.
For example, a full config.yml file could look like this:
---
num_server_rounds: 5
target_accuracy: 0.9
batch_size: 32
arch: flat
parameter_server:
  aggr_type: sync
  num_workers: 4
This configuration would assign the devices in the above inventory file as follows:
graph TD;
pi1---pi2;
pi1---pi3;
pi1---pi4;
pi1---pi5;
Here, pi1 is the parameter server, and pi2, pi3, pi4, and pi5 are the workers.
To use a hierarchical architecture for training, set arch to hierarchical, specify a global parameter server by giving it a type, and list its local servers, each with their own type and number of workers.
For example, a full config.yml file could look like this:
---
num_server_rounds: 5
target_accuracy: 0.9
batch_size: 32
arch: hierarchical
global_server:
  aggr_type: sync
local_servers:
  - type: async
    num_workers: 2
  - type: async
    num_workers: 2
This configuration would assign the devices in the above inventory file as follows:
graph TD;
pi1---pi2;
pi2---pi3;
pi2---pi4;
pi1---pi5;
pi5---pi6;
pi5---pi7;
Here, pi1 is the global parameter server, pi2 and pi5 are the local parameter servers, and pi3, pi4, pi6, and pi7 are the workers.
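The local_servers list is not limited to two entries. As a purely hypothetical illustration (it reuses only keys and values that appear in the examples above, and assumes the config schema accepts additional local servers), three local servers with three workers each would require a pool of at least 13 devices: 1 global server + 3 local servers + 9 workers.
---
num_server_rounds: 5
target_accuracy: 0.9
batch_size: 32
arch: hierarchical
global_server:
  aggr_type: sync
local_servers:
  - type: async
    num_workers: 3
  - type: async
    num_workers: 3
  - type: async
    num_workers: 3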
Now that everything is set up, the training can be deployed and started with the following command:
$ ansible-playbook -i <inventory-file> -u <remote-user> playbooks/run.yml
Once the training begins, logs are written to ~/hcddl/run.log on all of the devices. To see the live output on any single device, the log file can be followed on the device itself as it is updated with the following command:
$ tail -f ~/hcddl/run.log
Alternatively, this can be done from the local machine directly with the following command:
$ ssh <remote-user>@<remote-host> "tail -f ~/hcddl/run.log"
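To take a quick snapshot of progress across every device at once instead of following a single one, an Ansible ad-hoc command can be used (a convenience example, not part of the repo's playbooks):
$ ansible -i <inventory-file> -u <remote-user> -m shell -a "tail -n 20 ~/hcddl/run.log" all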
If any Flower processes remain active on the remote devices (e.g., if the local machine goes down), they can be terminated from the local machine by running the following command:
$ ansible -i <inventory-file> -u <remote-user> -m shell -a "pgrep flower | xargs kill" all
This takes care of all remote devices specified in the inventory file.