To run single-node experiments, use the following command:

python driver.py

Pass the --help flag to list the available options. Pass the --use-gpu flag to run the experiments on GPUs. For ease of experimentation, some flags are given default values in the main() function of driver.py.
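As a rough illustration of how a boolean flag such as --use-gpu and a defaulted option can coexist, here is a minimal argparse sketch. It is not the code in driver.py: the function names and the --epochs option are hypothetical, and only --use-gpu and --help are taken from this README.

```python
import argparse


def build_parser():
    # Illustrative parser only; the real driver.py may define its options differently.
    parser = argparse.ArgumentParser(description="Illustrative experiment driver")
    # --use-gpu is named in the README; store_true makes it False unless passed.
    parser.add_argument("--use-gpu", action="store_true",
                        help="run the experiments on GPUs")
    # --epochs is a hypothetical option showing a default that saves typing.
    parser.add_argument("--epochs", type=int, default=1,
                        help="hypothetical option with a default value")
    return parser


def parse_options(argv=None):
    # argv=None makes argparse read sys.argv[1:], as a script entry point would.
    return build_parser().parse_args(argv)
```

Calling parse_options(["--use-gpu"]) yields a namespace with use_gpu set to True and epochs left at its default, while --help prints the generated option list.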

To run multi-node experiments, use the launcher script train.sh. Before running it, edit the script to set the number of nodes (--num-nodes), the address of the master node (--master-addr), and the total number of trainer processes (--num-trainers). Also set the network interface that TensorPipe should use by assigning the environment variables GLOO_SOCKET_IFNAME and TP_SOCKET_IFNAME the output of echo $(ip r | grep default | awk '{print $5}'). Then run the following on each node:

./train.sh <node_rank>

Here <node_rank> is the unique rank of the node, an integer from 0 to num_nodes-1.
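The interface lookup described above (ip r | grep default | awk '{print $5}') can also be sketched in Python, which may help when checking what value the two environment variables should receive. This is only an illustration under stated assumptions: the function names are hypothetical, the first matching route line is used, and train.sh itself is not required to work this way.

```python
import subprocess
from typing import Dict, Optional


def parse_default_interface(route_output: str) -> Optional[str]:
    # Mirror `grep default | awk '{print $5}'`: take the fifth
    # whitespace-separated field of the first line containing "default".
    for line in route_output.splitlines():
        if "default" in line:
            fields = line.split()
            if len(fields) >= 5:
                return fields[4]
    return None


def socket_ifname_env(ifname: str) -> Dict[str, str]:
    # Both variables named in the README get the same interface name.
    return {"GLOO_SOCKET_IFNAME": ifname, "TP_SOCKET_IFNAME": ifname}


def detect_default_interface() -> Optional[str]:
    # Assumes the iproute2 `ip` command is installed on the node.
    result = subprocess.run(["ip", "r"], capture_output=True,
                            text=True, check=True)
    return parse_default_interface(result.stdout)
```

For a route line such as "default via 10.0.0.1 dev eth0 proto dhcp metric 100", the fifth field is eth0, so both variables would be set to eth0. If detect_default_interface() returns a name, os.environ.update(socket_ifname_env(name)) would apply it to the current process.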


osalpekar/DLRM_RPC
