A comprehensive guide for (1) setting up Run:ai with helper scripts, (2) running PyTorch, Isaac Sim, Isaac Lab, Cosmos, CUDA, and more workloads on Run:ai, and (3) using SSH, VNC, Jupyter Lab, VSCode, TensorBoard, Nsight Systems, Nsight Compute, and more tools on Run:ai.
For running Isaac Sim workloads on Omniverse Farm, please refer to j3soon/omni-farm-isaac. These two workload managers can be used together. Adding a Run:ai project with the name `ov-farm` will allow Run:ai to act as a scheduler for Omniverse Farm.
For new users, we strongly recommend reading this entire guide and following the instructions step by step. You can skip optional sections and ignore links unless needed.
In the past, skipping this guide has led to serious issues including code and data loss when containers are terminated.
Only skip the guide if you are fully confident in what you're doing. Proceed at your own risk.
| Isaac Sim | Isaac Lab |
|---|---|
| isaac-sim-vnc.mp4 | isaac-lab-vnc.mp4 |
Please skip this section during your first read.
- j3soon/runai-all-in-one
- j3soon/runai-pytorch-mnist
- j3soon/runai-isaac-sim:4.5.0
- j3soon/runai-isaac-sim:5.0.0
- j3soon/runai-isaac-sim-ex:4.5.0
- j3soon/runai-isaac-sim-ex:5.0.0
- j3soon/runai-isaac-lab:2.1.0
- j3soon/runai-isaac-lab:2.2.0
- j3soon/runai-isaac-lab-ex:2.1.0
- j3soon/runai-isaac-lab-ex:2.2.0
- j3soon/runai-cosmos-predict1
- j3soon/runai-cosmos-transfer1
- j3soon/runai-nvhpc:25.5-devel-cuda_multi-ubuntu22.04
See the Applications section for more example applications.
Please skip this section during your first read.
- SSH
- VNC
- Jupyter Lab
- VSCode
- TensorBoard
- Nsight Systems
- Nsight Compute
See the Tools section for more tool details.
Skip this section if you are a normal user.
For cluster admins, please refer to install.md.
Note that this section is optional if you plan to use the Run:ai Dashboard directly. However, you'll still need to keep track of the following secrets and use them accordingly.
Clone this repository:
```sh
git clone https://github.com/j3soon/run-ai-isaac.git
cd run-ai-isaac
```

Fill in the Run:ai server information in `secrets/env.sh` based on the information provided by the cluster admin, for example:

```sh
export RUNAI_URL="<RUNAI_URL>"
export STORAGE_NODE_IP="<STORAGE_NODE_IP>"
export FTP_USER="<FTP_USER>"
export FTP_PASS="<FTP_PASS>"
```
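After filling in the file, you may want to quickly sanity-check that the variables are set before running any of the helper scripts. This is only an illustrative sketch; it assumes the variable names shown above:

```sh
# Quick sanity check that the secrets are set (illustrative only)
source secrets/env.sh
echo "RUNAI_URL=${RUNAI_URL}"
echo "STORAGE_NODE_IP=${STORAGE_NODE_IP}"
echo "FTP_USER=${FTP_USER}"
# Avoid echoing FTP_PASS; just check that it is non-empty
[ -n "${FTP_PASS}" ] && echo "FTP_PASS is set" || echo "FTP_PASS is missing"
```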
Skip this section if accessing your Run:ai cluster doesn't require a VPN.
Use the OpenVPN Connect v3 GUI to connect to the VPN:
- Windows (including WSL2) users: Follow the official guide.
- macOS users: Follow the official guide.
- Linux users: Use the command line to install the OpenVPN 3 Client by following the official guide.
Then, copy your `.ovpn` client config file to `secrets/client.ovpn`, install the config, and connect to the VPN with:

```sh
scripts/vpn/install_config.sh client.ovpn
scripts/vpn/connect.sh
```

To disconnect from the VPN and uninstall the VPN config, run:

```sh
scripts/vpn/disconnect.sh
scripts/vpn/uninstall_config.sh
```
These 4 scripts are just wrappers for the `openvpn3` command line tool. See the official documentation for more details.
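For reference, the underlying `openvpn3` commands look roughly like the following. This is a sketch only; verify the exact options against `openvpn3 --help` and the official documentation for your installed version:

```sh
# Roughly what the helper scripts do under the hood (verify against your openvpn3 version)
openvpn3 config-import --config secrets/client.ovpn                  # install the VPN profile
openvpn3 session-start --config secrets/client.ovpn                  # connect
openvpn3 sessions-list                                               # check connection status
openvpn3 session-manage --config secrets/client.ovpn --disconnect    # disconnect
openvpn3 config-remove --config secrets/client.ovpn                  # uninstall the VPN profile
```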
If you need to connect multiple machines to the VPN simultaneously, avoid using the same VPN profile. Doing so may cause one machine to disconnect when another connects. Consider asking the cluster admin to generate separate VPN profiles for each of your machines.
Go to <RUNAI_URL>. Ignore the warning about the self-signed certificate:
- Google Chrome: `Your connection is not private > Advanced > Proceed to <RUNAI_URL> (unsafe)`
- Firefox: `Warning: Potential Security Risk Ahead > Advanced... > Accept the Risk and Continue`
Log in to your Run:ai account with the credentials `<RUNAI_USER_EMAIL>` and `<RUNAI_USER_PASSWORD>` received from the cluster admin.
You will be prompted to change your password. Make sure to take note of the new password.
We strongly recommend following the instructions at least once to understand the cluster's logic. For example, any data stored outside the persistent NFS volume will be deleted when the container is terminated.
Pre-built Docker images for Isaac Sim, Isaac Lab, and other applications are described at the end of this document. However, we recommend following the instructions below at least once to familiarize yourself with the workflow.
We take the PyTorch MNIST training code as an example.
- Prepare your custom code and data.

  ```sh
  # Download code
  git clone https://github.com/pytorch/examples.git
  sed -i 's/download=True/download=False/g' examples/mnist/main.py
  # Download data
  # Ref: https://github.com/pytorch/vision/blob/main/torchvision/datasets/mnist.py
  mkdir -p examples/data/MNIST/raw && cd examples/data/MNIST/raw
  wget https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
  wget https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
  wget https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz
  wget https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz
  cd ../../../../
  ```
- Build a custom Docker image with the necessary dependencies and scripts for your tasks and upload it to Docker Hub.

  ```sh
  docker build -t j3soon/runai-pytorch-mnist -f docker/pytorch-mnist/Dockerfile .
  docker push j3soon/runai-pytorch-mnist
  ```

  Note that this step is optional if you are using our pre-built Docker images.

  It is highly recommended to build your custom Docker images in a Linux environment (with the NVIDIA Driver, Docker, and the NVIDIA Container Toolkit installed). Building on Windows is strongly discouraged for beginners unless you know exactly what you are doing.

  In this example, dependencies are not installed in the Dockerfile. In practice, however, you will want to select a suitable base image and pre-install all dependencies in the Dockerfile (e.g., `pip install -r requirements.txt`) to avoid reinstalling dependencies every time a container is launched; see the sketch after this walkthrough. You may also want to delete the `.dockerignore` file. In addition, ensure that you always copy the `run.sh` file and the `omnicli` directory directly to the root directory (`/`) without any modifications, rather than placing them in other subdirectories. Failing to do so will result in errors, as the script relies on absolute paths. As a side note, if your code will not be modified, you can also copy the code directly into your Docker image. However, this is usually not the case, as you often want to update your code without rebuilding the Docker image.
- Upload your dataset and code to the storage node through FTP.

  This can be done with either FileZilla or `lftp`.

  Note that some FileZilla installers may contain adware. Make sure the name of the installer does not contain the word `sponsored`.

  For FileZilla, enter the Host `${STORAGE_NODE_IP}` in `env.sh` and enter the `${FTP_USER}` and `${FTP_PASS}` provided by the cluster admin. Also make sure to set `Edit > Settings > Transfers > File Types > Default transfer type > Binary` to prevent the line endings from being changed; see this post for more details.

  For `lftp`, on your local machine run:

  ```sh
  source secrets/env.sh
  # Install and set up lftp
  sudo apt-get update && sudo apt-get install -y lftp
  echo "set ssl:verify-certificate no" >> ~/.lftprc
  # Connect to storage node
  lftp -u ${FTP_USER},${FTP_PASS} ${STORAGE_NODE_IP}
  ```

  Inside the `lftp` session, run:

  ```sh
  cd /mnt/nfs
  ls
  mkdir <YOUR_USERNAME>
  cd <YOUR_USERNAME>
  # Delete old dataset and code
  rm -r data
  rm -r mnist
  # Upload dataset and code
  mirror --reverse examples/data data
  mirror --reverse examples/mnist mnist
  # Don't close this session just yet, we will need it later
  ```

  When uploading a newer version of your code or dataset, always delete the existing directory first. This ensures that any files removed in the new version are not left behind. If you expect you will run a newer version of your code while previous tasks are still running, consider implementing a versioning system by including a version tag in the file path to prevent conflicts.
- Create a new environment for your docker image.

  Go to `Workload manager > Assets > Environments` and click `+ NEW ENVIRONMENT`. Fill in the following fields:

  - Scope: `runai/runai-cluster/<YOUR_LAB>/<YOUR_PROJECT>`
  - Environment name: `<YOUR_USERNAME>-pytorch-mnist`
  - Workload architecture & type
    - Select the type of workload that can use this environment:
      - Workspace: ✅ (Checked)
      - Training: ⬜ (Unchecked)
      - Inference: ⬜ (Unchecked)
  - Image
    - Image URL: `j3soon/runai-pytorch-mnist`
    - Image pull policy: `Always pull the image from the registry`
  - Tools
    - Tool: `Jupyter`
  - Runtime settings
    - Command: `/run.sh "pip install jupyterlab" "jupyter lab --ip=0.0.0.0 --no-browser --allow-root --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token='' --notebook-dir=/"`
    - Arguments: (Keep empty)
  - Security
    - Set where the UID, GID, and supplementary groups for the container should be taken from: `From the image`

      In newer versions of Run:ai, the default value may be `From the IdP token`.

  and then click `CREATE ENVIRONMENT`.

  You should create a new environment for each docker image you want to use. In most cases, you will only need to create one environment. In addition, you can add more tools to the environment, such as TensorBoard, or open custom ports using the `Custom` tool and `NodePort` connection type.
- Create a new GPU workload based on the environment.

  Go to `Workload manager > Workloads` and click `+ NEW WORKLOAD > Workspace`. Fill in the following fields:

  - Workspace name: `<YOUR_USERNAME>-pytorch-mnist-test1`

    and click `CONTINUE`.
  - Environment
    - Select the environment for your workload: `<YOUR_USERNAME>-pytorch-mnist`
    - (Optional) Set the connection for your tool(s): `Jupyter Access: Set to Specific user(s)`

      This optional step is not included in the screenshot below.
  - Compute resource
    - Select the node resources needed to run your workload: `gpu-x1`
  - Data sources
    - Select the data sources your workload needs to access: `<YOUR_LAB>-nfs`
  - General
    - Set the backoff limit before workload failure: `Attempts: 1`

  and then click `CREATE WORKSPACE`.

  Make sure not to accidentally select the default `jupyter-lab` environment. If you do, you'll see a `jovyan` user instead of `root`. In that case, recreate the workload with the correct environment `<YOUR_USERNAME>-pytorch-mnist`.

  In our case, we didn't limit the Jupyter access to specific users, so anyone can access the Jupyter Lab.

  The `/run.sh` file mentioned here is the same `run.sh` script that was copied directly into the Docker image without any modifications during the second step. This pre-written helper script streamlines file downloads and uploads to and from Nucleus while also supporting the sequential execution of multiple commands.
- Connect to the Jupyter Lab.

  In `Workload manager > Workloads`, select the workload you just created, click `CONNECT > Jupyter`, and then click `Terminal`.
- Extract the dataset.

  In the Jupyter Lab terminal, run:

  ```sh
  cd /mnt/nfs/<YOUR_USERNAME>/data/MNIST/raw
  ls
  gzip -dk train-images-idx3-ubyte.gz
  gzip -dk train-labels-idx1-ubyte.gz
  gzip -dk t10k-images-idx3-ubyte.gz
  gzip -dk t10k-labels-idx1-ubyte.gz
  ls
  ```

  Although `/mnt/nfs` is a Network File System (NFS) mounted volume, it typically isn't the bottleneck during training. However, if you notice that your dataloader is causing performance issues, consider copying the dataset to the container's local storage before starting the training process. The NFS volume may also cause issues if you are using `tar` on the mounted volume; make sure to use the `--no-same-owner` flag to prevent the `tar: XXX: Cannot change ownership to uid XXX, gid XXX: Operation not permitted` error.
- Start Training.

  In the Jupyter Lab terminal, run:

  ```sh
  nvidia-smi
  apt-get update
  apt-get install -y tree
  tree /mnt/nfs/<YOUR_USERNAME>/data
  cd /mnt/nfs/<YOUR_USERNAME>/mnist
  pip install -r requirements.txt
  python main.py --save-model --epochs 1
  ```

  The `apt-get install` and `pip install` commands here are only for demonstration purposes; installing packages at runtime is not recommended, as it can slow down the task and potentially cause issues. It is recommended to include all dependencies in the Docker image by specifying them in the Dockerfile.

  Make sure to store all checkpoints and output files in `/mnt/nfs`. Otherwise, after the container is terminated, all files outside of `/mnt/nfs` (including the home directory) will be permanently deleted. This is because containers are ephemeral and only the NFS mount persists between runs.
- Download the results.

  Inside the previous `lftp` session, run:

  ```sh
  cd /mnt/nfs/<YOUR_USERNAME>/mnist
  cache flush
  ls
  # Download the results
  get mnist_cnn.pt
  rm mnist_cnn.pt
  ```

  Make sure to delete the results after downloading to save storage space.
- Delete the workload.

  Go to `Workload manager > Workloads`, select the workload you just created, and click `DELETE`. Please always `STOP` or `DELETE` the workload after you are done with the task to allow maximum resource utilization.
- As an alternative to interactive Jupyter Lab workloads, you may want to submit a batch workload.

  Go to `Workload manager > Workloads` and click `+ NEW WORKLOAD > Workspace`. Fill in the following fields:

  - Workspace name: `<YOUR_USERNAME>-pytorch-mnist-test2`

    and click `CONTINUE`.
  - Environment
    - Select the environment for your workload: `<YOUR_USERNAME>-pytorch-mnist`
    - Set a command and arguments for the container running in the pod:
      - Command: `/run.sh "cd /mnt/nfs/<YOUR_USERNAME>/mnist" "python main.py --save-model --epochs 1"`
  - Compute resource
    - Select the node resources needed to run your workload: `gpu-x1`
  - Data sources
    - Select the data sources your workload needs to access: `<YOUR_LAB>-nfs`
  - General
    - Set the backoff limit before workload failure: `Attempts: 1`

  and then click `CREATE WORKSPACE`.

  Note that the batch workload will automatically restart once when it fails, since we set the backoff limit to 1. There is currently no way to set the backoff limit to 0, so make sure a workload restart will not overwrite your previous results.

  After the workload is completed, click `SHOW DETAILS` to see the logs.
- Similar to the interactive workload, you should see the checkpoint and output files at `/mnt/nfs/<YOUR_USERNAME>/mnist/mnist_cnn.pt` through FTP.
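As referenced in the Docker image step above, a typical custom image pre-installs dependencies and copies `run.sh` and `omnicli` to the root directory. The following is a minimal sketch only; the base image tag, file paths, and the `<YOUR_DOCKERHUB_USER>` placeholder are assumptions and should be adapted to your project:

```sh
# Minimal sketch of a custom Dockerfile, written via a heredoc so it can be pasted into a shell.
# The base image tag and paths below are assumptions; adapt them to your project.
cat > docker/pytorch-mnist/Dockerfile.custom <<'EOF'
FROM pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime
# Pre-install dependencies so containers don't need to install them at runtime
COPY examples/mnist/requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
# run.sh and omnicli must sit at the container root, since the helper script uses absolute paths
COPY run.sh /run.sh
COPY omnicli /omnicli
RUN chmod +x /run.sh
EOF
docker build -t <YOUR_DOCKERHUB_USER>/runai-pytorch-mnist -f docker/pytorch-mnist/Dockerfile.custom .
docker push <YOUR_DOCKERHUB_USER>/runai-pytorch-mnist
```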
Make sure to always add your username as a prefix to your environment name and workload name. This helps prevent others from accidentally modifying your setups.
For downloading large files or directories, consider using `tar` with `pigz` to compress the files in parallel. See `tar + pigz` and `tar + pv + pigz` for examples.
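For instance, a rough sketch of compressing a results directory in parallel before downloading it through FTP (the paths are illustrative, and `pigz` must be available in the container image or installed beforehand):

```sh
# Compress a results directory in parallel with pigz before downloading it (illustrative paths)
cd /mnt/nfs/<YOUR_USERNAME>
tar --use-compress-program=pigz -cf results.tar.gz mnist
# ...download results.tar.gz through FTP, then remove it to save shared storage space
rm results.tar.gz
```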
As a side note, you may want to use Wandb to log your training results. This allows you to visualize the training progress of all your workloads in a single dashboard.
Now that you have a basic understanding of the workflow, here are a few tips to help you work more efficiently:
- Build and test locally first. Always create your custom Docker image on a local Linux machine and test it there before deploying to Run:ai. This makes debugging easier and prevents wasting GPU resources on Run:ai.
- Use persistent storage wisely. Store all code and data in the persistent NFS volume, back them up regularly to your local machine, and remove unnecessary files to save shared storage space on Run:ai. To minimize performance impact, copy the dataset to the container's local storage before starting the training process (see the sketch after this list), and reduce checkpointing frequency.
- Prefer batch workloads. When possible, use batch workloads so containers terminate automatically after tasks complete, freeing GPU resources for others.
- Use interactive Jupyter Lab only when needed. Reserve interactive workloads for debugging, and always stop or delete them when finished to release the resources. Depending on your cluster policy, idle interactive workloads may be automatically terminated without warning after a set time or during maintenance. Keeping an idle interactive workload running for days is often frowned upon, unless you have contacted the cluster admin and received explicit permission.
- Request minimal GPU resources. If you are not sure about the minimum GPU resources required for your task, request minimal resources (`gpu-x1`) first. You can always request more resources (e.g., `gpu-x2`, `gpu-x4`, `gpu-x8`) later. In addition, don't submit CPU workloads (`gpu-x0`, `cpu-only`) on a GPU node pool unless you have contacted the cluster admin and received explicit permission.
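As mentioned in the storage tip above, copying the dataset from the NFS volume to the container's local storage can avoid dataloader bottlenecks. A minimal sketch, assuming the MNIST layout used in this walkthrough and using `/workspace` as the (ephemeral) local directory:

```sh
# Copy the dataset from NFS to the container's ephemeral local storage (paths are assumptions)
mkdir -p /workspace/data
cp -r /mnt/nfs/<YOUR_USERNAME>/data/MNIST /workspace/data/
# If the dataset lives on NFS as a tarball, extract it to local storage instead of copying
# (use --no-same-owner whenever extracting onto the NFS volume itself):
# tar --no-same-owner -xzf /mnt/nfs/<YOUR_USERNAME>/dataset.tar.gz -C /workspace/data
# Then point your training script at /workspace/data instead of the NFS path.
```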
For more sample applications (such as Isaac Sim and Isaac Lab), please refer to the Applications section.
See the Developer Notes for more details.
This project has been made possible through the support of NVIDIA AI Technology Center (NVAITC).
I must thank Kuan-Ting Yeh for his invaluable support in investigating and resolving various issues, whether it was day or night, weekday or weekend.
Disclaimer: this is not an official NVIDIA product.
For more information on how to use Run:ai, please refer to the Run:ai Documentation.