This repository contains the basic steps to start running scripts and notebooks on the EPFL RCP cluster. We provide scripts that make your life easier by automating most of the boilerplate. The setup is loosely based on infrastructure from TML/CLAIRE and earlier scripts by Atli.
The RCP cluster provides:
- GPUs: A100 (40GB/80GB), H100 (80GB), H200 (141GB), V100
- Stack: Docker (containers), Kubernetes (orchestration), run:ai (scheduler)
- FAQ: Check the frequently asked questions page
- Slack: Reach out on the #-cluster or #-it channels
- Resources: See quick links below
Tip
If you have little prior experience with ML workflows, the setup below may seem daunting at first. You can copy‑paste the commands in order; the scripts are designed to hide most of the complexity. The only requirement is that you have a basic understanding of how to use a terminal and git.
Caution
Using the cluster incurs costs. Please be mindful of the resources you use, and do not forget to stop your jobs when they are no longer needed!
Content overview:
- Quick Start
- Setup Guide
- Using VS Code
- Recommended Workflow
- csub.py Usage and Arguments
- Advanced Topics
- Reference
Tip
TL;DR – After completing the setup, interaction with the cluster looks like this:
# Start an interactive job with 1 GPU
python csub.py -n sandbox
# Connect to your job
runai exec sandbox -it -- zsh
# Run your code
cd /mloscratch/homes/<your_username>
python main.py
# Or start a training job in one command
python csub.py -n experiment --train --command "cd /mloscratch/homes/<your_username>/<your_code>; python main.py"
Important
Network requirement: You must be on the EPFL WiFi or connected to the VPN. The cluster is not accessible otherwise.
1. Request cluster access
Ask Jennifer or Martin to add you to the runai-mlo group: https://groups.epfl.ch/
2. Prepare your code repository
While waiting for access, create a GitHub repository for your code. This is best practice regardless of our cluster setup.
3. Set up experiment tracking (optional)
- Weights & Biases: Create an account at wandb.ai and get your API key
- Hugging Face: Create an account at huggingface.co and get your token (if using their models)
Important
Platform note: The setup below was tested on macOS with Apple Silicon. For other systems, adapt the commands accordingly.
- Linux: Replace darwin/arm64 with linux/amd64 in URLs
- Windows: Use WSL (Windows Subsystem for Linux)
Download and install kubectl v1.30.11 (matching the cluster version):
# macOS with Apple Silicon
curl -LO "https://dl.k8s.io/release/v1.30.11/bin/darwin/arm64/kubectl"
# Linux (AMD64)
# curl -LO "https://dl.k8s.io/release/v1.30.11/bin/linux/amd64/kubectl"
# Install
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl
sudo chown root: /usr/local/bin/kubectl
See https://kubernetes.io/docs/tasks/tools/install-kubectl/ for other platforms.
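You can optionally verify the installation; this assumes kubectl is now on your PATH:
# Should print the client version, e.g. v1.30.11
kubectl version --client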
Download the kube config file to ~/.kube/config:
mkdir -p ~/.kube
curl -o ~/.kube/config https://raw.githubusercontent.com/epfml/getting-started/main/kubeconfig.yaml
Download and install the run:ai CLI:
# macOS with Apple Silicon
wget --content-disposition https://rcp-caas-prod.rcp.epfl.ch/cli/darwin
# Linux (replace 'darwin' with 'linux')
# wget --content-disposition https://rcp-caas-prod.rcp.epfl.ch/cli/linux
# Install
chmod +x ./runai
sudo mv ./runai /usr/local/bin/runai
sudo chown root: /usr/local/bin/runai
Then log in to run:ai and configure your project:
runai login
# List available projects
runai list projects
# Set your default project
runai config project mlo-$GASPAR_USERNAME
To verify cluster access:
kubectl get nodes
You should see the RCP cluster nodes listed.
This setup keeps all personal configuration and secrets in a local .env file (never committed to git).
git clone https://github.com/epfml/getting-started.git
cd getting-started
cp user.env.example .env
Open .env in an editor and configure:
| Variable | Description | Example |
|---|---|---|
| `LDAP_USERNAME` | Your EPFL/Gaspar username | `jdoe` |
| `LDAP_UID` | Your numeric LDAP user ID | `123456` |
| `LDAP_GROUPNAME` | Your LDAP group name (for MLO: `MLO-unit`) | `MLO-unit` |
| `LDAP_GID` | Your LDAP group ID (for MLO: 83070) | `83070` |
| `RUNAI_PROJECT` | Your project | `mlo-<username>` |
| `K8S_NAMESPACE` | Your namespace | `runai-mlo-<username>` |
| `RUNAI_IMAGE` | Docker image | `ic-registry.epfl.ch/mlo/mlo-base:uv1` |
| `RUNAI_SECRET_NAME` | Secret name | `runai-mlo-<username>-env` |
| `WORKING_DIR` | Working directory | `/mloscratch/homes/<username>` |
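For reference, a filled-in .env might look like the sketch below; every value is a placeholder derived from the examples above, so substitute your own:
# Example .env (placeholder values – replace with your own)
LDAP_USERNAME=jdoe
LDAP_UID=123456
LDAP_GROUPNAME=MLO-unit
LDAP_GID=83070
RUNAI_PROJECT=mlo-jdoe
K8S_NAMESPACE=runai-mlo-jdoe
RUNAI_IMAGE=ic-registry.epfl.ch/mlo/mlo-base:uv1
RUNAI_SECRET_NAME=runai-mlo-jdoe-env
WORKING_DIR=/mloscratch/homes/jdoe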
To ensure correct file permissions:
# SSH into HaaS machine (use your Gaspar password)
ssh <your_gaspar_username>@haas001.rcp.epfl.ch
# Get your UID
id
Copy the number after uid= (e.g., uid=123456) into LDAP_UID in your .env file.
Optionally configure in .env:
- WANDB_API_KEY – Weights & Biases API key
- HF_TOKEN – Hugging Face token
- GIT_USER_NAME / GIT_USER_EMAIL – Git identity for commits
- GitHub SSH keys (auto-loaded from ~/.ssh/github if empty): GITHUB_SSH_KEY_PATH / GITHUB_SSH_PUBLIC_KEY_PATH to override the default paths
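For example, the optional entries might look like this (the values shown are placeholders):
# Optional .env entries (placeholder values)
WANDB_API_KEY=<your_wandb_api_key>
HF_TOKEN=<your_hf_token>
GIT_USER_NAME=<your_name>
GIT_USER_EMAIL=<your_email>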
The secret is automatically synced when starting a job. To manually sync:
python csub.py --sync-secret-only
Now start your first interactive job:
python csub.py -n sandbox
This can take a few minutes. Monitor the status:
# List all jobs
runai list
# Check specific job status
runai describe job sandbox
Once the status shows Running:
runai exec sandbox -it -- zsh
You should now be inside a terminal on the cluster! 🎉
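If your job requested a GPU (e.g., with -g 1), you can quickly check that it is visible inside the pod; this assumes the NVIDIA tools are available in the container, as they are for standard GPU images:
# Inside the pod: list the allocated GPU(s)
nvidia-smi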
Inside the pod, clone your code into your scratch home folder:
cd /mloscratch/homes/<your_username>
git clone https://github.com/<your_username>/<your_repo>.git
cd <your_repo>
The default image includes uv as the recommended package manager (pip also works):
# Create and activate virtual environment
uv venv .venv
source .venv/bin/activate
# Install dependencies
uv pip install -r requirements.txt
Then run your code:
python main.py
If you configured WANDB_API_KEY or HF_TOKEN in .env, authentication should work automatically.
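To double-check that the secrets were injected, you can probe the environment variables from inside the pod:
# Prints a message only if the corresponding variable is set
echo "${WANDB_API_KEY:+WANDB_API_KEY is set}"
echo "${HF_TOKEN:+HF_TOKEN is set}"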
For remote development on the cluster:
- Install extensions
- Attach to your pod:
  - Navigate to: Kubernetes → rcp-cluster → Workloads → Pods
  - Right-click your pod → Attach Visual Studio Code
  - Open /mloscratch/homes/<your_username> in the remote session
For detailed instructions, see the Managing Workflows guide.
Tip
Development cycle:
- Develop code locally or on the cluster (using VS Code)
- Push changes to GitHub
- Run experiments on the cluster via runai exec sandbox -it -- zsh
- Keep code and experiments organized and reproducible
Important
Critical reminders:
- Pods can be killed anytime – Implement checkpointing and recovery (see the sketch below)
- Store files on scratch – Everything in ~/ is lost when pods restart
- Use /mloscratch/homes/<username> – Shell config and VS Code settings persist here
- Delete failed jobs – Run runai delete job <name> before restarting
- Background jobs – Use training mode: python csub.py -n exp --train --command "..."
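A minimal sketch of a resumable training submission, assuming your own train.py saves checkpoints and accepts a --resume flag (the flag name is illustrative, not part of this repo):
# Relaunch-friendly training job: the script itself must save and reload checkpoints
python csub.py -n exp --train --command \
  "cd /mloscratch/homes/<username>/<your_code> && python train.py --resume checkpoints/"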
Caution
Using the cluster incurs costs. Always stop your jobs when not in use!
For detailed workflow guidance, see the Managing Workflows guide.
The csub.py script is a thin wrapper around the run:ai CLI that simplifies job submission by:
- Reading configuration and secrets from .env
- Syncing Kubernetes secrets automatically
- Constructing and executing runai submit commands
python csub.py -n <job_name> -g <num_gpus> -t <time> --command "<cmd>" [--train]
Examples:
# CPU-only pod for development
python csub.py -n dev-cpu
# Interactive development pod with 1 GPU
python csub.py -n dev-gpu -g 1
# Training job with 4 A100 GPUs
python csub.py -n experiment --train -g 4 --command "cd /mloscratch/homes/user/code; python train.py"
# Use specific GPU type
python csub.py -n my-job -g 2 --node-type h100 --train --command "..."
# Dry run (see command without executing)
python csub.py -n test --dry --command "..."

| Argument | Description | Default |
|---|---|---|
| `-n, --name` | Job name | Auto-generated (username + timestamp) |
| `-g, --gpus` | Number of GPUs | `0` (CPU-only) |
| `-t, --time` | Maximum runtime (e.g., `12h`, `2d6h30m`) | `12h` |
| `-c, --command` | Command to run | `sleep <duration>` |
| `--train` | Submit as training workload (non-interactive) | Interactive mode |
| `-i, --image` | Docker image | From `RUNAI_IMAGE` in `.env` |
| `--node-type` | GPU type: `v100`, `h100`, `h200`, `default`, `a100-40g` | `default` (A100) |
| `--cpus` | Number of CPUs | Platform default |
| `--memory` | CPU memory request | Platform default |
| `-p, --port` | Expose container port (for Jupyter, etc.) | None |
| `--large-shm` | Request a larger `/dev/shm` | False |
| `--host-ipc` | Share host IPC namespace | False |
| `--backofflimit` | Retries before marking a training job failed | 0 |
| Argument | Description |
|---|---|
| `--sync-secret-only` | Only sync `.env` to the Kubernetes secret, don't submit a job |
| `--skip-secret-sync` | Don't sync the secret before submission |
| `--secret-name` | Override `RUNAI_SECRET_NAME` from `.env` |
| `--env-file` | Path to the `.env` file |
| Argument | Description |
|---|---|
| `--uid` | Override `LDAP_UID` from `.env` |
| `--gid` | Override `LDAP_GID` from `.env` |
| `--pvc` | Override `SCRATCH_PVC` from `.env` |
| `--dry` | Print the command without executing |
After submitting, csub.py prints useful follow-up commands:
runai describe job <name> # Check job status
runai logs <name> # View logs
runai exec <name> -it -- zsh # Connect to pod
runai delete job <name>      # Delete job
Run python csub.py -h for the complete help text.
For detailed guides on day-to-day operations, see the Managing Workflows guide:
- Pod management – Commands to list, describe, delete jobs
- Important workflow notes – Job types, GPU selection, best practices
- HaaS machine – File transfer between storage systems
- File management – Understanding storage (mloscratch, mlodata1, mloraw1)
- Run:ai CLI directly: See docs/runai_cli.md for using run:ai without csub.py
- Custom Docker images: See Creating Custom Images
- Distributed training: See docs/multinode.md for multi-node jobs
If you need custom dependencies:
- Get registry access
  - Login at https://ic-registry.epfl.ch/ and verify you see the MLO project
  - The runai-mlo group should already have access
- Install Docker
  brew install --cask docker   # macOS
  If you get "Cannot connect to the Docker daemon", run the Docker Desktop GUI first.
- Login to registry
  docker login ic-registry.epfl.ch   # Use GASPAR credentials
- Modify and publish
  - Edit docker/Dockerfile as needed
  - Use docker/publish.sh to build and push
  - Important: Rename your image (e.g., mlo/<your-username>:tag) to avoid overwriting the default
Example workflow:
docker build . -t <your-tag>
docker tag <your-tag> ic-registry.epfl.ch/mlo/<your-tag>
docker push ic-registry.epfl.ch/mlo/<your-tag>
See also Matteo's custom Docker example.
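Once the image is pushed, you can point a job at it by setting RUNAI_IMAGE in .env or by passing the -i flag (the tag below is a placeholder):
# Submit a job that uses your custom image
python csub.py -n custom-image-test -g 1 -i ic-registry.epfl.ch/mlo/<your-tag>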
To access services running in your pod (e.g., Jupyter):
kubectl get pods
kubectl port-forward <pod_name> 8888:8888
Then access at http://localhost:8888
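As a concrete example, you could launch Jupyter inside the pod first and then forward the port from your laptop as shown above; this sketch assumes you install JupyterLab into your own environment:
# Inside the pod: install and start JupyterLab on port 8888
uv pip install jupyterlab
jupyter lab --ip 0.0.0.0 --port 8888 --no-browser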
For multi-node training across several compute nodes, see the detailed guide:
- Documentation: docs/multinode.md
- Official docs: https://docs.run.ai/v2.13/Researcher/cli-reference/runai-submit-dist-pytorch/
├── csub.py # Job submission wrapper (wraps runai submit)
├── utils.py # Python helpers for csub.py
├── user.env.example # Template for .env (copy and configure)
├── docker/
│ ├── Dockerfile # uv-enabled base image (RCP template)
│ ├── entrypoint.sh # Runtime bootstrap script
│ └── publish.sh # Build and push Docker images
├── kubeconfig.yaml # Kubeconfig template for ~/.kube/config
└── docs/
├── faq.md # Frequently asked questions
├── managing_workflows.md # Day-to-day operations guide
├── README.md # Architecture deep dive
├── runai_cli.md # Alternative run:ai CLI workflows
├── multinode.md # Multi-node/distributed training
└── how_to_use_k8s_secret.md # Kubernetes secrets reference
For technical details about the Docker image, entrypoint script, environment variables, and secret management:
Read the architecture explainer: docs/README.md
Topics covered:
- Runtime environment and entrypoint
- Permissions model and shared caches
- uv-based Python workflow
- Images and publishing
- Secrets, SSH, and Kubernetes integration
RCP Resources
run:ai Documentation
Related Resources
- Compute and Storage @ CLAIRE – Similar setup by colleagues
MLO Cluster Repositories (OUTDATED)
These repositories contain shared tooling and infrastructure (by previous PhD students). Contact Martin for editor access. They are outdated and not maintained anymore.
- epfml/epfml-utils – Python package for shared tooling (pip install epfml-utils)
- epfml/mlocluster-setup – Base images and setup for semi-permanent machines