This repository contains the basic steps to start running scripts and notebooks on the EPFL RCP cluster. We provide scripts that make your life easier by automating most of the boilerplate. The setup is loosely based on infrastructure from TML/CLAIRE and earlier scripts by Atli.
The RCP cluster provides:
- GPUs: A100 (40GB/80GB), H100 (80GB), H200 (141GB), V100
- Stack: Docker (containers), Kubernetes (orchestration), run:ai (scheduler)
- FAQ: Check the frequently asked questions page
- Slack: Reach out on the #-cluster or #-it channels
- Resources: See quick links below
Tip
If you have little prior experience with ML workflows, the setup below may seem daunting at first. You can copy‑paste the commands in order; the scripts are designed to hide most of the complexity. The only requirement is that you have a basic understanding of how to use a terminal and git.
Caution
Using the cluster incurs costs. Please be mindful of the resources you use, and do not forget to stop your jobs when they are no longer needed!
Content overview:
- Quick Start
- Setup Guide
- Using VS Code
- Recommended Workflow
- csub.py Usage and Arguments
- Advanced Topics
- Reference
Tip
TL;DR – After completing the setup, interaction with the cluster looks like this:
# Start an interactive job with 1 GPU
python csub.py -n sandbox
# Connect to your job
runai exec sandbox -it -- zsh
# Run your code
cd /mloscratch/homes/<your_username>
python main.py
# Or start a training job in one command
python csub.py -n experiment --train --command "cd /mloscratch/homes/<your_username>/<your_code>; python main.py"
Important
Network requirement: You must be on the EPFL WiFi or connected to the VPN. The cluster is not accessible otherwise.
1. Request cluster access
Ask Jennifer or Martin to add you to the runai-mlo group: https://groups.epfl.ch/
2. Prepare your code repository
While waiting for access, create a GitHub repository for your code. This is best practice regardless of our cluster setup.
3. Set up experiment tracking (optional)
- Weights & Biases: Create an account at wandb.ai and get your API key
- Hugging Face: Create an account at huggingface.co and get your token (if using their models)
Important
Platform note: The setup below was tested on macOS with Apple Silicon. For other systems, adapt the commands accordingly.
- Linux: Replace darwin/arm64 with linux/amd64 in URLs
- Windows: Use WSL (Windows Subsystem for Linux)
Download and install kubectl v1.30.11 (matching the cluster version):
# macOS with Apple Silicon
curl -LO "https://dl.k8s.io/release/v1.30.11/bin/darwin/arm64/kubectl"
# Linux (AMD64)
# curl -LO "https://dl.k8s.io/release/v1.30.11/bin/linux/amd64/kubectl"
# Install
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl
sudo chown root: /usr/local/bin/kubectl
See https://kubernetes.io/docs/tasks/tools/install-kubectl/ for other platforms.
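You can optionally verify the installation; this assumes kubectl is now on your PATH:
# Should print the client version, e.g. v1.30.11
kubectl version --client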
Download the kube config file to ~/.kube/config:
mkdir -p ~/.kube
curl -o ~/.kube/config https://raw.githubusercontent.com/epfml/getting-started/main/kubeconfig.yaml
Download and install the run:ai CLI:
# macOS with Apple Silicon
wget --content-disposition https://rcp-caas-prod.rcp.epfl.ch/cli/darwin
# Linux (replace 'darwin' with 'linux')
# wget --content-disposition https://rcp-caas-prod.rcp.epfl.ch/cli/linux
# Install
chmod +x ./runai
sudo mv ./runai /usr/local/bin/runai
sudo chown root: /usr/local/bin/runai
Then log in to run:ai and configure your project:
runai login
# List available projects
runai list projects
# Set your default project
runai config project mlo-$GASPAR_USERNAME
To verify cluster access:
kubectl get nodes
You should see the RCP cluster nodes listed.
This setup keeps all personal configuration and secrets in a local .env file (never committed to git).
git clone https://github.com/epfml/getting-started.git
cd getting-started
cp user.env.example .env
Open .env in an editor and configure:
| Variable | Description | Example |
|---|---|---|
| `LDAP_USERNAME` | Your EPFL/Gaspar username | `jdoe` |
| `LDAP_UID` | Your numeric LDAP user ID | `123456` |
| `LDAP_GROUPNAME` | Your LDAP group name (for MLO: `MLO-unit`) | `MLO-unit` |
| `LDAP_GID` | Your LDAP group ID (for MLO: 83070) | `83070` |
| `RUNAI_PROJECT` | Your project | `mlo-<username>` |
| `K8S_NAMESPACE` | Your namespace | `runai-mlo-<username>` |
| `RUNAI_IMAGE` | Docker image | `ic-registry.epfl.ch/mlo/mlo-base:uv1` |
| `RUNAI_SECRET_NAME` | Secret name | `runai-mlo-<username>-env` |
| `WORKING_DIR` | Working directory | `/mloscratch/homes/<username>` |
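For reference, a filled-in .env might look like the sketch below; every value is a placeholder derived from the examples above, so substitute your own:
# Example .env (placeholder values – replace with your own)
LDAP_USERNAME=jdoe
LDAP_UID=123456
LDAP_GROUPNAME=MLO-unit
LDAP_GID=83070
RUNAI_PROJECT=mlo-jdoe
K8S_NAMESPACE=runai-mlo-jdoe
RUNAI_IMAGE=ic-registry.epfl.ch/mlo/mlo-base:uv1
RUNAI_SECRET_NAME=runai-mlo-jdoe-env
WORKING_DIR=/mloscratch/homes/jdoe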
To ensure correct file permissions:
# SSH into HaaS machine (use your Gaspar password)
ssh <your_gaspar_username>@haas001.rcp.epfl.ch
# Get your UID
id
Copy the number after uid= (e.g., uid=123456) into LDAP_UID in your .env file.
Optionally configure in .env:
- WANDB_API_KEY – Weights & Biases API key
- HF_TOKEN – Hugging Face token
- GIT_USER_NAME / GIT_USER_EMAIL – Git identity for commits
- GitHub SSH keys (auto-loaded from ~/.ssh/github if empty): GITHUB_SSH_KEY_PATH / GITHUB_SSH_PUBLIC_KEY_PATH to override the default paths
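For example, the optional entries might look like this (the values shown are placeholders):
# Optional .env entries (placeholder values)
WANDB_API_KEY=<your_wandb_api_key>
HF_TOKEN=<your_hf_token>
GIT_USER_NAME=<your_name>
GIT_USER_EMAIL=<your_email>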
The secret is automatically synced when starting a job. To manually sync:
python csub.py --sync-secret-only
Now start your first interactive job:
python csub.py -n sandbox
This can take a few minutes. Monitor the status:
# List all jobs
runai list
# Check specific job status
runai describe job sandbox
Once the status shows Running:
runai exec sandbox -it -- zsh
You should now be inside a terminal on the cluster! 🎉
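If your job requested a GPU (e.g., with -g 1), you can quickly check that it is visible inside the pod; this assumes the NVIDIA tools are available in the container, as they are for standard GPU images:
# Inside the pod: list the allocated GPU(s)
nvidia-smi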
Inside the pod, clone your code into your scratch home folder:
cd /mloscratch/homes/<your_username>
git clone https://github.com/<your_username>/<your_repo>.git
cd <your_repo>
The default image includes uv as the recommended package manager (pip also works):
# Create and activate virtual environment
uv venv .venv
source .venv/bin/activate
# Install dependencies
uv pip install -r requirements.txt
Then run your code:
python main.py
If you configured WANDB_API_KEY or HF_TOKEN in .env, authentication should work automatically.
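To double-check that the secrets were injected, you can probe the environment variables from inside the pod:
# Prints a message only if the corresponding variable is set
echo "${WANDB_API_KEY:+WANDB_API_KEY is set}"
echo "${HF_TOKEN:+HF_TOKEN is set}"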
For remote development on the cluster:
- Install extensions
- Attach to your pod:
  - Navigate to: Kubernetes → rcp-cluster → Workloads → Pods
  - Right-click your pod → Attach Visual Studio Code
  - Open /mloscratch/homes/<your_username> in the remote session
For detailed instructions, see the Managing Workflows guide.
Tip
Development cycle:
- Develop code locally or on the cluster (using VS Code)
- Push changes to GitHub
- Run experiments on the cluster via runai exec sandbox -it -- zsh
- Keep code and experiments organized and reproducible
Important
Critical reminders:
- Pods can be killed anytime – Implement checkpointing and recovery (see the sketch below)
- Store files on scratch – Everything in ~/ is lost when pods restart
- Use /mloscratch/homes/<username> – Shell config and VS Code settings persist here
- Delete failed jobs – Run runai delete job <name> before restarting
- Background jobs – Use training mode: python csub.py -n exp --train --command "..."
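A minimal sketch of a resumable training submission, assuming your own train.py saves checkpoints and accepts a --resume flag (the flag name is illustrative, not part of this repo):
# Relaunch-friendly training job: the script itself must save and reload checkpoints
python csub.py -n exp --train --command \
  "cd /mloscratch/homes/<username>/<your_code> && python train.py --resume checkpoints/"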
Caution
Using the cluster incurs costs. Always stop your jobs when not in use!
For detailed workflow guidance, see the Managing Workflows guide.
The csub.py script is a thin wrapper around the run:ai CLI that simplifies job submission by:
- Reading configuration and secrets from .env
- Syncing Kubernetes secrets automatically
- Constructing and executing runai submit commands
python csub.py -n <job_name> -g <num_gpus> -t <time> --command "<cmd>" [--train]
Examples:
# CPU-only pod for development
python csub.py -n dev-cpu
# Interactive development pod with 1 GPU
python csub.py -n dev-gpu -g 1
# Training job with 4 A100 GPUs
python csub.py -n experiment --train -g 4 --command "cd /mloscratch/homes/user/code; python train.py"
# Use specific GPU type
python csub.py -n my-job -g 2 --node-type h100 --train --command "..."
# Dry run (see command without executing)
python csub.py -n test --dry --command "..."

| Argument | Description | Default |
|---|---|---|
| `-n, --name` | Job name | Auto-generated (username + timestamp) |
| `-g, --gpus` | Number of GPUs | `0` (CPU-only) |
| `-t, --time` | Maximum runtime (e.g., `12h`, `2d6h30m`) | `12h` |
| `-c, --command` | Command to run | `sleep <duration>` |
| `--train` | Submit as training workload (non-interactive) | Interactive mode |
| `-i, --image` | Docker image | From `RUNAI_IMAGE` in `.env` |
| `--node-type` | GPU type: `v100`, `h100`, `h200`, `default`, `a100-40g` | `default` (A100) |
| `--cpus` | Number of CPUs | Platform default |
| `--memory` | CPU memory request | Platform default |
| `-p, --port` | Expose container port (for Jupyter, etc.) | None |
| `--large-shm` | Request a larger `/dev/shm` | False |
| `--host-ipc` | Share host IPC namespace | False |
| `--backofflimit` | Retries before marking a training job failed | 0 |
| Argument | Description |
|---|---|
| `--sync-secret-only` | Only sync `.env` to the Kubernetes secret, don't submit a job |
| `--skip-secret-sync` | Don't sync the secret before submission |
| `--secret-name` | Override `RUNAI_SECRET_NAME` from `.env` |
| `--env-file` | Path to the `.env` file |
| Argument | Description |
|---|---|
| `--uid` | Override `LDAP_UID` from `.env` |
| `--gid` | Override `LDAP_GID` from `.env` |
| `--pvc` | Override `SCRATCH_PVC` from `.env` |
| `--dry` | Print the command without executing |
After submitting, csub.py prints useful follow-up commands:
runai describe job <name> # Check job status
runai logs <name> # View logs
runai exec <name> -it -- zsh # Connect to pod
runai delete job <name>      # Delete job
Run python csub.py -h for the complete help text.
For detailed guides on day-to-day operations, see the Managing Workflows guide:
- Pod management – Commands to list, describe, delete jobs
- Important workflow notes – Job types, GPU selection, best practices
- HaaS machine – File transfer between storage systems
- File management – Understanding storage (mloscratch, mlodata1, mloraw1)
- Run:ai CLI directly: See docs/runai_cli.md for using run:ai without csub.py
- Custom Docker images: See Creating Custom Images
- Distributed training: See docs/multinode.md for multi-node jobs
If you need custom dependencies:
- Get registry access
  - Login at https://ic-registry.epfl.ch/ and verify you see the MLO project
  - The runai-mlo group should already have access
- Install Docker
  brew install --cask docker   # macOS
  If you get "Cannot connect to the Docker daemon", run the Docker Desktop GUI first.
- Login to registry
  docker login ic-registry.epfl.ch   # Use GASPAR credentials
- Modify and publish
  - Edit docker/Dockerfile as needed
  - Use docker/publish.sh to build and push
  - Important: Rename your image (e.g., mlo/<your-username>:tag) to avoid overwriting the default
Example workflow:
docker build . -t <your-tag>
docker tag <your-tag> ic-registry.epfl.ch/mlo/<your-tag>
docker push ic-registry.epfl.ch/mlo/<your-tag>
See also Matteo's custom Docker example.
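Once the image is pushed, you can point a job at it by setting RUNAI_IMAGE in .env or by passing the -i flag (the tag below is a placeholder):
# Submit a job that uses your custom image
python csub.py -n custom-image-test -g 1 -i ic-registry.epfl.ch/mlo/<your-tag>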
To access services running in your pod (e.g., Jupyter):
kubectl get pods
kubectl port-forward <pod_name> 8888:8888
Then access at http://localhost:8888
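As a concrete example, you could launch Jupyter inside the pod first and then forward the port from your laptop as shown above; this sketch assumes you install JupyterLab into your own environment:
# Inside the pod: install and start JupyterLab on port 8888
uv pip install jupyterlab
jupyter lab --ip 0.0.0.0 --port 8888 --no-browser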
For multi-node training across several compute nodes, see the detailed guide:
- Documentation: docs/multinode.md
- Official docs: https://docs.run.ai/v2.13/Researcher/cli-reference/runai-submit-dist-pytorch/
├── csub.py # Job submission wrapper (wraps runai submit)
├── utils.py # Python helpers for csub.py
├── user.env.example # Template for .env (copy and configure)
├── docker/
│ ├── Dockerfile # uv-enabled base image (RCP template)
│ ├── entrypoint.sh # Runtime bootstrap script
│ └── publish.sh # Build and push Docker images
├── kubeconfig.yaml # Kubeconfig template for ~/.kube/config
└── docs/
├── faq.md # Frequently asked questions
├── managing_workflows.md # Day-to-day operations guide
├── README.md # Architecture deep dive
├── runai_cli.md # Alternative run:ai CLI workflows
├── multinode.md # Multi-node/distributed training
└── how_to_use_k8s_secret.md # Kubernetes secrets reference
For technical details about the Docker image, entrypoint script, environment variables, and secret management:
Read the architecture explainer: docs/README.md
Topics covered:
- Runtime environment and entrypoint
- Permissions model and shared caches
- uv-based Python workflow
- Images and publishing
- Secrets, SSH, and Kubernetes integration
RCP Resources
run:ai Documentation
Related Resources
- Compute and Storage @ CLAIRE – Similar setup by colleagues
MLO Cluster Repositories (OUTDATED)
These repositories contain shared tooling and infrastructure (by previous PhD students). Contact Martin for editor access. They are outdated and not maintained anymore.
- epfml/epfml-utils – Python package for shared tooling (pip install epfml-utils)
- epfml/mlocluster-setup – Base images and setup for semi-permanent machines