Pinned
-
ellamind/inference-hive (Public)
inference-hive is a toolkit to run distributed LLM inference on SLURM clusters. Configure a few cluster, inference server, and data settings, and scale your inference workload across thousands of GPUs.
-
openreviewer (Public)
Generate high-quality peer reviews of machine learning and AI conference papers.
Python 3
-
SLURM PyTorch NCCL Multi-Node Test Script
A SLURM batch script that tests PyTorch's NCCL functionality across multiple GPU nodes. The script sets up a distributed PyTorch environment using torchrun and runs a test that verifies NCCL initialization, inter-process communication barriers, and proper cleanup. It includes diagnostic output for troubleshooting multi-node GPU communication issues in HPC environments.
#!/bin/bash
#SBATCH --job-name=pytorch-nccl-test
#SBATCH --partition=
#SBATCH --account=
#SBATCH --qos=
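The preview above cuts off before the test itself. As a rough illustration of the three checks the description names (NCCL initialization, a barrier, and cleanup), a worker launched by torchrun might look like the sketch below. This is not the gist's code: the function names `torchrun_env` and `main` are hypothetical, and the small all-reduce is an extra sanity check that the description does not mention. It relies on the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables that torchrun sets for each worker.

```python
import os


def torchrun_env(env=None):
    """Read the rank variables that torchrun exports for every worker."""
    env = os.environ if env is None else env
    return {
        "rank": int(env["RANK"]),
        "local_rank": int(env["LOCAL_RANK"]),
        "world_size": int(env["WORLD_SIZE"]),
    }


def main():
    # Imported here so the helper above stays usable on machines without PyTorch.
    import torch
    import torch.distributed as dist

    info = torchrun_env()
    # Bind this process to its own GPU before any NCCL call.
    torch.cuda.set_device(info["local_rank"])

    # 1. NCCL initialization; rendezvous uses MASTER_ADDR/MASTER_PORT from the environment.
    dist.init_process_group(backend="nccl")
    try:
        # 2. Barrier: no rank continues until every rank has arrived.
        dist.barrier()

        # 3. Assumed extra check: summing a tensor of ones across ranks gives world_size.
        t = torch.ones(1, device="cuda")
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        assert int(t.item()) == info["world_size"]
        print(f"rank {info['rank']}/{info['world_size']}: NCCL OK")
    finally:
        # 4. Cleanup, even if one of the checks above failed.
        dist.destroy_process_group()
```

To use it as a script, call `main()` at the bottom of the file and start it with torchrun from inside the SLURM job; a hang at the barrier usually points at a networking or rendezvous problem rather than at PyTorch itself.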
-
re-shard parquet dataset
from pathlib import Path

import pyarrow.parquet as pq
import pyarrow.dataset as pads
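Only the imports of this gist are visible, so the following is a guess at what re-sharding could look like, not the author's implementation. It reads a directory of Parquet files as one dataset and rewrites it as shards of a fixed row count. The names `shard_slices`, `reshard`, and the `shard-00000.parquet` naming scheme are hypothetical, and loading the whole dataset into memory is a simplification that would not suit very large data.

```python
from pathlib import Path


def shard_slices(total_rows, rows_per_shard):
    """Return (offset, length) pairs that cover total_rows in order."""
    if rows_per_shard <= 0:
        raise ValueError("rows_per_shard must be positive")
    return [
        (start, min(rows_per_shard, total_rows - start))
        for start in range(0, total_rows, rows_per_shard)
    ]


def reshard(src_dir, dst_dir, rows_per_shard):
    """Rewrite the Parquet files in src_dir as fixed-size shards in dst_dir."""
    # Imported lazily so shard_slices can be used without pyarrow installed.
    import pyarrow.dataset as pads
    import pyarrow.parquet as pq

    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    # Simplification: materialize the full dataset as a single in-memory table.
    table = pads.dataset(str(src_dir), format="parquet").to_table()
    slices = shard_slices(table.num_rows, rows_per_shard)
    for i, (offset, length) in enumerate(slices):
        # Table.slice is zero-copy; each slice becomes one output file.
        pq.write_table(table.slice(offset, length), str(dst / f"shard-{i:05d}.parquet"))
    return len(slices)
```

For example, `shard_slices(10, 4)` yields three shards of 4, 4, and 2 rows, so only the last shard is allowed to be short.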