Starred repositories
verl: Volcano Engine Reinforcement Learning for LLMs
Fully open reproduction of DeepSeek-R1
An easy-to-use, scalable, and high-performance agentic RL framework based on Ray (PPO & DAPO & REINFORCE++ & TIS & vLLM & Ray & Async RL)
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
DeepEP: an efficient expert-parallel communication library
Accessible large language models via k-bit quantization for PyTorch. (A hedged 4-bit loading sketch follows this list.)
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance…
Zero Bubble Pipeline Parallelism
Use PEFT or full-parameter training for CPT/SFT/DPO/GRPO of 600+ LLMs (Qwen3.5, DeepSeek-R1, GLM-5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, Phi4, ...)…
tiktoken is a fast BPE tokeniser for use with OpenAI's models. (A minimal usage sketch follows this list.)
Prometheus exporter that mines /proc to report on selected processes
The official repo of Pai-Megatron-Patch for large-scale LLM & VLM training, developed by Alibaba Cloud.
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Beijing Telecom IPTV playlist: bj-telecom-iptv.m3u
Official implementation of "Towards Efficient Visual Adaption via Structural Re-parameterization".
A tool for bandwidth measurements on NVIDIA GPUs.
Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models". (A plain-PyTorch sketch of the low-rank update follows this list.)
Optimized primitives for collective multi-GPU communication. (An all-reduce sketch follows this list.)
A GPU performance profiling tool for PyTorch models
Example models using DeepSpeed
Chinese-LLaMA 1 & 2 and Chinese-Falcon base models; ChatFlow Chinese dialogue model; Chinese OpenLLaMA model; NLP pre-training / instruction fine-tuning datasets
Code and documentation to train Stanford's Alpaca models, and generate the data.
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
Ongoing research training transformer language models at scale, including: BERT & GPT-2
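
For the bitsandbytes entry above, a minimal sketch of loading a causal LM in 4-bit NF4 through the Hugging Face transformers integration. The model id is a placeholder, and the flags shown are the commonly documented ones, not an exhaustive configuration.

```python
# Hedged sketch: 4-bit NF4 quantized loading via transformers' bitsandbytes
# integration. "some/model-id" is a placeholder, not a real checkpoint.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize linear layers to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "some/model-id",                        # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```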
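For the tiktoken entry, an encode/decode round-trip with a named BPE encoding; the choice of "cl100k_base" is an assumption and should match the target model.

```python
# Minimal tiktoken usage: encode a string to token ids and decode it back.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding; model-dependent
tokens = enc.encode("tiktoken is a fast BPE tokeniser.")
print(tokens)              # list of integer token ids
print(enc.decode(tokens))  # round-trips to the original string
```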
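For the loralib entry, a plain-PyTorch sketch of the core LoRA idea: freeze the pretrained weight W and learn a low-rank update scaled by alpha/r. This is not loralib's API, just the technique it implements.

```python
# LoRA sketch (not loralib's API): y = W x + (alpha / r) * B A x,
# with W frozen and only the low-rank factors A and B trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():          # freeze pretrained weight/bias
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero init: starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768, r=8)
out = layer(torch.randn(2, 768))  # gradients flow only through A and B
```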
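For the NCCL entry, rather than the raw C API, a short torch.distributed sketch that exercises NCCL's all-reduce collective; it assumes a torchrun launch so the RANK/WORLD_SIZE/LOCAL_RANK environment variables are set.

```python
# NCCL all-reduce via torch.distributed; launch with e.g.
#   torchrun --nproc_per_node=2 allreduce_demo.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")           # NCCL backend for GPU collectives
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

t = torch.ones(4, device="cuda") * dist.get_rank()
dist.all_reduce(t, op=dist.ReduceOp.SUM)          # every rank now holds the sum
print(f"rank {dist.get_rank()}: {t}")
dist.destroy_process_group()
```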