| Llama 7b | Mistral 7b | CodeLlama 34b | Llama 7b Kaggle 2x T4 |
|---|---|---|---|
| 2.2x faster, -43% VRAM | 2.2x faster, -62% VRAM | 1.9x faster, -27% VRAM | 5.5x faster, -44% VRAM |
| Colab Alpaca example + inference | Colab T4 example | A100 example | Kaggle Alpaca example |
| Colab A100 example | Colab A100 example | (59 more examples if you scroll down) | Kaggle Slim Orca |
- Supports Llama (7B, 13B, 70B), Yi (6B, 34B), Mistral (7B), TinyLlama, CodeLlama (7B, 13B, 34B), and all Llama / Mistral derived architectures!
- All kernels written in OpenAI's Triton language.
- 0% loss in accuracy - no approximation methods - all exact.
- No hardware changes necessary. Supports NVIDIA GPUs from 2018 onward (minimum CUDA Compute Capability 7.0: V100, T4, Titan V, RTX 20/30/40 series, A100, H100, L40, etc.). Check your GPU.
- NEW! Works on Linux and Windows via WSL.
- NEW! Experimental support for DPO (Direct Preference Optimization)!
- Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
- The open-source version trains 5x faster, or check out the Unsloth Pro and Max codepaths for up to 30x faster training!
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 1.04x | 1.98x | 2.48x | 5.32x | 15.64x |
| LAION Chip2 | 1x | 0.92x | 1.61x | 1.84x | 7.05x | 20.73x |
| OASST | 1x | 1.19x | 2.17x | 2.66x | 5.04x | 14.83x |
| Slim Orca | 1x | 1.18x | 2.22x | 2.64x | 5.04x | 14.82x |
Join our Discord!
If you trained a model with Unsloth, we made a cool sticker!!
Unsloth currently only supports Linux distros and PyTorch 2.1.
conda install cudatoolkit xformers bitsandbytes pytorch pytorch-cuda=12.1 \
-c pytorch -c nvidia -c xformers -c conda-forge -y
pip install "unsloth[kaggle] @ git+https://github.com/unslothai/unsloth.git"- Find your CUDA version via
import torch; torch.version.cuda- We only support Pytorch 2.1 (2.1.1 bugs out for now): You can update Pytorch via Pip (interchange cu121 / cu118)
pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.0 triton \
--index-url https://download.pytorch.org/whl/cu121
- Select either cu118 for CUDA 11.8 or cu121 for CUDA 12.1. If you have an RTX 3060 or higher (A100, H100, etc.), use the "ampere" path:
pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118_ampere] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121_ampere] @ git+https://github.com/unslothai/unsloth.git"Change cu121 to cu118 for CUDA version 11.8 or 12.1. Go to https://pytorch.org/ to learn more.
- If you get errors, try the below first, then go back to step 1:
pip install --upgrade pip
We support Hugging Face's TRL, Trainer, Seq2SeqTrainer, or even plain PyTorch code!
from unsloth import FastLlamaModel, FastMistralModel
import torch
max_seq_length = 2048 # Can change to any number <= 4096
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
# Load Llama model
model, tokenizer = FastLlamaModel.from_pretrained(
model_name = "unsloth/llama-2-7b", # Supports any llama model eg meta-llama/Llama-2-7b-hf
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
# token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
# Do model patching and add fast LoRA weights
model = FastLlamaModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Currently only supports dropout = 0
bias = "none", # Currently only supports bias = "none"
use_gradient_checkpointing = True,
random_state = 3407,
max_seq_length = max_seq_length,
)
trainer = .... # Use Hugging Face's Trainer and dataset loading (TRL, transformers, etc.)
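As one way to fill in the `trainer = ....` placeholder, here is a rough sketch using TRL's `SFTTrainer`. The dataset choice and hyperparameters are illustrative, not a prescribed recipe, and exact `SFTTrainer` argument names can vary across TRL versions:

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Illustrative dataset choice; any dataset with a formatted text column works
dataset = load_dataset("tatsu-lab/alpaca", split = "train")

trainer = SFTTrainer(
    model = model,                      # the Unsloth-patched model from above
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",        # column containing the formatted prompt + response
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        output_dir = "outputs",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        learning_rate = 2e-4,
        num_train_epochs = 1,
        logging_steps = 10,
        optim = "adamw_8bit",           # may be spelled "adamw_bnb_8bit" on some transformers versions
        seed = 3407,
    ),
)
trainer.train()
```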
152334H hacked Unsloth to work with DPO via TRL!
- Hack the model's `config.json` to be a llama model. Example gist.
- Use Unsloth for DPO for both base and reference models. Example gist. A rough sketch is shown after this list.
- Support Mixtral.
- Non-Llama models are not supported yet - we plan to add them in the future.
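For reference, a rough sketch of what the DPO path can look like with TRL's `DPOTrainer`, assuming a preference dataset with `prompt` / `chosen` / `rejected` columns (a placeholder here, not a specific dataset) and that both models are loaded through Unsloth as the gists above describe; argument names may differ between TRL versions, and the hyperparameters are illustrative:

```python
from transformers import TrainingArguments
from trl import DPOTrainer
from unsloth import FastLlamaModel

max_seq_length = 2048

# Load BOTH the policy and the reference model through Unsloth (see the gists above).
# Optionally add LoRA adapters to `model` via FastLlamaModel.get_peft_model as shown earlier.
model, tokenizer = FastLlamaModel.from_pretrained(
    model_name = "unsloth/llama-2-7b", max_seq_length = max_seq_length, load_in_4bit = True)
ref_model, _ = FastLlamaModel.from_pretrained(
    model_name = "unsloth/llama-2-7b", max_seq_length = max_seq_length, load_in_4bit = True)

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = ref_model,
    beta = 0.1,                          # DPO temperature
    train_dataset = preference_dataset,  # placeholder: your datasets.Dataset with
                                         # "prompt", "chosen" and "rejected" columns
    tokenizer = tokenizer,
    max_length = max_seq_length,
    max_prompt_length = 512,
    args = TrainingArguments(
        output_dir = "dpo_outputs",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        learning_rate = 5e-6,
        num_train_epochs = 1,
    ),
)
dpo_trainer.train()
```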
Time taken for 1 epoch
One Tesla T4 on Google Colab
bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10 (see the TrainingArguments sketch below the memory table)
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
|---|---|---|---|---|---|
| Huggingface | 1 T4 | 23h 15m | 56h 28m | 8h 38m | 391h 41m |
| Unsloth Open | 1 T4 | 13h 7m (1.8x) | 31h 47m (1.8x) | 4h 27m (1.9x) | 240h 4m (1.6x) |
| Unsloth Pro | 1 T4 | 3h 6m (7.5x) | 5h 17m (10.7x) | 1h 7m (7.7x) | 59h 53m (6.5x) |
| Unsloth Max | 1 T4 | 2h 39m (8.8x) | 4h 31m (12.5x) | 0h 58m (8.9x) | 51h 30m (7.6x) |
Peak Memory Usage
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) |
|---|---|---|---|---|---|
| Huggingface | 1 T4 | 7.3GB | 5.9GB | 14.0GB | 13.3GB |
| Unsloth Open | 1 T4 | 6.8GB | 5.7GB | 7.8GB | 7.7GB |
| Unsloth Pro | 1 T4 | 6.4GB | 6.4GB | 6.4GB | 6.4GB |
| Unsloth Max | 1 T4 | 11.4GB | 12.4GB | 11.9GB | 14.4GB |
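For anyone reproducing these runs, the settings listed above map roughly to the following `transformers.TrainingArguments`. Treat this as a sketch: `schedule_steps = 10` is interpreted here as warmup steps (an assumption), `output_dir` is a placeholder, and the `adamw_8bit` optimizer name may be spelled `adamw_bnb_8bit` depending on your transformers version:

```python
from transformers import TrainingArguments

# bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047,
# lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10
benchmark_args = TrainingArguments(
    output_dir = "outputs",            # placeholder
    per_device_train_batch_size = 2,   # bsz
    gradient_accumulation_steps = 4,   # ga
    max_grad_norm = 0.3,
    num_train_epochs = 1,
    seed = 3047,
    learning_rate = 2e-4,              # lr
    weight_decay = 0.01,               # wd
    optim = "adamw_8bit",              # may be "adamw_bnb_8bit" on some transformers versions
    lr_scheduler_type = "linear",      # schedule
    warmup_steps = 10,                 # schedule_steps (assumed to mean warmup steps)
)
```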
Time taken for 1 epoch
Two Tesla T4s on Kaggle
bsz = 2, ga = 4, max_grad_norm = 0.3, num_train_epochs = 1, seed = 3047, lr = 2e-4, wd = 0.01, optim = "adamw_8bit", schedule = "linear", schedule_steps = 10
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
|---|---|---|---|---|---|
| Huggingface | 2 T4 | 84h 47m | 163h 48m | 30h 51m | 1301h 24m * |
| Unsloth Pro | 2 T4 | 3h 20m (25.4x) | 5h 43m (28.7x) | 1h 12m (25.7x) | 71h 40m (18.1x) * |
| Unsloth Max | 2 T4 | 3h 4m (27.6x) | 5h 14m (31.3x) | 1h 6m (28.1x) | 54h 20m (23.9x) * |
Peak Memory Usage on a Multi GPU System (2 GPUs)
Each cell shows peak memory on GPU 0 / GPU 1.
| System | GPU | Alpaca (52K) | LAION OIG (210K) | Open Assistant (10K) | SlimOrca (518K) * |
|---|---|---|---|---|---|
| Huggingface | 2 T4 | 8.4GB / 6GB | 7.2GB / 5.3GB | 14.3GB / 6.6GB | 10.9GB / 5.9GB * |
| Unsloth Pro | 2 T4 | 7.7GB / 4.9GB | 7.5GB / 4.9GB | 8.5GB / 4.9GB | 6.2GB / 4.7GB * |
| Unsloth Max | 2 T4 | 10.5GB / 5GB | 10.6GB / 5GB | 10.6GB / 5GB | 10.5GB / 5GB * |
- Slim Orca uses `bsz=1` for all benchmarks since `bsz=2` OOMs. We can handle `bsz=2`, but we benchmark it with `bsz=1` for consistency.
Click "Code" for a fully reproducible example. "Unsloth Equal" is a preview of our PRO version, with code stripped out. All settings and the loss curve remains identical.
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 1.04x | 1.98x | 2.48x | 5.32x | 15.64x |
| code | Code | Code | Code | Code | ||
| seconds | 1040 | 1001 | 525 | 419 | 196 | 67 |
| memory MB | 18235 | 15365 | 9631 | 8525 | ||
| % saved | 15.74 | 47.18 | 53.25 |
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| LAION Chip2 | 1x | 0.92x | 1.61x | 1.84x | 7.05x | 20.73x |
| code | Code | Code | Code | Code | ||
| seconds | 581 | 631 | 361 | 315 | 82 | 28 |
| memory MB | 7763 | 8047 | 7763 | 6441 | ||
| % saved | -3.66 | 0.00 | 17.03 |
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| OASST | 1x | 1.19x | 2.17x | 2.66x | 5.04x | 14.83x |
| code | Code | Code | Code | Code | ||
| seconds | 1852 | 1558 | 852 | 696 | 367 | 125 |
| memory MB | 26431 | 16565 | 12267 | 11223 | ||
| % saved | 37.33 | 53.59 | 57.54 |
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Slim Orca | 1x | 1.18x | 2.22x | 2.64x | 5.04x | 14.82x |
| code | Code | Code | Code | Code | ||
| seconds | 1824 | 1545 | 821 | 691 | 362 | 123 |
| memory MB | 24557 | 15681 | 10595 | 9007 | ||
| % saved | 36.14 | 56.86 | 63.32 |
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Mistral 7B Slim Orca | 1x | 1.15x | 2.15x | 2.53x | 4.61x | 13.69x |
| code | Code | Code | Code | Code | ||
| seconds | 1813 | 1571 | 842 | 718 | 393 | 132 |
| memory MB | 32853 | 19385 | 12465 | 10271 | ||
| % saved | 40.99 | 62.06 | 68.74 |
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Code Llama 34B | OOM ❌ | 0.99x | 1.87x | 2.61x | 4.27x | 12.82x |
| code | Code | Code | Code | Code | ||
| seconds | 1953 | 1982 | 1043 | 748 | 458 | 152 |
| memory MB | 40000 | 33217 | 27413 | 22161 | ||
| % saved | 16.96 | 31.47 | 44.60 |
| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 1.09x | 1.69x | 1.79x | 2.93x | 8.3x |
| code | Code | Code | Code | Code | ||
| seconds | 1599 | 1468 | 942 | 894 | 545 | 193 |
| memory MB | 7199 | 7059 | 6459 | 5443 | ||
| % saved | 1.94 | 10.28 | 24.39 |
| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| LAION Chip2 | 1x | 0.99x | 1.80x | 1.75x | 4.15x | 11.75x |
| code | Code | Code | Code | Code | ||
| seconds | 952 | 955 | 529 | 543 | 229 | 81 |
| memory MB | 6037 | 6033 | 5797 | 4855 | ||
| % saved | 0.07 | 3.98 | 19.58 |
| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| OASST | 1x | 1.19x | 1.95x | 1.86x | 2.58x | 7.3x |
| code | Code | Code | Code | Code | ||
| seconds | 2640 | 2222 | 1355 | 1421 | 1024 | 362 |
| memory MB | 14827 | 10391 | 8413 | 7031 | ||
| % saved | 29.92 | 43.26 | 52.58 |
| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Slim Orca | 1x | 1.21x | 1.77x | 1.85x | 2.71x | 7.67x |
| code | Code | Code | Code | Code | ||
| seconds | 2735 | 2262 | 1545 | 1478 | 1009 | 356 |
| memory MB | 13933 | 10489 | 7661 | 6563 | ||
| % saved | 24.72 | 45.02 | 52.90 |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Alpaca | 1x | 0.99x | 4.95x | 4.44x | 7.28x | 20.61x |
| code | Code | Code | Code | |||
| seconds | 9882 | 9946 | 1996 | 2227 | 1357 | 480 |
| memory MB | 9176 | 9128 | 6904 | 6782 | ||
| % saved | 0.52 | 24.76 | 26.09 |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| LAION Chip2 | 1x | 1.12x | 5.28x | 4.21x | 10.01x | 28.32x |
| code | Code | Code | Code | |||
| seconds | 5418 | 4854 | 1027 | 1286 | 541 | 191 |
| memory MB | 7316 | 7316 | 5732 | 5934 | ||
| % saved | 0.00 | 21.65 | 18.89 |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| OASST (bsz=1) | 1x | 1.14x | 5.56x | 5.09x | 5.64x | 15.97x |
| code | Code | Code | Code | |||
| seconds | 4503 | 3955 | 811 | 885 | 798 | 282 |
| memory MB | 11896 | 11628 | 6616 | 7105 | ||
| % saved | 2.25 | 44.38 | 40.27 |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Slim Orca (bsz=1) | 1x | 0.97x | 5.54x | 4.68x | 6.88x | 19.46x |
| code | Code | Code | Code | |||
| seconds | 4042 | 4158 | 729 | 863 | 588 | 208 |
| memory MB | 11010 | 11042 | 6492 | 7410 | ||
| % saved | -0.29 | 41.04 | 32.70 |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| OASST (bsz=2) | OOM ❌ | OOM ❌ | ✓ | ✓ | ✓ | ✓ |
| code | Code | Code | Code | |||
| seconds | OOM | OOM | 2719 | 3391 | 2794 | 987 |
| memory MB | OOM | OOM | 8134 | 9600 | ||
| % saved | OOM | OOM |
| 2 T4 DDP | Hugging Face | Flash Attention | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
|---|---|---|---|---|---|---|
| Slim Orca (bsz=2) | OOM ❌ | OOM ❌ | ✓ | ✓ | ✓ | ✓ |
| code | Code | Code | Code | |||
| seconds | OOM | OOM | 2990 | 3444 | 2351 | 831 |
| memory MB | OOM | OOM | 7594 | 8881 | ||
| % saved | OOM | OOM |
Manual autograd, Triton kernels etc. See our Benchmark Breakdown for more info!
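To give a flavor of what "all kernels written in OpenAI's Triton language" means, here is a generic element-wise Triton kernel - not taken from Unsloth's codebase, just an illustration of the programming model the kernels use:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, scale, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensor
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard against the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * scale, mask=mask)

def scale(x: torch.Tensor, s: float) -> torch.Tensor:
    # x must live on the GPU, e.g. x = torch.randn(10_000, device="cuda")
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)       # one program per 1024-element block
    scale_kernel[grid](x, out, n, s, BLOCK_SIZE=1024)
    return out
```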
- Sometimes `bitsandbytes` or `xformers` does not link properly. Try running:
!ldconfig /usr/lib64-nvidia
- Windows is not supported as of yet - we rely on xformers and Triton, so Unsloth will support Windows natively once both packages officially support it.
- If it doesn't install, try updating `pip`.
- RandomInternetPreson for confirming WSL support
- 152334H for experimental DPO support
- atgctg for syntax highlighting