This is a port of BlinkDL/RWKV-LM to ggerganov/ggml.
Besides the usual FP32, it supports FP16, quantized INT4, INT5 and INT8 inference. This project is focused on CPU, but cuBLAS is also supported.
This project provides a C library rwkv.h and a convenient Python wrapper for it. Additionally, it includes ReservoirRWKV, a Reservoir Computing implementation that uses RWKV as a reservoir layer and provides a ReservoirPy-compatible API.
RWKV is a large language model architecture. Unlike a Transformer with O(n^2) attention, RWKV needs only the state from the previous step to calculate logits. This makes RWKV very CPU-friendly at large context lengths.
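To make the cost difference concrete, here is a minimal sketch. The update rule is a toy exponential moving sum chosen for illustration only; it is not the RWKV formula, and both function names are hypothetical. The point is that the recurrent loop carries a fixed-size state, while causal attention scores a number of token pairs that grows quadratically.

```python
def run_recurrent(tokens, decay=0.9):
    """Process inputs one at a time, keeping only a fixed-size state."""
    state = 0.0  # constant-size state, independent of context length
    outputs = []
    for x in tokens:
        # Toy update rule (NOT the RWKV formula): decayed running sum.
        state = decay * state + x
        outputs.append(state)
    return outputs, state


def attention_pair_count(n):
    """Number of (query, key) pairs scored by causal self-attention over n tokens."""
    return n * (n + 1) // 2


outputs, final_state = run_recurrent([1.0, 2.0, 3.0])
print("per-step outputs:", outputs)
print(attention_pair_count(1024), "attention pairs vs", 1024, "recurrent steps")
```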
This project supports RWKV v4, v5, v6 and the latest v7 architectures.
Loading LoRA checkpoints in Blealtan's format is supported through the merge_lora_into_ggml.py script.
This project includes ReservoirRWKV, a novel implementation of Reservoir Computing that uses RWKV models as reservoir layers. This provides:
- Echo State Network functionality with RWKV as the reservoir (fixed weights)
- ReservoirPy-compatible API for easy migration and integration
- Efficient sequential processing leveraging RWKV's O(n) complexity
- Ridge regression readout layers for various prediction tasks
- Time series prediction, memory tasks, and classification support
See docs/RESERVOIR_COMPUTING.md for detailed documentation and examples.
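The idea behind an Echo State Network can be sketched in a few lines: the reservoir weights stay fixed, and only a linear readout is fitted with ridge regression. The sketch below is an assumption-laden toy (a single tanh unit and a one-feature readout without bias, with hypothetical names `reservoir_states` and `fit_ridge_readout`); it does not use ReservoirRWKV or any RWKV model.

```python
import math


def reservoir_states(inputs, w_in=0.5, w_rec=0.8):
    """Fixed (untrained) toy reservoir: one tanh unit driven by the input."""
    state = 0.0
    states = []
    for u in inputs:
        state = math.tanh(w_in * u + w_rec * state)
        states.append(state)
    return states


def fit_ridge_readout(states, targets, alpha=1e-4):
    """Closed-form ridge regression for one feature and no bias term.

    Minimizes sum((w * s - y)^2) + alpha * w^2, which gives
    w = sum(s * y) / (sum(s^2) + alpha).
    """
    numerator = sum(s * y for s, y in zip(states, targets))
    denominator = sum(s * s for s in states) + alpha
    return numerator / denominator


inputs = [0.1, 0.4, -0.2, 0.7, 0.3]
targets = [0.0, 0.1, 0.4, -0.2, 0.7]  # toy task: recall the previous input
states = reservoir_states(inputs)
weight = fit_ridge_readout(states, targets)
print("readout weight:", weight)
print("predictions:", [weight * s for s in states])
```

With more than one reservoir unit the same readout becomes the usual matrix form, w = (SᵀS + αI)⁻¹Sᵀy, where S stacks the reservoir states row by row.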
If you use rwkv.cpp for anything serious, please test all available formats for perplexity and latency on a representative dataset, and decide which trade-off is best for you.
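As a rough guide to what such a test computes, the sketch below derives perplexity from per-step logits and the tokens that actually followed. It is a generic definition written with the standard library only; the function names are illustrative and are not part of rwkv.cpp.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def perplexity(logits_per_step, target_tokens):
    """exp of the mean negative log-likelihood of the target tokens."""
    total_nll = 0.0
    for logits, target in zip(logits_per_step, target_tokens):
        total_nll -= log_softmax(logits)[target]
    return math.exp(total_nll / len(target_tokens))


# A model that is uniformly unsure over 4 tokens has perplexity 4.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
print(perplexity(uniform, [0, 1, 2]))
```

Lower is better; comparing this number across formats on the same text shows how much quality a given quantization costs.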
The table below is for reference only. Measurements were made on a 4C/8T x86 CPU with AVX2, using 4 threads. The models are RWKV v4 Pile 169M and RWKV v4 Pile 1.5B.
| Format | Perplexity (169M) | Latency, ms (1.5B) | File size, GB (1.5B) |
|---|---|---|---|
| Q4_0 | 17.507 | 76 | 1.53 |
| Q4_1 | 17.187 | 72 | 1.68 |
| Q5_0 | 16.194 | 78 | 1.60 |
| Q5_1 | 15.851 | 81 | 1.68 |
| Q8_0 | 15.652 | 89 | 2.13 |
| FP16 | 15.623 | 117 | 2.82 |
| FP32 | 15.623 | 198 | 5.64 |
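One way to read the table is relative to FP32. The short script below copies the numbers above and reports the perplexity increase, latency speedup and file size ratio for each format; the helper name `relative_to_fp32` is made up for this example.

```python
# (perplexity on 169M, latency in ms on 1.5B, file size in GB on 1.5B),
# copied from the reference table above.
RESULTS = {
    "Q4_0": (17.507, 76, 1.53),
    "Q4_1": (17.187, 72, 1.68),
    "Q5_0": (16.194, 78, 1.60),
    "Q5_1": (15.851, 81, 1.68),
    "Q8_0": (15.652, 89, 2.13),
    "FP16": (15.623, 117, 2.82),
    "FP32": (15.623, 198, 5.64),
}


def relative_to_fp32(fmt):
    """Compare one format against the FP32 row."""
    ppl, latency, size = RESULTS[fmt]
    base_ppl, base_latency, base_size = RESULTS["FP32"]
    return {
        "perplexity_increase": ppl - base_ppl,
        "speedup": base_latency / latency,
        "size_ratio": size / base_size,
    }


for fmt in RESULTS:
    r = relative_to_fp32(fmt)
    print(f"{fmt}: +{r['perplexity_increase']:.3f} ppl, "
          f"{r['speedup']:.2f}x faster, {r['size_ratio']:.2f}x size")
```

These ratios apply only to the hardware and models listed above; your own measurements may rank the formats differently.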
Measurements were made on Intel i7 13700K & NVIDIA 3060 Ti 8 GB. The model is RWKV-4-Pile-169M, 12 layers were offloaded to GPU.
Latency per token in ms shown.
| Format | 1 thread | 2 threads | 4 threads | 8 threads | 24 threads |
|---|---|---|---|---|---|
| Q4_0 | 7.9 | 6.2 | 6.9 | 8.6 | 20 |
| Q4_1 | 7.8 | 6.7 | 6.9 | 8.6 | 21 |
| Q5_1 | 8.1 | 6.7 | 6.9 | 9.0 | 22 |

| Format | 1 thread | 2 threads | 4 threads | 8 threads | 24 threads |
|---|---|---|---|---|---|
| Q4_0 | 59 | 51 | 50 | 54 | 94 |
| Q4_1 | 59 | 51 | 49 | 54 | 94 |
| Q5_1 | 77 | 69 | 67 | 72 | 101 |
Note: since cuBLAS is supported only for ggml_mul_mat(), some CPU resources are still needed to execute the remaining operations.
Measurements were made on CPU AMD Ryzen 9 5900X & GPU AMD Radeon RX 7900 XTX. The model is RWKV-novel-4-World-7B-20230810-ctx128k, 32 layers were offloaded to GPU.
Latency per token in ms shown.
| Format | 1 thread | 2 threads | 4 threads | 8 threads | 24 threads |
|---|---|---|---|---|---|
| f16 | 94 | 91 | 94 | 106 | 944 |
| Q4_0 | 83 | 77 | 75 | 110 | 1692 |
| Q4_1 | 85 | 80 | 85 | 93 | 1691 |
| Q5_1 | 83 | 78 | 83 | 90 | 1115 |
Note: as with cuBLAS, hipBLAS is supported only for ggml_mul_mat(), so some CPU resources are still needed to execute the remaining operations.
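The measurements above show that more threads is not always faster, so it is worth sweeping thread counts on your own machine. Below is a minimal sketch of such a sweep: `measure_latency_ms` times any callable you pass in (a hypothetical stand-in for one inference step), and `best_thread_count` picks the fastest setting from the collected numbers.

```python
import time


def measure_latency_ms(step, n_tokens=16):
    """Average wall-clock time of step() in milliseconds over n_tokens calls."""
    start = time.perf_counter()
    for _ in range(n_tokens):
        step()
    return (time.perf_counter() - start) * 1000.0 / n_tokens


def best_thread_count(latency_by_threads):
    """Return the thread count with the lowest measured latency."""
    return min(latency_by_threads, key=latency_by_threads.get)


# Q4_0 row from the hipBLAS measurements above (threads -> ms per token).
q4_0 = {1: 83, 2: 77, 4: 75, 8: 110, 24: 1692}
print("best thread count for Q4_0:", best_thread_count(q4_0))

# Timing a placeholder workload; replace the lambda with a real inference step.
print("placeholder latency, ms:", measure_latency_ms(lambda: sum(range(1000))))
```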
Requirements: git.
git clone --recursive https://github.com/saharNooby/rwkv.cpp.git
cd rwkv.cpp
Check out Releases, download the appropriate ZIP for your OS and CPU, and extract the rwkv library file into the repository directory.
On Windows: to check whether your CPU supports AVX2 or AVX-512, use CPU-Z.
This option is recommended for maximum performance, because the library would be built specifically for your CPU and OS.
Requirements: CMake or CMake from anaconda, Build Tools for Visual Studio 2019.
cmake .
cmake --build . --config Release
If everything went OK, bin\Release\rwkv.dll file should appear.
Refer to docs/cuBLAS_on_Windows.md for a comprehensive guide.
Refer to docs/hipBLAS_on_Windows.md for a comprehensive guide.
Requirements: CMake (Linux: sudo apt install cmake, MacOS: brew install cmake, Anaconda: cmake package).
cmake .
cmake --build . --config Release
Anaconda & M1 users: please verify that the output of cmake . reports CMAKE_SYSTEM_PROCESSOR: arm64. If it detects x86_64 instead, edit the CMakeLists.txt file and add set(CMAKE_SYSTEM_PROCESSOR "arm64") under the # Compile flags comment.
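A likely cause of a wrong detection is an x86_64 Python or Anaconda installation running under Rosetta. As a quick check before building, the sketch below prints the architecture the current Python interpreter reports; `is_arm64` is a helper defined here, not part of rwkv.cpp.

```python
import platform


def is_arm64(machine):
    """True if a platform.machine() value names a 64-bit ARM CPU."""
    return machine.lower() in ("arm64", "aarch64")


machine = platform.machine()
if is_arm64(machine):
    print(f"Interpreter reports {machine}: native ARM build expected.")
else:
    print(f"Interpreter reports {machine}: on Apple Silicon this suggests "
          "an x86_64 environment, so check CMAKE_SYSTEM_PROCESSOR.")
```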
If everything went OK, librwkv.so (Linux) or librwkv.dylib (MacOS) file should appear in the base repo folder.
cmake . -DRWKV_CUBLAS=ON
cmake --build . --config Release
If everything went OK, librwkv.so (Linux) or librwkv.dylib (MacOS) file should appear in the base repo folder.
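To confirm that a build produced the library, you can check for the file from Python. This sketch assumes it is run from the repository root and uses only the file locations named above; the `LIBRARY_PATHS` table and `expected_library_path` helper are illustrative.

```python
import os
import platform

# Library locations relative to the repository root, as described above.
LIBRARY_PATHS = {
    "Windows": os.path.join("bin", "Release", "rwkv.dll"),
    "Linux": "librwkv.so",
    "Darwin": "librwkv.dylib",  # platform.system() reports MacOS as "Darwin"
}


def expected_library_path(system):
    """Return the expected library path for an OS name, or None if unknown."""
    return LIBRARY_PATHS.get(system)


path = expected_library_path(platform.system())
if path is None:
    print("No known library location for", platform.system())
elif os.path.exists(path):
    print("Found", path)
else:
    print("Missing", path, "(has the build finished successfully?)")
```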
Requirements: Python 3.x with PyTorch.
First, download a model from Hugging Face like this one.
Second, convert it into rwkv.cpp format using the following commands:
# Windows
python python\convert_pytorch_to_ggml.py C:\RWKV-4-Pile-169M-20220807-8023.pth C:\rwkv.cpp-169M.bin FP16
# Linux / MacOS
python python/convert_pytorch_to_ggml.py ~/Downloads/RWKV-4-Pile-169M-20220807-8023.pth ~/Downloads/rwkv.cpp-169M.bin FP16
Optionally, quantize the model into one of the quantized formats from the table above:
# Windows
python python\quantize.py C:\rwkv.cpp-169M.bin C:\rwkv.cpp-169M-Q5_1.bin Q5_1
# Linux / MacOS
python python/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-Q5_1.bin Q5_1
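If you convert several models, the two steps can be scripted. The sketch below only builds the same command lines as shown above and prints them; the helper names and the example file names are placeholders, and the `subprocess.run` call is left commented out so nothing is executed by accident.

```python
import subprocess  # used only if you uncomment the run step below
import sys


def convert_command(src_pth, dst_bin, data_type="FP16"):
    """Command line for converting a PyTorch checkpoint to rwkv.cpp format."""
    return [sys.executable, "python/convert_pytorch_to_ggml.py",
            src_pth, dst_bin, data_type]


def quantize_command(src_bin, dst_bin, fmt="Q5_1"):
    """Command line for quantizing an rwkv.cpp model file."""
    return [sys.executable, "python/quantize.py", src_bin, dst_bin, fmt]


commands = [
    convert_command("RWKV-4-Pile-169M-20220807-8023.pth", "rwkv.cpp-169M.bin"),
    quantize_command("rwkv.cpp-169M.bin", "rwkv.cpp-169M-Q5_1.bin"),
]
for cmd in commands:
    print(" ".join(cmd))
    # Uncomment to actually run the step from the repository root:
    # subprocess.run(cmd, check=True)
```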
Requirements: Python 3.x with numpy. If using Pile or Raven models, tokenizers is also required.
To generate some text, run:
# Windows
python python\generate_completions.py C:\rwkv.cpp-169M-Q5_1.bin
# Linux / MacOS
python python/generate_completions.py ~/Downloads/rwkv.cpp-169M-Q5_1.bin
To chat with a bot, run:
# Windows
python python\chat_with_bot.py C:\rwkv.cpp-169M-Q5_1.bin
# Linux / MacOS
python python/chat_with_bot.py ~/Downloads/rwkv.cpp-169M-Q5_1.bin
Edit generate_completions.py or chat_with_bot.py to change prompts and sampling settings.
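For orientation, sampling settings usually mean a temperature and a top-p (nucleus) cutoff. The sketch below shows what these two parameters do to a list of logits; it is a generic illustration with assumed defaults, not necessarily the exact sampling code used by the scripts.

```python
import math
import random


def sample_logits(logits, temperature=1.0, top_p=0.9, rng=random):
    """Pick a token index using temperature scaling and top-p filtering."""
    if temperature <= 0:
        # Treat non-positive temperature as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


print(sample_logits([1.0, 3.0, 0.5], temperature=0.8, top_p=0.5))
```

Lower temperature and lower top-p both make the output more deterministic; higher values make it more varied.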
The short and simple script inference_example.py demonstrates the use of rwkv.cpp in Python.
To use rwkv.cpp in C/C++, include the header rwkv.h.
from rwkv_cpp import RWKVSharedLibrary, ReservoirRWKV
import numpy as np
# Initialize
library = RWKVSharedLibrary("librwkv.so")
reservoir = ReservoirRWKV(
    shared_library=library,
    model_path="model.bin",
    units=256,
    alpha=1e-4
)
# Train on sequences
X_train = [[1, 2, 3, 4], [5, 6, 7, 8]] # Token sequences
y_train = np.array([[0.1], [0.9]]) # Targets
reservoir.fit(X_train, y_train)
# Predict
predictions = reservoir.predict([1, 2, 3, 4])
See python/reservoir_example.py for complete examples including time series prediction and memory tasks.
To use rwkv.cpp in any other language, see the Bindings section below. If your language is missing, you can try to bind to the C API using the tooling provided by your language.
These projects wrap rwkv.cpp for easier use in other languages/frameworks.
- Golang: seasonjs/rwkv
- Node.js: Atome-FE/llama-node
ggml moves fast, and can occasionally break compatibility with older file formats.
rwkv.cpp will do its best to explain why a model file can't be loaded and what next steps are available to the user.
For reference only, here is a list of the latest versions of rwkv.cpp that supported older formats. No support will be provided for these versions.
- Q4_2, old layout of quantized formats
- Q4_3, Q4_1_O
See also docs/FILE_FORMAT.md for version numbers of rwkv.cpp model files and their changelog.
Please follow the code style described in docs/CODE_STYLE.md.