CUDA 12-first backend inference for Unsloth on Kaggle, optimized for small GGUF models (1B-5B) on dual Tesla T4 GPUs (15GB each, SM 7.5). Built-in C++ libraries: llama.cpp llama-server and NVIDIA NCCL. Split-GPU architecture: GPU 0 handles LLM inference; GPU 1 drives the Graphistry dashboard for visualizing the model's internal neural network architecture.
🌐 Official Documentation | 📖 Tutorial Notebooks | 🚀 Quick Start | 🔧 API Reference
- Installation
- Quick Start
- Multi-GPU Inference
- Unsloth Integration
- Split-GPU Architecture
- Features
- Performance
- Tutorial Notebooks
- Documentation
- Requirements
## Installation

```python
!pip install -q --no-cache-dir --force-reinstall git+https://github.com/llcuda/llcuda.git@v2.2.0
```

Distribution Strategy:
- ✅ GitHub (Primary): Direct pip install from repository
- ✅ HuggingFace (Mirror): Alternative at waqasm86/llcuda
- ❌ NOT on PyPI/piwheels: we do not publish to PyPI
Package Details:
- Python code: ~62 KB (lightweight package)
- Built-in binaries: ~961 MB (llama.cpp + NCCL, auto-downloaded on first import from GitHub Releases)
```python
import llcuda
print(f"llcuda {llcuda.__version__}")  # 2.2.0
```

📘 Full Installation Guide → | 🎯 Platform: Kaggle only (2× Tesla T4)
## Quick Start

- Platform: Kaggle notebook
- GPUs: 2× Tesla T4 (15GB VRAM each, SM 7.5)
- Model Range: 1B-5B parameters (GGUF Q4_K_M quantization)
- Settings: Internet enabled, GPU T4 × 2 selected
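Before downloading a model, it is worth confirming that the notebook actually sees both T4s. A minimal sanity check, assuming the standard Kaggle GPU image where `nvidia-smi` is preinstalled:

```python
import subprocess

# Query GPU index, name, and memory; nvidia-smi ships with Kaggle GPU images
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout
print(out)  # Expect two "Tesla T4, 15360 MiB" rows
assert out.count("T4") == 2, "Select the 'GPU T4 x 2' accelerator in notebook settings"
```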
```python
import llcuda
from huggingface_hub import hf_hub_download

# Download a small GGUF model (1B-5B range)
model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)

# Load on GPU 0 (15GB VRAM)
engine = llcuda.InferenceEngine()
engine.load_model(model_path, silent=True)

result = engine.infer("What is AI?", max_tokens=100)
print(result.text)
```

```python
from llcuda.server import ServerManager

# Start llama-server on GPU 0 (100% allocation)
server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # 100% GPU 0, 0% GPU 1
    flash_attn=1,
)

# GPU 1 is now free for Graphistry visualization
# See Notebook 11 for the complete visualization workflow
```

📘 Quick Start Guide → | 📓 Notebook 01 →
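Once the server from the snippet above is running, any HTTP client can also talk to it directly through llama-server's OpenAI-compatible REST endpoint. A minimal sketch with `requests`, assuming the default `localhost:8080` address:

```python
import requests

# llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Summarize what a GGUF file is."}],
        "max_tokens": 100,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```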
## Multi-GPU Inference

```
┌─────────────────────────────────────────────────────────────────┐
│ KAGGLE DUAL T4 SPLIT-GPU ARCHITECTURE (Optimized) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ GPU 0: Tesla T4 (15GB VRAM, SM 7.5) │
│ ├─ llama.cpp llama-server (C++) │
│ ├─ GGUF Model: 1B-5B params (Q4_K_M) │
│ ├─ VRAM Usage: ~2-6 GB │
│ ├─ Built-in: FlashAttention, CUDA Graphs │
│ └─ tensor-split: "1.0,0.0" (100% GPU 0) │
│ │
│ GPU 1: Tesla T4 (15GB VRAM, SM 7.5) │
│ ├─ Graphistry[ai] Python SDK │
│ ├─ RAPIDS cuGraph (GPU-accelerated PageRank) │
│ ├─ Neural Network Visualization (929 nodes) │
│ ├─ VRAM Usage: ~0.5-2 GB │
│ └─ Free VRAM: ~13 GB for analytics │
│ │
│ Built-in C++ Libraries: llama.cpp + NVIDIA NCCL │
│ │
└─────────────────────────────────────────────────────────────────┘
```
```
┌─────────────────────────────────────────────────────────────────┐
│ KAGGLE DUAL T4 TENSOR-SPLIT (For models >15GB VRAM) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ GPU 0: Tesla T4 (15GB) GPU 1: Tesla T4 (15GB) │
│ ├─ Model Layers 0-39 ├─ Model Layers 40-79 │
│ └─ ~14GB VRAM └─ ~14GB VRAM │
│ │
│ ← tensor-split 0.5,0.5 (NCCL-based) → │
│ │
│ Total: 30GB VRAM for models up to 70B (IQ3_XS) │
│ Note: Not recommended for 1B-5B models (use split-GPU) │
│ │
└─────────────────────────────────────────────────────────────────┘
```
```bash
./bin/llama-server \
  -m model.gguf \
  -ngl 99 \
  --tensor-split 0.5,0.5 \
  --split-mode layer \
  -fa \
  --host 0.0.0.0 \
  --port 8080
```

```python
from llcuda.server import ServerManager
from llcuda.api.multigpu import kaggle_t4_dual_config
from llcuda.api.client import LlamaCppClient

# Get the optimized configuration for Kaggle dual T4
config = kaggle_t4_dual_config()

# Start the server with the multi-GPU configuration
server = ServerManager()
tensor_split_str = ",".join(str(x) for x in config.tensor_split)
server.start_server(
    model_path="model.gguf",
    gpu_layers=config.n_gpu_layers,
    tensor_split=tensor_split_str,
    split_mode="layer",
    flash_attn=1 if config.flash_attention else 0,
)

# Use the OpenAI-compatible API
client = LlamaCppClient("http://localhost:8080")
response = client.chat.create(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```

Note: llama.cpp uses native CUDA tensor-split, NOT NCCL. NCCL is available for PyTorch distributed workloads.
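To confirm the bundled NCCL is usable from PyTorch (the role the note above assigns it), a quick check, assuming the Kaggle image's preinstalled torch:

```python
import torch

# NCCL backs torch.distributed, not llama.cpp's tensor-split
print(torch.cuda.device_count())              # 2 on Kaggle dual T4
print(torch.distributed.is_nccl_available())  # True when the NCCL backend is usable
```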
## Unsloth Integration

Complete workflow from fine-tuning to deployment:
```python
# ═══════════════════════════════════════════════════════════════
# STEP 1: Fine-tune with Unsloth
# ═══════════════════════════════════════════════════════════════
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-1.5B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
# Add LoRA and train...

# ═══════════════════════════════════════════════════════════════
# STEP 2: Export to GGUF
# ═══════════════════════════════════════════════════════════════
model.save_pretrained_gguf(
    "my_model",
    tokenizer,
    quantization_method="q4_k_m",  # Recommended for T4
)

# ═══════════════════════════════════════════════════════════════
# STEP 3: Deploy with llcuda
# ═══════════════════════════════════════════════════════════════
from llcuda.server import ServerManager, ServerConfig

server = ServerManager()
server.start_with_config(ServerConfig(
    model_path="my_model-Q4_K_M.gguf",
    n_gpu_layers=99,
    tensor_split="0.5,0.5",  # Dual T4
    flash_attn=True,
))
```
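Before starting the server in Step 3, a quick sanity check on the exported file can save a failed launch. A sketch, assuming the GGUF landed at the path used above:

```python
from pathlib import Path

# Path from Steps 2-3 above; adjust if Unsloth wrote it elsewhere
gguf = Path("my_model-Q4_K_M.gguf")
assert gguf.exists(), "GGUF export not found, check Step 2 output"
print(f"{gguf.name}: {gguf.stat().st_size / 1e9:.2f} GB")  # roughly 1 GB for a 1.5B Q4_K_M
```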
## Split-GPU Architecture

Run LLM inference on GPU 0 while using GPU 1 for RAPIDS/Graphistry analytics:

```
┌─────────────────┐     ┌─────────────────┐
│ GPU 0 (T4) │ │ GPU 1 (T4) │
├─────────────────┤ ├─────────────────┤
│ llama-server │ │ RAPIDS cuDF │
│ LLM Inference │ │ cuGraph │
│ ~5-12 GB │ │ Graphistry │
└─────────────────┘     └─────────────────┘
```
```python
from llcuda import SplitGPUConfig

# GPU 0: llama-server (LLM inference)
# GPU 1: RAPIDS cuGraph (graph visualization)
config = SplitGPUConfig(llm_gpu=0, graph_gpu=1)
```
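To keep analytics libraries from allocating on the inference GPU, one common pattern is restricting their process to GPU 1 before CUDA initializes. A sketch, assuming RAPIDS (cuDF) is available as on Kaggle GPU images:

```python
import os

# Must be set before any library initializes CUDA in this process
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # analytics code now sees only GPU 1

import cudf  # RAPIDS dataframe library

s = cudf.Series([1, 2, 3])
print(s.sum())  # runs on GPU 1, leaving GPU 0 to llama-server
```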
Visualize your GGUF models as interactive graphs with Notebook 11:

```
┌─────────────────────────────────────────────────────────────────┐
│ GGUF NEURAL NETWORK ARCHITECTURE VISUALIZATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 📊 929 Nodes: Complete Llama-3.2-3B structure │
│ 🔗 981 Edges: All connections and data flows │
│ 🎯 896 Attention Heads: Multi-head attention visualized │
│ 📦 112 Quantization Blocks: Q4_K_M structure revealed │
│ 🌐 Interactive Graphistry Dashboards: Cloud + offline HTML │
│ │
│ ✨ First comprehensive GGUF visualization tool │
│ ✨ GPU-accelerated graph analytics (PageRank, centrality) │
│ ✨ Dual-GPU architecture (inference + visualization) │
│ ✨ Multi-scale: From overview to individual attention heads │
│ │
└─────────────────────────────────────────────────────────────────┘
```
What You Can Visualize:
- Layer-by-layer transformer structure (35 nodes per layer)
- Attention head importance and connectivity
- Quantization block memory layout
- Information flow through the network
- Critical components via PageRank analysis
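The PageRank step in that list is plain cuGraph. A minimal sketch of ranking nodes on the GPU, assuming an edge list like the one Notebook 11 builds (the `src`/`dst` column names here are illustrative):

```python
import cudf
import cugraph

# Toy stand-in for the model-architecture edge list
edges = cudf.DataFrame({"src": [0, 1, 1, 2], "dst": [1, 2, 3, 3]})

G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="src", destination="dst")

# GPU-accelerated PageRank: higher score = more central component
ranks = cugraph.pagerank(G)
print(ranks.sort_values("pagerank", ascending=False).head())
```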
📘 GGUF Visualization Guide → | 📓 Notebook 11 → | 📓 Notebook 12 → | 📓 Notebook 13 →
## Features

| Feature | Description |
|---|---|
| Kaggle-Optimized | Built specifically for Kaggle dual Tesla T4 (15GB × 2, SM 7.5) |
| Small Models | Optimized for 1B-5B params GGUF (Q4_K_M) on single T4 |
| Split-GPU | GPU 0: LLM inference, GPU 1: Graphistry visualization |
| Built-in C++ Libraries | llama.cpp llama-server + NVIDIA NCCL (no compilation needed) |
| FlashAttention | Built-in for all quantizations (2× speedup) |
| Unsloth Backend | CUDA 12-first inference for Unsloth-trained models |
| Graphistry Dashboards | Interactive neural network visualization (929 nodes) |
| OpenAI API | Full llama.cpp server compatibility |
| GGUF Tools | Parse, quantize, analyze GGUF files |
| Auto-download | 62KB package, 961MB binaries from GitHub Releases |
## Performance

| Model | Size | Quantization | VRAM | Tokens/sec | Recommended |
|---|---|---|---|---|---|
| Gemma-3 1B | 1.0B | Q4_K_M | ~1.2 GB | ~50 tok/s | ⭐ Best for fast inference |
| Llama-3.2 1B | 1.2B | Q4_K_M | ~1.3 GB | ~48 tok/s | ⭐ Excellent quality |
| Gemma-2 2B | 2.0B | Q4_K_M | ~1.8 GB | ~45 tok/s | ⭐ Balanced |
| Qwen2.5 3B | 3.0B | Q4_K_M | ~2.3 GB | ~40 tok/s | ⭐ High quality |
| Llama-3.2 3B | 3.2B | Q4_K_M | ~2.5 GB | ~38 tok/s | ⭐ Very capable |
| Gemma-3 4B | 4.0B | Q4_K_M | ~3.0 GB | ~35 tok/s | ⭐ Best quality |
All benchmarks measured on a single Tesla T4 (15GB VRAM, SM 7.5) with FlashAttention enabled
Configuration: GPU 0 for LLM, GPU 1 for Graphistry
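To reproduce rough throughput numbers yourself, timing a single generation gives a ballpark figure. A sketch reusing the `engine` from the Quick Start, assuming the model generates close to `max_tokens` tokens:

```python
import time

n_tokens = 200
t0 = time.perf_counter()
result = engine.infer("Explain quantization in one paragraph.", max_tokens=n_tokens)
dt = time.perf_counter() - t0

# Rough estimate; assumes ~n_tokens were actually generated
print(f"~{n_tokens / dt:.1f} tok/s over {dt:.1f}s")
```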
```
GPU 0 Usage:
├─ 1B model: ~1.2 GB → 13.8 GB free
├─ 2B model: ~1.8 GB → 13.2 GB free
├─ 3B model: ~2.5 GB → 12.5 GB free
├─ 4B model: ~3.0 GB → 12.0 GB free
└─ 5B model: ~3.8 GB → 11.2 GB free
GPU 1 Available:
├─ Graphistry: ~0.5-2 GB
├─ RAPIDS cuGraph: ~0.3 GB
└─ Free for analytics: ~13 GB
```
## Tutorial Notebooks

13 comprehensive Kaggle-ready tutorials in notebooks/:
| # | Notebook | Description |
|---|---|---|
| 01 | Quick Start | 5-minute introduction |
| 02 | Server Setup | Advanced server configuration |
| 03 | Multi-GPU | Dual T4 tensor-split |
| 04 | GGUF Quantization | Complete quantization guide |
| 05 | Unsloth Integration | Train → Export → Deploy |
| 06 | Split-GPU + Graphistry | LLM + RAPIDS analytics |
| 07 | Knowledge Graph Extraction | LLM-powered entity extraction + Graphistry |
| 08 | Document Network Analysis | Document similarity + GPU analytics |
| 09 | Large Models | Deploy 13B+ on dual T4 |
| 10 | Complete Workflow | Production end-to-end pipeline |
| 11 | GGUF Visualization | ⭐ Interactive architecture graphs |
| 12 | Attention Mechanism Explorer | Q-K-V attention patterns + Graphistry |
| 13 | Token Embedding Visualizer | 3D embedding space + Plotly UMAP |
## Documentation

| Document | Description |
|---|---|
| QUICK_START.md | Get started in 5 minutes |
| INSTALL.md | Detailed installation guide |
| CHANGELOG.md | Version history |
| Document | Description |
|---|---|
| docs/INSTALLATION.md | Complete installation reference |
| docs/CONFIGURATION.md | Server & client configuration |
| docs/API_REFERENCE.md | Python API documentation |
| docs/KAGGLE_GUIDE.md | Kaggle-specific guide |
| docs/GGUF_GUIDE.md | GGUF format & quantization |
| docs/TROUBLESHOOTING.md | Common issues & solutions |
| Document | Description |
|---|---|
| CONTRIBUTING.md | How to contribute |
| docs/BUILD_GUIDE.md | Building from source |
## Requirements

- Platform: Kaggle notebooks (https://kaggle.com/code)
- GPUs: 2× Tesla T4 (15GB VRAM each, Compute Capability SM 7.5)
- Python: 3.11+ (pre-installed on Kaggle)
- CUDA: 12.x (pre-installed on Kaggle)
- Accelerator: GPU T4 × 2 (must select dual T4)
- Internet: Enabled (for package installation)
- Persistence: Enabled (for downloaded models)
- Size: 1B-5B parameters recommended
- Format: GGUF (from HuggingFace)
- Quantization: Q4_K_M (best quality/speed balance)
- Source: Unsloth-compatible models preferred
Note: llcuda v2.2.0 is designed and tested exclusively for the Kaggle dual T4 environment. Other platforms are not officially supported.
| File | Size | Platform |
|---|---|---|
| llcuda-v2.2.0-cuda12-kaggle-t4x2.tar.gz | 961 MB | Kaggle 2× T4 |
Build Info:
- CUDA 12.5, SM 7.5 (Turing)
- llama.cpp b7760 (commit 388ce82)
- Build Date: 2026-01-16
- Contents: 13 binaries (llama-server, llama-cli, llama-quantize, etc.)
We welcome contributions! See CONTRIBUTING.md for guidelines.
```bash
# Development setup
git clone https://github.com/llcuda/llcuda.git
cd llcuda
pip install -e ".[dev]"
pytest tests/
```

MIT — see LICENSE
Complete tutorial series for llcuda v2.2.0 on Kaggle dual T4 GPUs. Click the badges to open directly in Kaggle or view on GitHub.
| # | Notebook | Open in Kaggle | Description |
|---|---|---|---|
| 01 | Quick Start | | 5-minute introduction to llcuda |
| 02 | Llama Server Setup | | Server configuration & lifecycle |
| 03 | Multi-GPU Inference | | Dual T4 tensor-split configuration |
| 04 | GGUF Quantization | | K-quants, I-quants, GGUF parsing |
| 05 | Unsloth Integration | | Fine-tune → GGUF → Deploy |
| 06 | Split-GPU + Graphistry | | LLM on GPU 0 + RAPIDS on GPU 1 |
| 07 | Knowledge Graph Extraction | | LLM entity extraction + graph visualization |
| 08 | Document Network Analysis | | Document similarity networks with GPU analytics |
| 09 | Large Models (13B+) | | Deploy large models on dual T4 with tensor-split |
| 10 | Complete Workflow | | Production end-to-end: Setup → Model → Server → Analytics → API |
| 11 | GGUF Visualization ⭐ | | MOST IMPORTANT: Dual-GPU architecture visualization with 8 interactive dashboards |
| 12 | Attention Mechanism Explorer | | Q-K-V attention patterns across all heads with Graphistry dashboards |
| 13 | Token Embedding Visualizer | | 3D embedding space exploration with GPU-accelerated UMAP + Plotly |
| Path | Notebooks | Time | Focus |
|---|---|---|---|
| Quick Start | 01 → 02 → 03 | 1 hour | Get running fast |
| Full Course | 01 → 13 (all) | 5.5 hours | Complete mastery + visualization |
| Unsloth Focus | 01 → 04 → 05 → 10 | 2 hours | Fine-tuning workflow |
| Large Models | 01 → 03 → 09 | 1.5 hours | 70B on Kaggle |
| Visualization ⭐ | 01 → 03 → 04 → 06 → 11 → 12 → 13 | 3.5 hours | Architecture + attention + embeddings |