CUDA 12-first backend inference for Unsloth on Kaggle, optimized for small GGUF models (1B-5B) on dual Tesla T4 GPUs (15GB each, SM 7.5). Built-in C++ libraries: llama.cpp llama-server and NVIDIA NCCL. Split-GPU architecture: GPU 0 handles LLM inference; GPU 1 drives the Graphistry dashboard for visualizing the model's internal neural network architecture.
🌐 Official Documentation | 📖 Tutorial Notebooks | 🚀 Quick Start | 🔧 API Reference
- Installation
- Quick Start
- Multi-GPU Inference
- Unsloth Integration
- Split-GPU Architecture
- Features
- Performance
- Tutorial Notebooks
- Documentation
- Requirements
## Installation

```python
!pip install -q --no-cache-dir --force-reinstall git+https://github.com/llcuda/llcuda.git@v2.2.0
```

Distribution Strategy:
- ✅ GitHub (Primary): Direct pip install from repository
- ✅ HuggingFace (Mirror): Alternative at waqasm86/llcuda
- ❌ NOT on PyPI/piwheels: we do not publish to PyPI
Package Details:
- Python code: ~62 KB (lightweight package)
- Built-in binaries: ~961 MB (llama.cpp + NCCL, auto-downloaded on first import from GitHub Releases)
```python
import llcuda
print(f"llcuda {llcuda.__version__}")  # 2.2.0
```

📘 Full Installation Guide → | 🎯 Platform: Kaggle only (2× Tesla T4)
## Quick Start

- Platform: Kaggle notebook
- GPUs: 2× Tesla T4 (15GB VRAM each, SM 7.5)
- Model Range: 1B-5B parameters (GGUF Q4_K_M quantization)
- Settings: Internet enabled, GPU T4 × 2 selected
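Before downloading a model, it is worth confirming that the notebook actually sees both T4s. A minimal sanity check, assuming the standard Kaggle GPU image where `nvidia-smi` is preinstalled:

```python
import subprocess

# Query GPU index, name, and memory; nvidia-smi ships with Kaggle GPU images
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout
print(out)  # Expect two "Tesla T4, 15360 MiB" rows
assert out.count("T4") == 2, "Select the 'GPU T4 x 2' accelerator in notebook settings"
```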
```python
import llcuda
from huggingface_hub import hf_hub_download

# Download a small GGUF model (1B-5B range)
model_path = hf_hub_download(
    repo_id="unsloth/gemma-3-1b-it-GGUF",
    filename="gemma-3-1b-it-Q4_K_M.gguf",
    local_dir="/kaggle/working/models",
)

# Load on GPU 0 (15GB VRAM)
engine = llcuda.InferenceEngine()
engine.load_model(model_path, silent=True)

result = engine.infer("What is AI?", max_tokens=100)
print(result.text)
```

```python
from llcuda.server import ServerManager

# Start llama-server on GPU 0 (100% allocation)
server = ServerManager()
server.start_server(
    model_path=model_path,
    gpu_layers=99,
    tensor_split="1.0,0.0",  # 100% GPU 0, 0% GPU 1
    flash_attn=1,
)

# GPU 1 is now free for Graphistry visualization
# See Notebook 11 for the complete visualization workflow
```

📘 Quick Start Guide → | 📓 Notebook 01 →
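Once the server from the snippet above is running, any HTTP client can also talk to it directly through llama-server's OpenAI-compatible REST endpoint. A minimal sketch with `requests`, assuming the default `localhost:8080` address:

```python
import requests

# llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Summarize what a GGUF file is."}],
        "max_tokens": 100,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```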
## Multi-GPU Inference

```
┌─────────────────────────────────────────────────────────────────┐
│ KAGGLE DUAL T4 SPLIT-GPU ARCHITECTURE (Optimized) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ GPU 0: Tesla T4 (15GB VRAM, SM 7.5) │
│ ├─ llama.cpp llama-server (C++) │
│ ├─ GGUF Model: 1B-5B params (Q4_K_M) │
│ ├─ VRAM Usage: ~2-6 GB │
│ ├─ Built-in: FlashAttention, CUDA Graphs │
│ └─ tensor-split: "1.0,0.0" (100% GPU 0) │
│ │
│ GPU 1: Tesla T4 (15GB VRAM, SM 7.5) │
│ ├─ Graphistry[ai] Python SDK │
│ ├─ RAPIDS cuGraph (GPU-accelerated PageRank) │
│ ├─ Neural Network Visualization (929 nodes) │
│ ├─ VRAM Usage: ~0.5-2 GB │
│ └─ Free VRAM: ~13 GB for analytics │
│ │
│ Built-in C++ Libraries: llama.cpp + NVIDIA NCCL │
│ │
└─────────────────────────────────────────────────────────────────┘
```
```
┌─────────────────────────────────────────────────────────────────┐
│ KAGGLE DUAL T4 TENSOR-SPLIT (For models >15GB VRAM) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ GPU 0: Tesla T4 (15GB) GPU 1: Tesla T4 (15GB) │
│ ├─ Model Layers 0-39 ├─ Model Layers 40-79 │
│ └─ ~14GB VRAM └─ ~14GB VRAM │
│ │
│ ← tensor-split 0.5,0.5 (NCCL-based) → │
│ │
│ Total: 30GB VRAM for models up to 70B (IQ3_XS) │
│ Note: Not recommended for 1B-5B models (use split-GPU) │
│ │
└─────────────────────────────────────────────────────────────────┘
```
```bash
./bin/llama-server \
  -m model.gguf \
  -ngl 99 \
  --tensor-split 0.5,0.5 \
  --split-mode layer \
  -fa \
  --host 0.0.0.0 \
  --port 8080
```

```python
from llcuda.server import ServerManager
from llcuda.api.multigpu import kaggle_t4_dual_config
from llcuda.api.client import LlamaCppClient

# Get the optimized configuration for Kaggle dual T4
config = kaggle_t4_dual_config()

# Start the server with the multi-GPU configuration
server = ServerManager()
tensor_split_str = ",".join(str(x) for x in config.tensor_split)
server.start_server(
    model_path="model.gguf",
    gpu_layers=config.n_gpu_layers,
    tensor_split=tensor_split_str,
    split_mode="layer",
    flash_attn=1 if config.flash_attention else 0,
)

# Use the OpenAI-compatible API
client = LlamaCppClient("http://localhost:8080")
response = client.chat.create(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```

Note: llama.cpp uses native CUDA tensor-split, NOT NCCL. NCCL is available for PyTorch distributed workloads.
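To confirm the bundled NCCL is usable from PyTorch (the role the note above assigns it), a quick check, assuming the Kaggle image's preinstalled torch:

```python
import torch

# NCCL backs torch.distributed, not llama.cpp's tensor-split
print(torch.cuda.device_count())              # 2 on Kaggle dual T4
print(torch.distributed.is_nccl_available())  # True when the NCCL backend is usable
```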
## Unsloth Integration

Complete workflow from fine-tuning to deployment:
```python
# ═══════════════════════════════════════════════════════════════
# STEP 1: Fine-tune with Unsloth
# ═══════════════════════════════════════════════════════════════
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-1.5B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
# Add LoRA and train...

# ═══════════════════════════════════════════════════════════════
# STEP 2: Export to GGUF
# ═══════════════════════════════════════════════════════════════
model.save_pretrained_gguf(
    "my_model",
    tokenizer,
    quantization_method="q4_k_m",  # Recommended for T4
)

# ═══════════════════════════════════════════════════════════════
# STEP 3: Deploy with llcuda
# ═══════════════════════════════════════════════════════════════
from llcuda.server import ServerManager, ServerConfig

server = ServerManager()
server.start_with_config(ServerConfig(
    model_path="my_model-Q4_K_M.gguf",
    n_gpu_layers=99,
    tensor_split="0.5,0.5",  # Dual T4
    flash_attn=True,
))
```
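Before starting the server in Step 3, a quick sanity check on the exported file can save a failed launch. A sketch, assuming the GGUF landed at the path used above:

```python
from pathlib import Path

# Path from Steps 2-3 above; adjust if Unsloth wrote it elsewhere
gguf = Path("my_model-Q4_K_M.gguf")
assert gguf.exists(), "GGUF export not found, check Step 2 output"
print(f"{gguf.name}: {gguf.stat().st_size / 1e9:.2f} GB")  # roughly 1 GB for a 1.5B Q4_K_M
```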
## Split-GPU Architecture

Run LLM inference on GPU 0 while using GPU 1 for RAPIDS/Graphistry analytics:

```
┌─────────────────┐     ┌─────────────────┐
│ GPU 0 (T4) │ │ GPU 1 (T4) │
├─────────────────┤ ├─────────────────┤
│ llama-server │ │ RAPIDS cuDF │
│ LLM Inference │ │ cuGraph │
│ ~5-12 GB │ │ Graphistry │
└─────────────────┘     └─────────────────┘
```
```python
from llcuda import SplitGPUConfig

# GPU 0: llama-server (LLM inference)
# GPU 1: RAPIDS cuGraph (graph visualization)
config = SplitGPUConfig(llm_gpu=0, graph_gpu=1)
```
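To keep analytics libraries from allocating on the inference GPU, one common pattern is restricting their process to GPU 1 before CUDA initializes. A sketch, assuming RAPIDS (cuDF) is available as on Kaggle GPU images:

```python
import os

# Must be set before any library initializes CUDA in this process
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # analytics code now sees only GPU 1

import cudf  # RAPIDS dataframe library

s = cudf.Series([1, 2, 3])
print(s.sum())  # runs on GPU 1, leaving GPU 0 to llama-server
```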
Visualize your GGUF models as interactive graphs with Notebook 11:

```
┌─────────────────────────────────────────────────────────────────┐
│ GGUF NEURAL NETWORK ARCHITECTURE VISUALIZATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 📊 929 Nodes: Complete Llama-3.2-3B structure │
│ 🔗 981 Edges: All connections and data flows │
│ 🎯 896 Attention Heads: Multi-head attention visualized │
│ 📦 112 Quantization Blocks: Q4_K_M structure revealed │
│ 🌐 Interactive Graphistry Dashboards: Cloud + offline HTML │
│ │
│ ✨ First comprehensive GGUF visualization tool │
│ ✨ GPU-accelerated graph analytics (PageRank, centrality) │
│ ✨ Dual-GPU architecture (inference + visualization) │
│ ✨ Multi-scale: From overview to individual attention heads │
│ │
└─────────────────────────────────────────────────────────────────┘
```
What You Can Visualize:
- Layer-by-layer transformer structure (35 nodes per layer)
- Attention head importance and connectivity
- Quantization block memory layout
- Information flow through the network
- Critical components via PageRank analysis
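The PageRank step in that list is plain cuGraph. A minimal sketch of ranking nodes on the GPU, assuming an edge list like the one Notebook 11 builds (the `src`/`dst` column names here are illustrative):

```python
import cudf
import cugraph

# Toy stand-in for the model-architecture edge list
edges = cudf.DataFrame({"src": [0, 1, 1, 2], "dst": [1, 2, 3, 3]})

G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="src", destination="dst")

# GPU-accelerated PageRank: higher score = more central component
ranks = cugraph.pagerank(G)
print(ranks.sort_values("pagerank", ascending=False).head())
```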
📘 GGUF Visualization Guide → | 📓 Notebook 11 → | 📓 Notebook 12 → | 📓 Notebook 13 →
## Features

| Feature | Description |
|---|---|
| Kaggle-Optimized | Built specifically for Kaggle dual Tesla T4 (15GB × 2, SM 7.5) |
| Small Models | Optimized for 1B-5B params GGUF (Q4_K_M) on single T4 |
| Split-GPU | GPU 0: LLM inference, GPU 1: Graphistry visualization |
| Built-in C++ Libraries | llama.cpp llama-server + NVIDIA NCCL (no compilation needed) |
| FlashAttention | Built-in for all quantizations (2× speedup) |
| Unsloth Backend | CUDA 12-first inference for Unsloth-trained models |
| Graphistry Dashboards | Interactive neural network visualization (929 nodes) |
| OpenAI API | Full llama.cpp server compatibility |
| GGUF Tools | Parse, quantize, analyze GGUF files |
| Auto-download | 62KB package, 961MB binaries from GitHub Releases |
## Performance

| Model | Size | Quantization | VRAM | Tokens/sec | Recommended |
|---|---|---|---|---|---|
| Gemma-3 1B | 1.0B | Q4_K_M | ~1.2 GB | ~50 tok/s | ⭐ Best for fast inference |
| Llama-3.2 1B | 1.2B | Q4_K_M | ~1.3 GB | ~48 tok/s | ⭐ Excellent quality |
| Gemma-2 2B | 2.0B | Q4_K_M | ~1.8 GB | ~45 tok/s | ⭐ Balanced |
| Qwen2.5 3B | 3.0B | Q4_K_M | ~2.3 GB | ~40 tok/s | ⭐ High quality |
| Llama-3.2 3B | 3.2B | Q4_K_M | ~2.5 GB | ~38 tok/s | ⭐ Very capable |
| Gemma-3 4B | 4.0B | Q4_K_M | ~3.0 GB | ~35 tok/s | ⭐ Best quality |
All benchmarks measured on a single Tesla T4 (15GB VRAM, SM 7.5) with FlashAttention enabled
Configuration: GPU 0 for LLM, GPU 1 for Graphistry
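To reproduce rough throughput numbers yourself, timing a single generation gives a ballpark figure. A sketch reusing the `engine` from the Quick Start, assuming the model generates close to `max_tokens` tokens:

```python
import time

n_tokens = 200
t0 = time.perf_counter()
result = engine.infer("Explain quantization in one paragraph.", max_tokens=n_tokens)
dt = time.perf_counter() - t0

# Rough estimate; assumes ~n_tokens were actually generated
print(f"~{n_tokens / dt:.1f} tok/s over {dt:.1f}s")
```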
```
GPU 0 Usage:
├─ 1B model: ~1.2 GB → 13.8 GB free
├─ 2B model: ~1.8 GB → 13.2 GB free
├─ 3B model: ~2.5 GB → 12.5 GB free
├─ 4B model: ~3.0 GB → 12.0 GB free
└─ 5B model: ~3.8 GB → 11.2 GB free
GPU 1 Available:
├─ Graphistry: ~0.5-2 GB
├─ RAPIDS cuGraph: ~0.3 GB
└─ Free for analytics: ~13 GB
```
## Tutorial Notebooks

13 comprehensive Kaggle-ready tutorials in notebooks/:
| # | Notebook | Description |
|---|---|---|
| 01 | Quick Start | 5-minute introduction |
| 02 | Server Setup | Advanced server configuration |
| 03 | Multi-GPU | Dual T4 tensor-split |
| 04 | GGUF Quantization | Complete quantization guide |
| 05 | Unsloth Integration | Train → Export → Deploy |
| 06 | Split-GPU + Graphistry | LLM + RAPIDS analytics |
| 07 | Knowledge Graph Extraction | LLM-powered entity extraction + Graphistry |
| 08 | Document Network Analysis | Document similarity + GPU analytics |
| 09 | Large Models | Deploy 13B+ on dual T4 |
| 10 | Complete Workflow | Production end-to-end pipeline |
| 11 | GGUF Visualization | ⭐ Interactive architecture graphs |
| 12 | Attention Mechanism Explorer | Q-K-V attention patterns + Graphistry |
| 13 | Token Embedding Visualizer | 3D embedding space + Plotly UMAP |
## Documentation

| Document | Description |
|---|---|
| QUICK_START.md | Get started in 5 minutes |
| INSTALL.md | Detailed installation guide |
| CHANGELOG.md | Version history |
| Document | Description |
|---|---|
| docs/INSTALLATION.md | Complete installation reference |
| docs/CONFIGURATION.md | Server & client configuration |
| docs/API_REFERENCE.md | Python API documentation |
| docs/KAGGLE_GUIDE.md | Kaggle-specific guide |
| docs/GGUF_GUIDE.md | GGUF format & quantization |
| docs/TROUBLESHOOTING.md | Common issues & solutions |
| Document | Description |
|---|---|
| CONTRIBUTING.md | How to contribute |
| docs/BUILD_GUIDE.md | Building from source |
## Requirements

- Platform: Kaggle notebooks (https://kaggle.com/code)
- GPUs: 2× Tesla T4 (15GB VRAM each, Compute Capability SM 7.5)
- Python: 3.11+ (pre-installed on Kaggle)
- CUDA: 12.x (pre-installed on Kaggle)
- Accelerator: GPU T4 × 2 (must select dual T4)
- Internet: Enabled (for package installation)
- Persistence: Enabled (for downloaded models)
- Size: 1B-5B parameters recommended
- Format: GGUF (from HuggingFace)
- Quantization: Q4_K_M (best quality/speed balance)
- Source: Unsloth-compatible models preferred
Note: llcuda v2.2.0 is designed and tested exclusively for the Kaggle dual T4 environment. Other platforms are not officially supported.
| File | Size | Platform |
|---|---|---|
| llcuda-v2.2.0-cuda12-kaggle-t4x2.tar.gz | 961 MB | Kaggle 2× T4 |
Build Info:
- CUDA 12.5, SM 7.5 (Turing)
- llama.cpp b7760 (commit 388ce82)
- Build Date: 2026-01-16
- Contents: 13 binaries (llama-server, llama-cli, llama-quantize, etc.)
We welcome contributions! See CONTRIBUTING.md for guidelines.
```bash
# Development setup
git clone https://github.com/llcuda/llcuda.git
cd llcuda
pip install -e ".[dev]"
pytest tests/
```

MIT — see LICENSE
Complete tutorial series for llcuda v2.2.0 on Kaggle dual T4 GPUs. Click the badges to open directly in Kaggle or view on GitHub.
| # | Notebook | Open in Kaggle | Description |
|---|---|---|---|
| 01 | Quick Start | | 5-minute introduction to llcuda |
| 02 | Llama Server Setup | | Server configuration & lifecycle |
| 03 | Multi-GPU Inference | | Dual T4 tensor-split configuration |
| 04 | GGUF Quantization | | K-quants, I-quants, GGUF parsing |
| 05 | Unsloth Integration | | Fine-tune → GGUF → Deploy |
| 06 | Split-GPU + Graphistry | | LLM on GPU 0 + RAPIDS on GPU 1 |
| 07 | Knowledge Graph Extraction | | LLM entity extraction + graph visualization |
| 08 | Document Network Analysis | | Document similarity networks with GPU analytics |
| 09 | Large Models (13B+) | | Deploy large models on dual T4 with tensor-split |
| 10 | Complete Workflow | | Production end-to-end: Setup → Model → Server → Analytics → API |
| 11 | GGUF Visualization ⭐ | | MOST IMPORTANT: Dual-GPU architecture visualization with 8 interactive dashboards |
| 12 | Attention Mechanism Explorer | | Q-K-V attention patterns across all heads with Graphistry dashboards |
| 13 | Token Embedding Visualizer | | 3D embedding space exploration with GPU-accelerated UMAP + Plotly |
| Path | Notebooks | Time | Focus |
|---|---|---|---|
| Quick Start | 01 → 02 → 03 | 1 hour | Get running fast |
| Full Course | 01 → 13 (all) | 5.5 hours | Complete mastery + visualization |
| Unsloth Focus | 01 → 04 → 05 → 10 | 2 hours | Fine-tuning workflow |
| Large Models | 01 → 03 → 09 | 1.5 hours | 70B on Kaggle |
| Visualization ⭐ | 01 → 03 → 04 → 06 → 11 → 12 → 13 | 3.5 hours | Architecture + attention + embeddings |