A Python implementation based on the paper: "Practical Secure Inference Algorithm for Fine-tuned Large Language Model Based on Fully Homomorphic Encryption" by Zhang Ruoyan, Zheng Zhongxiang, Bao Wankang (2025)
Paper: https://arxiv.org/abs/2501.01672
This implementation demonstrates the key concepts from the paper:
- Open-LLM + Private-LoRA Architecture: Splits computation between client (base model) and server (LoRA weights)
- Private Linear Layer (PLL): Protects LoRA weights against model extraction attacks using LWE-hard problem
- CKKS Homomorphic Encryption: Enables computation on encrypted user inputs
- ✅ Privacy-Preserving: User inputs are encrypted before transmission
- ✅ Model Protection: LoRA weights protected by PLL with cryptographic guarantees
- ✅ Practical Efficiency: Minimizes expensive ciphertext operations (1.61 s/token in the paper)
- ✅ Easy to Understand: Simplified implementation showing core concepts
┌─────────────────────────────────────────────────────────────┐
│ CLIENT SIDE │
├─────────────────────────────────────────────────────────────┤
│ 1. Base LLM (plaintext) │
│ - Open-source model (e.g., ChatGLM2-6B) │
│ - Runs locally, no encryption needed │
│ │
│ 2. Encrypt intermediate result │
│ - CKKS encryption before sending to server │
│ │
│ 4. Decrypt final result │
│ - Receive and decrypt server response │
└─────────────────────────────────────────────────────────────┘
↓ ↑ (encrypted)
┌─────────────────────────────────────────────────────────────┐
│ SERVER SIDE │
├─────────────────────────────────────────────────────────────┤
│ 3. Private LoRA Inference │
│ - LoRA matrices (A1, A2) protected by PLL │
│ - Computation on encrypted data │
│ - Returns encrypted result │
└─────────────────────────────────────────────────────────────┘
```bash
# Install dependencies
pip install numpy tenseal
# or install from the requirements file:
pip install -r requirements.txt
```

If TenSEAL must be built from source:

```bash
pip install cmake

# Clone and build
git clone https://github.com/OpenMined/TenSEAL.git
cd TenSEAL
pip install .
```

Basic usage:

```python
from secure_llm_inference import SecureLLMInference
import numpy as np

# Initialize system
model_dim = 128
lora_rank = 8
system = SecureLLMInference(model_dim, lora_rank)

# User input (e.g., token embedding)
user_input = np.random.randn(1, model_dim)

# Run secure inference
result = system.full_inference(user_input)
```

Or run the bundled end-to-end demonstration:

```python
from secure_llm_inference import demo_secure_inference

# This will run a complete demonstration
demo_secure_inference()
```

The PLL transforms a standard linear layer to protect against model extraction:
y = xA + x'(E' ⊙ P) + sA + kq mod q
Where:
- A: Original weight matrix (LoRA matrices)
- x': Auxiliary vector filled with 1s
- E': Small Gaussian noise matrix
- P: Random Bernoulli matrix (dropout-like)
- s: Secret vector with fixed length γ
- k: Random integer matrix
- q: Modulus parameter
Security: Breaking PLL is as hard as solving the Learning with Errors (LWE) problem.
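The masking formula above can be sketched in plain numpy. This is a minimal illustration of the algebra only (the dimensions, modulus, and noise width are illustrative choices, not parameters from the paper), showing in particular that the `kq` term vanishes modulo `q`:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, q = 16, 4, 2**16                      # dimension, rank, modulus (illustrative)

A = rng.integers(0, q, size=(d, r))          # weight matrix A to protect
x = rng.integers(0, q, size=(1, d))          # input row vector x
x_aux = np.ones((1, d), dtype=np.int64)      # auxiliary all-ones vector x'
E = np.rint(rng.normal(0, 2, size=(d, r))).astype(np.int64)  # small Gaussian noise E'
P = rng.integers(0, 2, size=(d, r))          # Bernoulli mask P (dropout-like)
s = rng.integers(0, q, size=(1, d))          # secret vector s
k = rng.integers(0, 4, size=(1, r))          # random integer matrix k

# y = xA + x'(E' ⊙ P) + sA + kq  (mod q)
y = (x @ A + x_aux @ (E * P) + s @ A + k * q) % q
```

Since `k * q ≡ 0 (mod q)`, the `kq` term only randomizes the pre-reduction representation; the published output depends on `A` only through noisy, masked combinations, which is where the LWE-style hardness argument enters.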
Low-Rank Adaptation reduces trainable parameters:
h = Wx + (α/r)ABx
Where:
- W: Frozen pre-trained weights
- A, B: Low-rank matrices (r << d)
- α: Scaling factor
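A minimal numpy sketch of the LoRA forward pass, following the document's `h = Wx + (α/r)ABx` notation (shapes and the zero-initialization of B follow the standard LoRA setup; the sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, alpha = 64, 8, 16          # hidden dim, LoRA rank, scaling (illustrative)

W = rng.standard_normal((d, d))  # frozen pre-trained weight
A = rng.standard_normal((d, r))  # low-rank up-projection
B = np.zeros((r, d))             # low-rank down-projection, zero-initialized as in LoRA

x = rng.standard_normal(d)

# h = Wx + (alpha/r) * A B x  -- the full-rank update AB is never materialized
h = W @ x + (alpha / r) * (A @ (B @ x))
```

Because B starts at zero, the adapted model initially matches the frozen base model exactly; only the 2·d·r low-rank parameters are trained, which is what makes keeping them private on the server cheap.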
- Supports approximate arithmetic on encrypted real numbers
- Enables addition and multiplication on ciphertexts
- Implements rotation for efficient matrix operations
From the paper (ChatGLM2-6B):
- Inference Speed: 1.61 seconds/token
- Comparison: PUMA achieves 200s/token on LLAMA-7B
- Efficiency Gain: ~124x faster
1. User Input Protection:
   - All user data encrypted with CKKS before transmission
   - Server never sees plaintext inputs

2. Model Weight Protection:
   - LoRA weights protected by PLL
   - Model extraction reduced to the LWE problem (provably hard)

3. Communication Security:
   - Only encrypted data transmitted between client and server
This is a simplified demonstration implementation. For production use, you need:
- ✗ Full CKKS matrix multiplication implementation
- ✗ Integration with actual LLMs (ChatGLM2, LLaMA, etc.)
- ✗ Optimized rotation and packing schemes
- ✗ Network protocol for client-server communication
- ✗ Key management system
```bibtex
@article{zhang2025practical,
  title={Practical Secure Inference Algorithm for Fine-tuned Large Language Model Based on Fully Homomorphic Encryption},
  author={Zhang, Ruoyan and Zheng, Zhongxiang and Bao, Wankang},
  journal={arXiv preprint arXiv:2501.01672},
  year={2025}
}
```

- PUMA: Secure Transformer Inference (Dong et al., 2023)
- Iron: Private Transformer Inference (He et al., 2022)
- LoRA: Low-Rank Adaptation (Hu et al., 2021)
- CKKS Homomorphic Encryption (Cheon et al., 2017)
- Paper: https://arxiv.org/abs/2501.01672
- Microsoft SEAL: https://github.com/microsoft/SEAL
- TenSEAL: https://github.com/OpenMined/TenSEAL
- LoRA: https://github.com/microsoft/LoRA
This is an educational implementation. For the official implementation, contact the paper authors:
- Corresponding Author: Zheng Zhongxiang (zhengzx@cuc.edu.cn)
- Affiliation: Communication University of China
Educational/Research purposes. Check with paper authors for commercial use.
Implementation based on concepts from:
- Zhang et al. (2025) - Original paper
- Microsoft Research - SEAL library
- OpenMined - TenSEAL library