Cache-to-Cache (C2C) enables Large Language Models to communicate directly through their KV-Caches, bypassing text generation. By projecting and fusing KV-Caches between models, C2C achieves 8.5–10.5% higher accuracy than the individual models and 3.0–5.0% better performance than text-based communication, with a 2.0× speedup in latency.
Feel free to star the repo or cite the paper if you find it interesting.
@article{fu2025c2c,
title={Cache-to-Cache: Direct Semantic Communication Between Large Language Models},
author={Tianyu Fu and Zihan Min and Hanling Zhang and Jichao Yan and Guohao Dai and Wanli Ouyang and Yu Wang},
journal={arXiv preprint arXiv:2510.03215},
year={2025},
}

Why "Rosetta"? The Python package is named after the Rosetta Stone, the ancient artefact that unlocked the translation of Egyptian hieroglyphs by presenting the same text in multiple scripts. Likewise, C2C translates KV-Cache representations between otherwise independent LLMs, allowing them to speak a common language in a richer and more direct way.
[2025/12] 🧪 Multi-sharer support is now available! Fuse KV-Caches from multiple sharer models into a single receiver. This feature is still preliminary and under active development. See live_chat_example.py for usage.
[2025/11] 🚀 Thank you for the enthusiasm from the community! Live demo is now available! Try C2C in action with side-by-side model comparison.
[2025/10] 🤗 Our paper was featured as the #1 Paper of the Day on Hugging Face Daily Papers.
Combined Wisdom
Only by combining latent semantics from both Qwen2.5 and Qwen3 can this philosophical question be correctly answered.
(Demo video: c2c_demo_video_answer.mp4)
The demo can be reproduced with script/playground/gradio_demo.py.
Create a new environment:
conda create -n rosetta python=3.10
conda activate rosetta

Install the package:

pip install -e .

For training and evaluation, install additional dependencies:

pip install -e ".[training,evaluation]"

Minimal example to load published C2C weights from the Hugging Face collection and run the provided inference script:
import torch
from huggingface_hub import snapshot_download
from script.playground.inference_example import load_rosetta_model, run_inference_example
checkpoint_dir = snapshot_download(
    repo_id="nics-efc/C2C_Fuser",
    allow_patterns=["qwen3_0.6b+qwen2.5_0.5b_Fuser/*"],
)
model_config = {
    "rosetta_config": {
        "base_model": "Qwen/Qwen3-0.6B",
        "teacher_model": "Qwen/Qwen2.5-0.5B-Instruct",
        "checkpoints_dir": f"{checkpoint_dir}/qwen3_0.6b+qwen2.5_0.5b_Fuser/final",
    }
}
rosetta_model, tokenizer = load_rosetta_model(model_config, eval_config={}, device=torch.device("cuda"))
device = rosetta_model.device
prompt = [{"role": "user", "content": "Say hello in one short sentence."}]
input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(input_text, return_tensors="pt").to(device)
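# kv_cache_index marks, per position, which sharer's KV-Cache to fuse:
# [1, 0] applies the projection from sharer 1, [-1, 0] applies no projection.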
instruction_index = torch.tensor([1, 0], dtype=torch.long).repeat(inputs['input_ids'].shape[1] - 1, 1).unsqueeze(0).to(device)
label_index = torch.tensor([-1, 0], dtype=torch.long).repeat(1, 1).unsqueeze(0).to(device)
kv_cache_index = [instruction_index, label_index]
with torch.no_grad():
    sampling_params = {
        'do_sample': False,
        'max_new_tokens': 256
    }
    outputs = rosetta_model.generate(**inputs, kv_cache_index=kv_cache_index, **sampling_params)
output_text = tokenizer.decode(outputs[0, instruction_index.shape[1] + 1:], skip_special_tokens=True)
print(f"C2C output text: {output_text}")We provide an interactive chat example to demonstrate cache-to-cache communication with pre-trained projectors in script/playground/live_chat_example.py.
# Single sharer
python script/playground/live_chat_example.py --checkpoint_dir path/to/checkpoint
# Multiple sharers (teacher models read from each checkpoint's config.json)
python script/playground/live_chat_example.py --checkpoint_dir path/to/ckpt1 path/to/ckpt2

You can apply C2C to your own models with a few lines of code. Here is an example:
import torch
from transformers import AutoModelForCausalLM
from rosetta.model.wrapper import RosettaModel
from rosetta.model.projector import C2CProjector
# Load target (receiver) and source (sharer) models
target_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
source_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
# Create C2C projector for KV-Cache transformation
projector_list = []
for i in range(target_model.config.num_hidden_layers):
    projector = C2CProjector(
        source_dim=128, target_dim=128,
        source_num_heads=8, target_num_heads=8,
        hidden_dim=1024, num_layers=3
    )
    projector_list.append(projector)
# If you want to use a pretrained projector, you can load it from the checkpoint directory
# Wrap with RosettaModel for cache-to-cache communication
c2c_model = RosettaModel(
    model_list=[target_model, source_model],
    base_model_idx=0,
    projector_list=projector_list
)
# Configure layer-wise projection mappings
for layer_idx in range(target_model.config.num_hidden_layers):
    c2c_model.set_projector_config(
        source_model_idx=1, source_model_layer_idx=layer_idx,
        target_model_idx=0, target_model_layer_idx=layer_idx,
        projector_idx=layer_idx
    )
# Generate: kv_cache_index controls when to apply C2C projection
# [1, 0] = apply projection from sharer 1, [-1, 0] = no projection
# `inputs` is a tokenized prompt, e.g. inputs = tokenizer(input_text, return_tensors="pt")
seq_len = inputs.input_ids.shape[1]
instruction_index = torch.tensor([1, 0], dtype=torch.long).repeat(seq_len - 1, 1)[None, :, :]
response_index = torch.tensor([[-1, 0]], dtype=torch.long)[None, :, :]
outputs = c2c_model.generate(
    kv_cache_index=[instruction_index, response_index],
    input_ids=inputs.input_ids,
)

Prepare a training configuration file in recipe/train_recipe/. Specify the base model, teacher model, projector type and parameters, training hyperparameters, dataset, and output directory. See recipe/train_recipe/C2C_0.6+0.5.json for a complete example.
Run training:
# Single GPU
python script/train/SFT_train.py --config recipe/train_recipe/C2C_0.6+0.5.json
# Multi-GPU
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
torchrun --nproc_per_node=8 script/train/SFT_train.py \
    --config recipe/train_recipe/C2C_0.6+0.5.json

During training, only the C2C projector parameters are updated while both source and target models remain frozen.
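If you are wiring up your own training loop with the RosettaModel example above, the same frozen-backbone setup only takes a few lines. This is a minimal sketch; the learning rate is illustrative and SFT_train.py remains the reference implementation.

import torch

# Freeze the receiver and sharer backbones; only the projectors stay trainable.
for model in (target_model, source_model):
    for param in model.parameters():
        param.requires_grad = False

projector_params = [p for proj in projector_list for p in proj.parameters()]
optimizer = torch.optim.AdamW(projector_params, lr=1e-4)  # illustrative hyperparameter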
Prepare an evaluation configuration file in recipe/eval_recipe/. Specify the model configuration with base model, teacher model, checkpoint directory, generation config, evaluation dataset, and output directory. See recipe/eval_recipe/unified_eval.yaml for a complete example.
Run evaluation:
python script/evaluation/unified_evaluator.py --config recipe/eval_recipe/unified_eval.yaml

- rosetta/: The main package for Cache-to-Cache.
  - model/: Core model components.
  - train/: Training utilities.
  - baseline/: Baseline implementations.
  - utils/: Utility functions for evaluation and model registry.
- script/: Scripts for running experiments.
  - train/: Training scripts including SFT_train.py.
  - evaluation/: Evaluation scripts including unified_evaluator.py.
  - dataset/: Dataset preparation scripts.
  - examples/: Usage examples.
- recipe/: Configuration files.
  - train_recipe/: Training configurations (e.g., C2C_0.6+0.5.json).
  - eval_recipe/: Evaluation configurations (e.g., unified_eval.yaml).
Add a new projector architecture in rosetta/model/projector.py. The projector transforms the source model's KV-Cache into the target model's semantic space.

Key components: projection networks (MLP/concat-based), a gating mechanism for layer-wise selection, and temperature-annealed training. See C2CProjector in rosetta/model/projector.py as an example.
# In rosetta/model/projector.py
import torch
import torch.nn as nn

from rosetta.utils.registry import register_model, capture_init_args

@register_model
@capture_init_args
class MyProjector(Projector):
    def __init__(self, source_dim, target_dim, **kwargs):
        super().__init__()
        # Your architecture
        self.projection = nn.Linear(source_dim, target_dim)
        self.gate = nn.Parameter(torch.tensor(0.0))

    def forward(self, source_kv, target_kv):
        # Project the sharer KV-Cache and gate-fuse it with the receiver KV-Cache
        # (illustrative fusion; replace with your own logic)
        gate = torch.sigmoid(self.gate)
        projected_kv = gate * self.projection(source_kv) + (1 - gate) * target_kv
        return projected_kv

Register in configuration: {"projector": {"type": "MyProjector", "params": {...}}}
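The temperature-annealed training mentioned above is not spelled out in this snippet. One common formulation (a sketch only, not the exact C2CProjector implementation) divides the gate logit by a temperature that decays over training, so the gate starts as a soft mix and gradually approaches a hard per-layer selection:

import torch

def annealed_gate(gate_logit: torch.Tensor, step: int, total_steps: int,
                  t_start: float = 5.0, t_end: float = 0.1) -> torch.Tensor:
    # Linearly anneal the temperature from t_start to t_end over training.
    t = t_start + (t_end - t_start) * (step / max(total_steps, 1))
    # High temperature: soft gate, gradients reach every layer.
    # Low temperature: near-binary gate, hard layer-wise selection.
    return torch.sigmoid(gate_logit / t)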
Add a new dataset in rosetta/train/dataset_adapters.py for training with your data.
# In rosetta/train/dataset_adapters.py
from dataclasses import dataclass
from datasets import load_dataset

@dataclass
class MyDatasetConfig(DatasetConfig):
    dataset_name: str = "my_dataset"

    def load(self):
        return load_dataset("path/to/dataset")

def my_formatting_func(examples):
    return {"text": [f"Q: {q}\nA: {a}" for q, a in zip(...)]}

DATASET_CONFIGS["MyDataset"] = MyDatasetConfig
FORMATTING_FUNCS["MyDataset"] = my_formatting_func

Use in configuration: {"data": {"type": "MyDataset"}}
Add evaluation logic in script/evaluation/ following the pattern in unified_evaluator.py. The evaluator loads models, runs inference, and computes metrics for your benchmark dataset.
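Such an evaluator usually reduces to the loop below. This is a minimal sketch that reuses the quick-start helpers, assumes a hypothetical sample format of {"question", "answer"}, and uses a naive substring-match metric; unified_evaluator.py remains the reference implementation.

import torch
from script.playground.inference_example import load_rosetta_model

def evaluate_accuracy(model_config, samples, max_new_tokens=256):
    rosetta_model, tokenizer = load_rosetta_model(
        model_config, eval_config={}, device=torch.device("cuda")
    )
    device = rosetta_model.device
    correct = 0
    for sample in samples:
        prompt = [{"role": "user", "content": sample["question"]}]
        text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer(text, return_tensors="pt").to(device)
        # Same kv_cache_index construction as the quick-start example:
        # fuse the sharer's KV-Cache over the prompt, none over the response.
        instruction_index = torch.tensor([1, 0], dtype=torch.long).repeat(
            inputs["input_ids"].shape[1] - 1, 1).unsqueeze(0).to(device)
        label_index = torch.tensor([-1, 0], dtype=torch.long).repeat(1, 1).unsqueeze(0).to(device)
        with torch.no_grad():
            outputs = rosetta_model.generate(
                **inputs, kv_cache_index=[instruction_index, label_index],
                do_sample=False, max_new_tokens=max_new_tokens)
        prediction = tokenizer.decode(
            outputs[0, instruction_index.shape[1] + 1:], skip_special_tokens=True)
        correct += int(sample["answer"] in prediction)  # naive substring metric
    return correct / len(samples)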
- Qwen3-0.6B + Qwen2.5-0.5B-Instruct
- Qwen3-0.6B + Llama-3.2-1B-Instruct
- Qwen3-0.6B + Qwen3-4B-Base
C2C supports arbitrary model pairs. The framework automatically handles:
- Different hidden dimensions
- Different numbers of layers
- Different attention head configurations
- Different tokenizers
To use custom model pairs, simply specify them in your configurations.
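For example, pointing the quick-start model_config at a different pair only requires changing the model names and the checkpoint directory (the sharer and the path below are illustrative; train or download a matching fuser first):

model_config = {
    "rosetta_config": {
        "base_model": "Qwen/Qwen3-0.6B",                      # receiver
        "teacher_model": "meta-llama/Llama-3.2-1B-Instruct",  # sharer
        "checkpoints_dir": "path/to/your/fuser/final",
    }
}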