Official code for "Mixture of Lookup Experts" (MoLE).
MoLE is a novel edge-friendly LLM architecture. With the same number of activated parameters, MoLE achieves:
- Latency and memory overhead comparable to dense models.
- Performance on par with Mixture-of-Experts (MoE) models.

Dependencies:
- torch 2.0.1
- transformers 4.38.2

Pretraining: please refer to the `pretrain` folder.

Model implementations:
- `modeling_dense.py`
- `modeling_moe.py`
- `modeling_mole.py` (for training)
- `modeling_mole_rep.py` (for inference)

All models listed below were trained on a 100B-token subset of the Pile dataset.
For MoLE, we only provide checkpoints from before re-parameterization (i.e., from the training phase). Re-parameterization can be performed using the script provided below.

| Model | # Activated Params | Hugging Face Repo |
|---|---|---|
| Dense | 160M | 🤗 JieShibo/Dense-160M |
| MoE-10E | 160M | 🤗 JieShibo/MoE-160M-10E |
| MoLE-4E | 160M | 🤗 JieShibo/MoLE-160M-4E |
| MoE-34E | 160M | 🤗 JieShibo/MoE-160M-34E |
| MoLE-16E | 160M | 🤗 JieShibo/MoLE-160M-16E |
| Dense | 410M | 🤗 JieShibo/Dense-410M |
| MoE-10E | 410M | 🤗 JieShibo/MoE-410M-10E |
| MoLE-4E | 410M | 🤗 JieShibo/MoLE-410M-4E |
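
As background for the re-parameterization step, here is a toy sketch of the idea in plain Python. It assumes, as the paper describes, that each expert reads only the token embedding, so its output depends on the token id alone and can be precomputed into a lookup table (LUT). All sizes, weights, and function names are hypothetical and chosen for illustration; this is not the code in `reparameterize.py`.

```python
import random

# Hypothetical toy sizes; real models are far larger.
VOCAB_SIZE, HIDDEN, NUM_EXPERTS = 8, 4, 2
rng = random.Random(0)  # fixed seed so the sketch is reproducible


def rand_matrix(rows, cols):
    """Random rows x cols matrix stored as a list of row lists."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Matrix-vector product for list-based matrices."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def expert_forward(expert, embedding):
    """Toy expert FFN: linear -> ReLU -> linear."""
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(w_in, embedding)]
    return matvec(w_out, hidden)


def mix(gates, expert_outputs):
    """Gate-weighted sum of the expert outputs."""
    return [sum(g * out[i] for g, out in zip(gates, expert_outputs))
            for i in range(HIDDEN)]


embed_table = rand_matrix(VOCAB_SIZE, HIDDEN)
experts = [(rand_matrix(HIDDEN, HIDDEN), rand_matrix(HIDDEN, HIDDEN))
           for _ in range(NUM_EXPERTS)]

# "Re-parameterization" in this toy: run every expert once per token id
# and store the results, so lut[token_id][k] is expert k's output.
lut = [[expert_forward(expert, emb) for expert in experts]
       for emb in embed_table]


def moe_with_experts(token_id, gates):
    """Training-style path: compute each expert on the token embedding."""
    outputs = [expert_forward(expert, embed_table[token_id]) for expert in experts]
    return mix(gates, outputs)


def moe_with_lut(token_id, gates):
    """Inference-style path: fetch precomputed outputs, no expert compute."""
    return mix(gates, lut[token_id])


gates = [0.3, 0.7]  # stand-in for router weights (an assumption of this toy)
for token_id in range(VOCAB_SIZE):
    a = moe_with_experts(token_id, gates)
    b = moe_with_lut(token_id, gates)
    assert all(abs(x - y) < 1e-12 for x, y in zip(a, b))
print("LUT path matches expert path for all", VOCAB_SIZE, "token ids")
```

In this toy, inference reads only the table rows for the current token ids, which suggests why the LUT does not have to live in GPU memory.
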

Re-parameterization:

```bash
python reparameterize.py --from_path <training_model_path> --to_path <inference_model_path>
```

Inference with the re-parameterized model:

```python
from transformers import AutoTokenizer
from modeling_mole_rep import MoleForCausalLM

# Placeholders: set these to your own paths.
model_path = "<inference_model_path>"  # output of reparameterize.py (--to_path)
tokenizer_path = "<tokenizer_path>"

model = MoleForCausalLM.from_pretrained(model_path, device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

inputs = tokenizer("Hello, I am", return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_length=10)
print(tokenizer.decode(tokens[0]))
```

Note that because offloading LUTs depends on file-system support, the demo above still keeps the LUTs in GPU memory. Alternatively, you can try the following demo, which offloads the LUTs to CPU memory. It has not been specially optimized, so it may be inefficient.

```python
from transformers import AutoTokenizer
from modeling_mole_rep import MoleForCausalLM

# Placeholders: set these to your own paths.
model_path = "<inference_model_path>"  # output of reparameterize.py (--to_path)
tokenizer_path = "<tokenizer_path>"

# Load everything on the CPU first.
model = MoleForCausalLM.from_pretrained(model_path, device_map='cpu')

# Move every component except moe_table (the LUT) to the GPU.
model.model.embed_tokens.cuda()
model.model.layers.cuda()
model.model.norm.cuda()
model.lm_head.cuda()
model.model._buffers["causal_mask"] = model.model._buffers["causal_mask"].cuda()

# Keep the LUT in page-locked (pinned) CPU memory for faster host-to-GPU copies.
model.model.moe_table.weight.data = model.model.moe_table.weight.data.pin_memory()

tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
inputs = tokenizer("Hello, I am", return_tensors="pt").to('cuda')
tokens = model.generate(**inputs, max_length=10)
print(tokenizer.decode(tokens[0]))
```

Citation:

```bibtex
@article{jie2025mole,
  title={Mixture of Lookup Experts},
  author={Jie, Shibo and Tang, Yehui and Han, Kai and Li, Yitong and Tang, Duyu and Deng, Zhi-Hong and Wang, Yunhe},
  journal={arXiv preprint arXiv:2503.15798},
  year={2025}
}
```