MoLE

Official code for "Mixture of Lookup Experts".

MoLE is a novel edge-friendly LLM architecture. With the same number of activated parameters, MoLE achieves:

Latency and memory overhead comparable to dense models
Performance on par with Mixture-of-Experts (MoE) models

Environment

torch 2.0.1
transformers 4.38.2

Pretraining

Please refer to the pretrain folder.

Models

Dense Models

modeling_dense.py

MoE Models

modeling_moe.py

MoLE Models

modeling_mole.py (for training)
modeling_mole_rep.py (for inference)

Checkpoints

All these models are trained on a 100B-token subset of the Pile dataset.

For the MoLE model, we only provide the checkpoints before re-parameterization (i.e., for the training phase). Re-parameterization can be performed using the script provided below.

Models	# Activated Params	URL
Dense	160M	🤗 JieShibo/Dense-160M
MoE-10E	160M	🤗 JieShibo/MoE-160M-10E
MoLE-4E	160M	🤗 JieShibo/MoLE-160M-4E
MoE-34E	160M	🤗 JieShibo/MoE-160M-34E
MoLE-16E	160M	🤗 JieShibo/MoLE-160M-16E
Dense	410M	🤗 JieShibo/Dense-410M
MoE-10E	410M	🤗 JieShibo/MoE-410M-10E
MoLE-4E	410M	🤗 JieShibo/MoLE-410M-4E
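The repository IDs in the table follow a regular pattern, so a small helper can build them. The sketch below is an illustration only: `checkpoint_repo` and `download_checkpoint` are hypothetical names that are not part of this repository, and the download step assumes the `huggingface_hub` package is installed (it uses that package's `snapshot_download` function).

```python
def checkpoint_repo(arch, size, num_experts=None):
    """Build a Hub repo ID following the naming pattern in the table above."""
    name = f"{arch}-{size}"
    if num_experts is not None:
        name += f"-{num_experts}E"
    return f"JieShibo/{name}"


def download_checkpoint(repo_id):
    """Fetch all files of a Hub repo into the local cache and return the local directory."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id)


print(checkpoint_repo("Dense", "160M"))    # JieShibo/Dense-160M
print(checkpoint_repo("MoLE", "160M", 4))  # JieShibo/MoLE-160M-4E
```

The directory returned by `download_checkpoint` could then serve as the training-phase model path for a MoLE checkpoint.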

Reparameterize MoLE for Inference

python reparameterize.py --from_path <training_model_path> --to_path <inference_model_path>
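To make the idea behind this step concrete, here is a minimal, self-contained sketch of turning experts into a lookup table. It is not the logic of reparameterize.py. It assumes, following the paper's description, that each expert takes only the token embedding as input, so its output can be precomputed once per vocabulary ID. All names (`make_expert`, `build_lut`, `mix`) and the toy sizes are hypothetical, and the fixed gate values stand in for whatever routing weights the model actually produces.

```python
import random


def make_expert(dim, hidden, rng):
    """Return a tiny two-layer ReLU FFN expert as a closure over random weights."""
    w1 = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(dim)]

    def expert(x):
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
        return [sum(w * hi for w, hi in zip(row, h)) for row in w2]

    return expert


def build_lut(embedding, experts):
    """Precompute every expert's output for every vocabulary ID."""
    return [[expert(vec) for expert in experts] for vec in embedding]


def mix(outputs, gates):
    """Gate-weighted sum of per-expert output vectors."""
    dim = len(outputs[0])
    return [sum(g * out[i] for g, out in zip(gates, outputs)) for i in range(dim)]


rng = random.Random(0)
vocab_size, dim, hidden, num_experts = 8, 4, 6, 3  # toy sizes (assumed)
embedding = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]
experts = [make_expert(dim, hidden, rng) for _ in range(num_experts)]

# "Re-parameterization": the experts are replaced by a table indexed by token ID.
lut = build_lut(embedding, experts)

token_id = 5
gates = [0.2, 0.5, 0.3]  # stand-in for routing weights
computed = mix([e(embedding[token_id]) for e in experts], gates)  # training-style path
looked_up = mix(lut[token_id], gates)                             # inference-style path
print(computed == looked_up)
```

Because the table stores exactly what the experts would have computed, the inference path replaces the expert FFN computation with an index by token ID.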

Inference

from transformers import AutoTokenizer
from modeling_mole_rep import MoleForCausalLM

# model_path: the <inference_model_path> written by reparameterize.py
# tokenizer_path: the tokenizer to use with the model
model = MoleForCausalLM.from_pretrained(model_path, device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

inputs = tokenizer("Hello, I am", return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_length=10)  # max_length includes the prompt tokens
print(tokenizer.decode(tokens[0]))

Offloading the LUTs to storage requires file-system support, so the demo above still keeps the LUTs in GPU memory. Alternatively, you can try the following demo, which offloads the LUTs to CPU memory. This demo has not been specially optimized, so it may be inefficient.

from transformers import AutoTokenizer
from modeling_mole_rep import MoleForCausalLM

# Load the whole model on the CPU, then move everything except the LUT to the GPU.
model = MoleForCausalLM.from_pretrained(model_path, device_map='cpu')
model.model.embed_tokens.cuda()
model.model.layers.cuda()
model.model.norm.cuda()
model.lm_head.cuda()
model.model._buffers["causal_mask"] = model.model._buffers["causal_mask"].cuda()

# The LUT (moe_table) stays in CPU memory; pinning it speeds up host-to-GPU copies.
model.model.moe_table.weight.data = model.model.moe_table.weight.data.pin_memory()

tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
inputs = tokenizer("Hello, I am", return_tensors="pt").to('cuda')
tokens = model.generate(**inputs, max_length=10)  # max_length includes the prompt tokens
print(tokenizer.decode(tokens[0]))

Citation

@article{jie2025mole,
  title={Mixture of Lookup Experts},
  author={Jie, Shibo and Tang, Yehui and Han, Kai and Li, Yitong and Tang, Duyu and Deng, Zhi-Hong and Wang, Yunhe},
  journal={arXiv preprint arXiv:2503.15798},
  year={2025}
}
