Stars
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25).
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
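EAGLE and Medusa are both speculative-decoding methods: a cheap drafter proposes several tokens and the full model verifies them in a single forward pass. The sketch below shows only the generic greedy draft-and-verify loop, not EAGLE's feature-level drafting or Medusa's extra decoding heads; the distilgpt2/gpt2 pairing and the value of `k` are illustrative choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative draft/target pairing; any pair sharing a tokenizer works.
tok = AutoTokenizer.from_pretrained("distilgpt2")
draft = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()
target = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def speculative_greedy(prompt: str, k: int = 4, max_new: int = 32) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    start = ids.shape[1]
    while ids.shape[1] - start < max_new:
        # 1) The cheap model drafts up to k tokens greedily.
        draft_ids = draft.generate(ids, max_new_tokens=k, do_sample=False,
                                   pad_token_id=tok.eos_token_id)
        proposed = draft_ids[0, ids.shape[1]:]
        # 2) One target forward pass scores all drafted positions at once.
        logits = target(draft_ids).logits[0]
        checks = logits[ids.shape[1] - 1 : draft_ids.shape[1] - 1].argmax(-1)
        # 3) Keep the longest prefix where the target agrees with the draft,
        #    then append the target's own token at the first disagreement,
        #    so every iteration gains at least one token.
        agree = (checks == proposed).long().cumprod(0)
        n_ok = int(agree.sum())
        keep = proposed[:n_ok]
        if n_ok < len(proposed):
            keep = torch.cat([keep, checks[n_ok : n_ok + 1]])
        ids = torch.cat([ids, keep.unsqueeze(0)], dim=1)
    return tok.decode(ids[0][start:])

print(speculative_greedy("Speculative decoding works by"))
```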
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
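For context, GPTQ refines plain round-to-nearest (RTN) weight quantization with Hessian-guided error compensation. The sketch below is only the RTN baseline that GPTQ improves on, not GPTQ's algorithm itself; the 4-bit width and tensor shapes are illustrative.

```python
import torch

def rtn_quantize(W: torch.Tensor, bits: int = 4):
    """Per-output-channel symmetric round-to-nearest quantization
    (the baseline; GPTQ additionally compensates error across columns)."""
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True) / qmax  # one scale per row
    q = torch.clamp(torch.round(W / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

W = torch.randn(16, 64)
q, scale = rtn_quantize(W)
W_hat = q.float() * scale                             # dequantize
print("mean abs error:", (W - W_hat).abs().mean().item())
```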
ChatGLM-6B: An Open Bilingual Dialogue Language Model
⚡LLM Zoo is a project that provides data, models, and evaluation benchmarks for large language models.⚡
Chinese LLaMA & Alpaca LLMs, with local CPU/GPU training and deployment
xk-time is a Java toolkit for time conversion, time calculation, time formatting, time parsing, calendars, cron expressions, and natural-language time parsing. Built on Java 8 (JSR-310), it is thread-safe, simple to use, ships more than 70 common date-format templates, supports both the Java 8 time classes and Date, and is lightweight with no third-party dependencies.
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
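SmoothQuant's core identity is easy to demonstrate: pick per-input-channel scales s_j = max|X_j|^α / max|W_j|^(1-α), so that (X / s) · (s ⊙ W) equals X · W exactly while the rescaled activations are much flatter and therefore friendlier to INT8 quantization. A minimal PyTorch sketch, with α = 0.5 and toy shapes as illustrative choices:

```python
import torch

def smooth_scales(act_absmax, weight, alpha=0.5):
    """s_j = max|X_j|^alpha / max|W_j|^(1-alpha), per input channel."""
    w_absmax = weight.abs().amax(dim=0)       # weight is (out_features, in_features)
    return (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)

torch.manual_seed(0)
W = torch.randn(8, 4)                         # toy linear layer: y = x @ W.T
x = torch.randn(2, 4)
x[:, 3] *= 50.0                               # channel 3 is an activation outlier
act_absmax = x.abs().amax(dim=0)              # calibration statistic

s = smooth_scales(act_absmax, W)
x_smooth, W_smooth = x / s, W * s             # fold the scales into the weights
# The product is mathematically unchanged, but x_smooth is far easier to quantize.
assert torch.allclose(x @ W.T, x_smooth @ W_smooth.T, atol=1e-4)
print("activation range before/after:", x.abs().max().item(), x_smooth.abs().max().item())
```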
CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation
A Chinese NLP preprocessing and parsing package: accurate, efficient, and easy to use. www.jionlp.com
Accessible large language models via k-bit quantization for PyTorch.
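bitsandbytes plugs into Hugging Face Transformers through `BitsAndBytesConfig`. A short sketch of loading a model with 4-bit NF4 weights; the model name is just an example, and `device_map="auto"` additionally requires the accelerate package and a CUDA GPU:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 weight quantization via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",            # example model only
    quantization_config=bnb_config,
    device_map="auto",              # requires accelerate and a CUDA GPU
)
tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
inputs = tok("Quantization lets us", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```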
Transformer-related optimizations, including BERT and GPT
This repository contains the code for "Generating Datasets with Pretrained Language Models".
[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
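The unsupervised SimCSE objective is a standard InfoNCE loss in which the positive pair is the same sentence encoded twice under different dropout masks. A self-contained sketch of that loss; the temperature 0.05 follows the paper, and the random tensors below merely stand in for real encoder outputs:

```python
import torch
import torch.nn.functional as F

def simcse_unsup_loss(z1, z2, temp=0.05):
    """InfoNCE over cosine similarities. z1 and z2 are the same batch of
    sentences encoded twice with different dropout masks (batch x dim);
    matching rows are positives, all other rows are in-batch negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.T / temp                    # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0))         # positives lie on the diagonal
    return F.cross_entropy(sim, labels)

# Toy stand-in for two dropout passes through a real sentence encoder.
torch.manual_seed(0)
h = torch.randn(8, 32)
loss = simcse_unsup_loss(F.dropout(h, p=0.1), F.dropout(h, p=0.1))
print(loss.item())
```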
Pretrained language models and related optimization techniques developed by Huawei Noah's Ark Lab.
Implementation of RETRO, DeepMind's retrieval-based attention network, in PyTorch
Google Research
A small Python library for retrofitting autoregressive decoder transformers to use DeepMind's RETRO framework: https://arxiv.org/pdf/2112.04426.pdf
A Keras/TensorFlow 2.0 implementation of BERT, ALBERT, and adapter-BERT.
Easy and Efficient Transformer: a scalable inference solution for large NLP models
[ICML'21 Oral] I-BERT: Integer-only BERT Quantization
ONNX Runtime: cross-platform, high-performance ML inferencing and training accelerator
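Running an exported model with ONNX Runtime's Python API takes only a few lines. In this sketch, "model.onnx" and the (1, 3, 224, 224) input shape are placeholders for whatever model you export:

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # shape depends on the model
outputs = sess.run(None, {input_name: x})              # None = return all outputs
print(outputs[0].shape)
```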
Bolt is a deep learning library with high performance and heterogeneous flexibility.
Accelerated NLP pipelines for fast inference on CPU. Built with Transformers and ONNX Runtime.