A LLaMA2-7B chatbot with memory, running on CPU and optimized using smooth quantization, 4-bit quantization, or Intel® Extension for PyTorch with bfloat16.
Updated Feb 27, 2024 - Python
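The quantization techniques named above (smooth quantization, 4-bit, INT8) all reduce to mapping float values to a narrow integer range with a scale factor. As a minimal, dependency-free illustration of the idea, here is a per-tensor symmetric int8 quantize/dequantize sketch; the function names are illustrative, not an API from any of the listed projects:

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map the observed range
    [-amax, amax] onto the int8 range [-127, 127]."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax else 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    # recover approximate float values from int8 codes and the scale
    return [v * scale for v in q]
```

The rounding step is where quantization error comes from; schemes like SmoothQuant work by reshaping activations so this error stays small.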
An INT8 calibrator for an ONNX model with dynamic batch_size at the input and an NMS module at the output. C++ implementation.
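The core job of an INT8 calibrator is to run representative batches through the model, record the activation range, and derive a quantization scale from it. A simplified sketch of that loop is below; the class and method names are hypothetical and this is not the actual TensorRT/ONNX calibrator interface:

```python
class MinimalInt8Calibrator:
    """Tracks the absolute-maximum activation seen over calibration
    batches and derives a symmetric per-tensor int8 scale from it."""

    def __init__(self):
        self.amax = 0.0

    def observe(self, batch):
        # batch: flat list of activation values from one calibration pass
        self.amax = max(self.amax, max(abs(v) for v in batch))

    def scale(self):
        # map the observed range [-amax, amax] onto int8 [-127, 127]
        return self.amax / 127.0 if self.amax else 1.0
```

Real calibrators (e.g. entropy-based ones) pick the range more carefully than a plain max, but the observe-then-derive-scale structure is the same.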
A quantization framework under development.
A VB.NET API wrapper for LLM inference with chatllm.cpp.
A C# API wrapper for LLM inference with chatllm.cpp.
LLM-Lora-PEFT_accumulate explores optimizations for Large Language Models (LLMs) using PEFT, LoRA, and QLoRA. Contribute experiments and implementations to enhance LLM efficiency. Join discussions and push the boundaries of LLM optimization. Let's make LLMs more efficient together!
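LoRA, which the project above builds on, freezes the base weight W and learns only a low-rank update, so the forward pass becomes x @ W plus a cheap rank-r path. A dependency-free sketch of that forward pass follows; all names are illustrative and this is not the PEFT library API:

```python
def matmul(a, b):
    # naive dense matrix multiply, sufficient for this sketch
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_forward(x, W, lora_down, lora_up, alpha, r):
    """y = x @ W + (alpha / r) * (x @ lora_down) @ lora_up

    W is frozen (d_in x d_out); lora_down (d_in x r) and lora_up
    (r x d_out) are the only trained matrices, so the update has rank r."""
    base = matmul(x, W)
    low_rank = matmul(matmul(x, lora_down), lora_up)  # rank-r bottleneck
    s = alpha / r
    return [[base[i][j] + s * low_rank[i][j]
             for j in range(len(base[0]))] for i in range(len(base))]
```

Because only the two small matrices are trained, the optimizer state shrinks dramatically; QLoRA goes further by also quantizing the frozen W to 4-bit.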