This project demonstrates a chatbot powered by two state-of-the-art large language models, Mistral-7B and Llama2-7B. Both models are optimized for generating human-like responses and can be deployed locally or in the cloud.
The chatbot is built using the Hugging Face Transformers library, Gradio for the web interface, and bitsandbytes for 8-bit quantization to reduce memory usage. This project showcases how to load pre-trained models, generate responses, and create a user-friendly interface for interaction.
- Models:
  - Mistral-7B: A modern language model with 7 billion parameters, optimized for instruction-following tasks.
  - Llama2-7B: A powerful open-source model from Meta, designed for conversational AI.
- Quantization: Support for 8-bit quantization to reduce GPU memory requirements.
- Offline Mode: Ability to run the chatbot offline by downloading model files locally.
- Online Mode: Option to load models directly from Hugging Face Hub (requires internet access and authentication).
- Web Interface: A simple and intuitive Gradio-based UI for seamless user interaction.
- Optimized Performance: Configurations for both GPU (CUDA) and CPU environments.
Version: Apr 2025, created by Gleb 'Faitsuma' Kiryakov
- Model Loading:
  - The project supports two models: Mistral-7B and Llama2-7B.
  - Models can be loaded either from the Hugging Face Hub (online mode) or from local files (offline mode).
  - The code dynamically detects whether a GPU is available and adjusts the device (`cuda` or `cpu`) accordingly, as sketched below.
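A minimal sketch of the loading logic, assuming the Hub model ID below (swap in a local directory for offline mode):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub ID for online mode, or a local folder such as
# "/path/to/local/mistral-model" for offline mode
model_path = "mistralai/Mistral-7B-Instruct-v0.1"

# Detect whether a GPU is available and choose the device accordingly
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)
```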
- Tokenization:
  - The tokenizer processes user input into tokens that the model can understand.
  - Special tokens (e.g., `<bos>`, `<eos>`) are handled automatically, as in the example below.
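Continuing the loading sketch above, encoding and decoding with the tokenizer looks roughly like this:

```python
# Encode user input into token IDs the model understands
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(device)

# Decode token IDs back into text; special tokens such as <bos>/<eos> are skipped
text = tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
```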
- Response Generation:
  - The model generates responses using techniques such as temperature scaling, top-p sampling, and max token length control.
  - Responses are decoded back into human-readable text.
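A sketch of the generation step, continuing the snippets above; the parameter values here are illustrative, not necessarily the project's defaults:

```python
# Generate a response with temperature scaling, top-p (nucleus) sampling,
# and a cap on the number of new tokens
output_ids = model.generate(
    **inputs,
    max_new_tokens=256,   # max token length control
    do_sample=True,
    temperature=0.7,      # temperature scaling
    top_p=0.9,            # top-p (nucleus) sampling
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens into human-readable text
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
```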
- Web Interface:
  - A Gradio-based web interface allows users to interact with the chatbot via a browser.
  - The interface includes a text input box for user queries and a text output box for bot responses.
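A minimal Gradio interface along these lines (a sketch; the actual layout and labels in `main.py` may differ):

```python
import gradio as gr

def chat(query: str) -> str:
    # In the real script this would run the tokenize -> generate -> decode
    # pipeline shown above; a stub keeps the sketch self-contained
    return f"Echo: {query}"

demo = gr.Interface(
    fn=chat,
    inputs=gr.Textbox(label="Your message"),
    outputs=gr.Textbox(label="Bot response"),
    title="AI Chatbot 7B",  # illustrative title
)

demo.launch()  # serves on http://localhost:7860 by default
```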
- Device Management:
  - For GPU setups, 8-bit quantization is supported to reduce memory usage.
  - For CPU-only setups, 8-bit quantization is disabled, and the model runs in full precision.
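One way to express this switch, assuming a transformers version that provides `BitsAndBytesConfig` (a sketch, not necessarily the exact code in `main.py`):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# model_path as defined in the loading sketch above
if torch.cuda.is_available():
    # GPU: load the weights in 8-bit to reduce VRAM usage
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )
else:
    # CPU: 8-bit quantization is disabled; run in full precision
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float32,
    )
```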
- Install the required dependencies:
  ```bash
  pip install torch transformers gradio bitsandbytes
  ```
- Clone this repository:
  ```bash
  git clone https://github.com/Faitsumaru/ai-chatbot-7b
  cd ai-chatbot-7b
  ```
- Run the script:
  ```bash
  python main.py
  ```
- Access the Web Interface:
  - Open your browser and navigate to `http://localhost:7860`.
  - Enter your query in the input box and click "Send" to get a response.
- Download the model files:
  - Mistral-7B: Visit the Mistral-7B-Instruct-v0.1 page on Hugging Face and download the following files: `config.json`, `pytorch_model.bin` (or `model.safetensors`), `tokenizer.model`, `tokenizer_config.json`, `special_tokens_map.json`.
  - Llama2-7B: Visit the Llama2-7B page on Hugging Face and download the same set of files.
- Place the downloaded files in separate folders, e.g.:
  ```
  /path/to/local/mistral-model
  /path/to/local/llama-model
  ```
- Update the script to point to the local paths:
  ```python
  model_path = "/path/to/local/mistral-model"  # or "/path/to/local/llama-model"
  ```
- Run the script:
  ```bash
  python main.py
  ```
- Access the Web Interface:
  - Open your browser and navigate to `http://localhost:7860`.
- Hugging Face Transformers: Documentation
- Gradio: Documentation
- bitsandbytes: Installation Guide
- PyTorch: Documentation
- Mistral-7B Model: Hugging Face Page
- Llama2-7B Model: Hugging Face Page
- Quantization: Bitsandbytes Multi-Backend Support
- SentencePiece Tokenizer: GitHub Repository
- CUDA Installation: NVIDIA CUDA Toolkit
- Python Virtual Environments: venv Documentation
- GPU Requirement: For optimal performance, a GPU with at least 8GB VRAM is recommended. If you don't have a GPU, the model can run on CPU, but it will be slower.
- Offline Mode: Ensure all model files are downloaded and placed in the correct directory before running the script in offline mode.
- Authentication: For online mode, ensure you have a valid Hugging Face token and have accepted the terms of use for the models; a login snippet is sketched at the end of this section.
- Error Handling: If you encounter issues, check the device configuration (`cuda` vs. `cpu`) and ensure all dependencies are installed correctly.
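As referenced in the authentication note above, a minimal login sketch; storing the token in the `HF_TOKEN` environment variable is an assumption, not the project's required setup:

```python
import os
from huggingface_hub import login

# Assumes a Hugging Face token is stored in the HF_TOKEN environment variable
login(token=os.environ["HF_TOKEN"])
```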