This project demonstrates how to build a custom OpenAI-powered chatbot from scratch using only basic packages such as `openai` and `pandas`. Instead of relying on frameworks such as LangChain, the notebook walks through the process step by step so you can understand how large language models interact with external data "under the hood."
- Goal: Incorporate a dataset of your choice into a chatbot so that it can answer domain-specific questions more effectively.
- Approach (a minimal end-to-end sketch follows this list):
  - Load and preprocess a text dataset.
  - Connect to the OpenAI API for baseline Q&A.
  - Augment the prompt with custom dataset snippets.
  - Compare the model’s responses before and after customization.
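The sketch below illustrates that flow under stated assumptions: it uses the `openai` >= 1.0 Python client (the notebook may use the legacy `openai.ChatCompletion` interface instead), the model name `gpt-3.5-turbo`, and a `Description` text column, all of which should be adjusted to match your setup and the actual CSV schema.

```python
# Minimal end-to-end sketch of the approach above (not the notebook's exact code).
# Assumptions: openai >= 1.0 client, model "gpt-3.5-turbo", and a text column
# named "Description" in the CSV -- adjust to your environment and schema.
import os

import pandas as pd
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# 1. Load and lightly preprocess the dataset.
df = pd.read_csv("data/character_descriptions.csv")
snippets = df["Description"].dropna().astype(str).tolist()


def ask(question, context=None):
    """Send a question to the model, optionally prepending dataset context."""
    messages = []
    if context:
        messages.append({
            "role": "system",
            "content": "Answer using only the context below.\n\nContext:\n" + context,
        })
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    return response.choices[0].message.content


question = "Describe one of the characters in the dataset."

# 2. Baseline answer: the model has no knowledge of the custom data.
print(ask(question))

# 3. Augmented answer: the prompt now carries dataset snippets.
context = "\n---\n".join(snippets[:20])  # naive truncation to respect the context window
print(ask(question, context=context))

# 4. Compare the two outputs side by side to judge the effect of customization.
```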
Three datasets are included under the `data/` directory:

- `2023_fashion_trends.csv` – Fashion articles and trend summaries.
- `character_descriptions.csv` – Character bios across plays, films, and series.
- `nyc_food_scrap_drop_off_sites.csv` – Locations and details for NYC composting programs.
For this project, we use `character_descriptions.csv` because it provides rich narrative text that benefits from chatbot augmentation. With this dataset, the chatbot can role-play or answer specific lore-based questions that a generic model would otherwise miss.
```bash
pip install -r requirements.txt
```

If no `requirements.txt` exists, ensure you have:

```bash
pip install openai pandas jupyter
```
Export your OpenAI API key as an environment variable:

```bash
export OPENAI_API_KEY="your_key_here"
```

Or set it directly in the notebook (not recommended for production).
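For example, setting the key from inside a notebook cell might look like the sketch below; this is illustrative only, not the notebook's exact code.

```python
import os

# Not recommended for production: the key ends up stored in the notebook file.
os.environ["OPENAI_API_KEY"] = "your_key_here"

# Downstream code can then read it from the environment, e.g.:
# client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
```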
Start Jupyter and open the project notebook:

```bash
jupyter notebook project.ipynb
```
Follow the cells step by step.
- Introduction & Dataset Rationale – Why this dataset matters.
- Load & Inspect Data – Explore the raw CSV (a short inspection sketch follows this list).
- Baseline Chatbot – Ask questions without custom data.
- Integrating Dataset – Add context from the dataset.
- Comparative Q&A – Show improved responses with customization.
- Conclusion – When and why to use custom data.
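As a taste of the "Load & Inspect Data" step, a quick inspection might look like this sketch; the `Description` column name is an assumption, so use whatever text column the CSV actually contains.

```python
import pandas as pd

df = pd.read_csv("data/character_descriptions.csv")

print(df.shape)    # number of rows and columns
print(df.columns)  # available fields
print(df.head())   # first few character bios

# Rough measure of how text-rich the dataset is (column name is an assumption).
print(df["Description"].str.len().describe())
```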
- Fashion Dataset: Build a chatbot stylist.
- Character Dataset: Create a role-play assistant for writers.
- Food Scrap Dataset: Help NYC residents find compost drop-off locations.
- The dataset must contain at least 20 rows of text-rich data (a quick check is sketched after this list).
- Avoid number-heavy datasets, as models are not optimized for numerical reasoning.
- Always compare model answers before vs. after customization.
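A quick eligibility check along the lines of the first two guidelines could look like this sketch; the text-column heuristic is an assumption rather than part of the project rubric.

```python
import pandas as pd

df = pd.read_csv("data/character_descriptions.csv")

# Guideline 1: at least 20 rows of data.
assert len(df) >= 20, "Dataset has fewer than 20 rows."

# Guideline 2: prefer text-rich columns over number-heavy ones.
text_columns = df.select_dtypes(include="object").columns
print("Text columns:", list(text_columns))
```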
This project is for educational purposes as part of the Generative AI Udacity Nanodegree.