Welcome to my Fragrance Gender Classification project!
This repository contains code, data, and models for predicting the gender category (Male, Female, Unisex) of fragrances using machine learning techniques, based on rich metadata scraped from fragrantica.com.
This project explores the relationship between fragrance composition and its marketed gender.
By leveraging thousands of fragrance records and their attributes (notes, accords, brand, year, ratings, etc.), we train and evaluate several classification models to predict the gender label.
The goals are:
- Achieve high accuracy in classification
- Gain insights into which features are most indicative of gender
- Source: fragrantica.com
- Files:
  - fragrances.csv: Raw scraped data
  - fragrances_cleaned.csv: Cleaned and preprocessed data
  - fragrance_features.csv: Engineered features
  - fragrance_features_train.csv, fragrance_features_val.csv, fragrance_features_test.csv: Train/validation/test splits
  - fragrance_features_important.csv: Top 1000 most important features
- Metadata: Brand, Country, Year, Rating, Rating Count
- Notes: Top, Middle, Base notes (encoded)
- Accords: Main accords (encoded)
- Target: Gender (men, women, unisex)
Feature engineering includes:
- Encoding categorical variables
- Extracting note/accord information
- Selecting the most important features using model-based importance scores
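The encoding and selection steps above can be sketched roughly as follows. This is a minimal illustration on made-up records, not the notebook's actual code: the toy notes, the choice of a random forest as the importance model, and the names `notes`, `gender`, and `top_k` are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy records (hypothetical); the real data lives in the CSV files listed above.
notes = [
    ["rose", "jasmine", "vanilla"],
    ["lavender", "geranium", "cedar"],
    ["rose", "peony", "musk"],
    ["lavender", "cedar", "vetiver"],
    ["bergamot", "musk", "cedar"],
    ["jasmine", "peony", "vanilla"],
]
gender = ["women", "men", "women", "men", "unisex", "women"]

# Multi-hot encode the notes: one binary column per distinct note.
encoder = MultiLabelBinarizer()
X = encoder.fit_transform(notes)
feature_names = encoder.classes_

# Rank columns by impurity-based importance from a random forest
# (assumed here; the project may use a different importance model).
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, gender)

top_k = 5  # the project keeps the top 1000; 5 suits this toy example
top_idx = np.argsort(forest.feature_importances_)[::-1][:top_k]
selected = [feature_names[i] for i in top_idx]
X_selected = X[:, top_idx]

print(selected)
```

Keeping only the highest-ranked columns shrinks the feature matrix while retaining the notes the importance model relies on most.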
We train and compare five classification algorithms:
- Logistic Regression – Benchmark model, interpretable, fast, good for linearly separable data.
- K-Nearest Neighbors (KNN) – Groups fragrances by feature similarity.
- Random Forest – Ensemble of decision trees, provides feature importance.
- AdaBoost – Boosts weak learners, highlights predictive features.
- Gradient Boosting – Sequential tree-based ensemble, captures subtle feature interactions.
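A side-by-side comparison of these five model families might look like the sketch below. It assumes scikit-learn and uses synthetic three-class data from `make_classification` in place of the fragrance features; the hyperparameters shown are library defaults or placeholders, not the values used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the engineered fragrance features (3 classes).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the training split and score it on the validation split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    results[name] = {
        "accuracy": accuracy_score(y_val, pred),
        "f1_weighted": f1_score(y_val, pred, average="weighted"),
    }

for name, scores in results.items():
    print(f"{name}: acc={scores['accuracy']:.3f}, f1={scores['f1_weighted']:.3f}")
```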
Models are fit on the training split, checked on the validation split, and evaluated on the test split.
Hyperparameters are tuned to maximize weighted F1 score and accuracy.
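One common way to tune for weighted F1 is a grid search, sketched below. Note the swap: this example uses scikit-learn's `GridSearchCV` with 5-fold cross-validation rather than the project's fixed validation split, and the parameter grid, the scaling step, and the synthetic data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training features (3 classes).
X, y = make_classification(
    n_samples=300, n_features=15, n_informative=6, n_classes=3, random_state=0
)

# Scale features, then fit a logistic regression.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid over the regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# Pick the setting with the best cross-validated weighted F1.
search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Scoring with `f1_weighted` averages per-class F1 by class support, which matters when the gender classes are unevenly sized.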
- Best Model: Logistic Regression (after feature selection and optimization)
- Metrics: Weighted F1 score and accuracy
- Insights:
  - 🌸 Floral notes strongly indicate feminine fragrances
  - 🌿 Aromatic accords (e.g., lavender, geranium) are masculine indicators
  - ⚖️ Unisex fragrances are harder to classify, often containing niche or less traditional notes
Outputs include:
- Confusion matrices
- Classification reports
- Feature importance analysis
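The first two outputs, along with the weighted F1 metric, can be produced with scikit-learn's metrics module. The sketch below uses a handful of made-up labels purely to show the calls; the predictions are not results from this project.

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

labels = ["men", "women", "unisex"]

# Hypothetical true and predicted labels for eight fragrances.
y_true = ["men", "men", "women", "women", "women", "unisex", "unisex", "men"]
y_pred = ["men", "men", "women", "women", "unisex", "unisex", "men", "men"]

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class precision, recall, and F1, plus macro and weighted averages.
report = classification_report(y_true, y_pred, labels=labels, zero_division=0)

# Weighted F1: per-class F1 averaged by the number of true samples per class.
weighted_f1 = f1_score(y_true, y_pred, labels=labels, average="weighted")

print(cm)
print(report)
print(f"weighted F1: {weighted_f1:.3f}")
```

Reading the matrix row by row shows where each true class ends up, which is a quick way to see how often unisex fragrances are confused with the other two labels.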
- Clone the repository:
    git clone https://github.com/Urbanekda/Fragrance_Classifier.git
    cd Fragrance_Classifier
- Install dependencies:
    pip install -r requirements.txt
- Run the notebook
  - Open Fragrance_Classifier.ipynb in Jupyter Notebook or VS Code
  - Execute cells to preprocess data, train models, and view results
- Model files
  - Pretrained models are saved as .pkl files (e.g., logistic_model_final.pkl, rf_model.pkl)
  - You can load these models for inference or further analysis
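Loading a saved model for inference could look like the sketch below. It assumes the .pkl files were written with Python's standard `pickle` module (they may instead have been saved with joblib, in which case `joblib.load` would be the counterpart). To stay runnable on its own, the example pickles a small stand-in model to a temporary file rather than reading the repository's files.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; real inputs must have the same columns,
# in the same order, as the features the saved model was trained on.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a file such as logistic_model_final.pkl.
    model_path = Path(tmp) / "stand_in_model.pkl"
    with open(model_path, "wb") as fh:
        pickle.dump(LogisticRegression(max_iter=1000).fit(X, y), fh)

    # Loading step: this is the part that applies to the pretrained files.
    with open(model_path, "rb") as fh:
        model = pickle.load(fh)

predictions = model.predict(X[:5])
probabilities = model.predict_proba(X[:5])
print(predictions, probabilities.shape)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code. Loading with a scikit-learn version different from the one used for saving may also raise warnings or errors.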
- Fragrance_Classifier.ipynb: Main notebook with code, analysis, and results
- Fragrance_Classifier.html: HTML export of the notebook
- *.csv: Data files (raw, cleaned, features, splits)
- *.pkl: Saved model files
- requirements.txt: Python dependencies