Classify iris flowers into three species (Setosa, Versicolor, Virginica) based on measurements of their petals and sepals.
The classic Iris dataset from the UCI Repository, loaded via scikit-learn:
- 150 samples (50 per species)
- 4 features:
  - Sepal length (cm)
  - Sepal width (cm)
  - Petal length (cm)
  - Petal width (cm)
- 3 classes: Setosa, Versicolor, Virginica
- Python 3.x
- Libraries:
  - pandas - Data manipulation
  - numpy - Numerical operations
  - matplotlib - Visualization
  - seaborn - Advanced visualization
  - scikit-learn - Machine learning models and metrics
- Load the Iris dataset from scikit-learn
- Create a pandas DataFrame for easier manipulation
- Display basic information and statistics
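
The loading steps above can be sketched as follows. This is a minimal illustration, not necessarily how `iris_classification.py` does it; the variable names `iris` and `df` and the `species` column name are assumptions.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled Iris dataset (a Bunch with .data, .target, .feature_names, ...)
iris = load_iris()

# Build a DataFrame with one column per feature plus a readable species label
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Basic information and summary statistics
df.info()
print(df.describe())
print(df["species"].value_counts())
```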
- Pairplot: Visualize relationships between all feature pairs
- Histograms: Show distribution of each feature by species
- Box Plots: Display feature distributions and outliers
- Correlation Heatmap: Show correlation between features
- Check for missing values (none found)
- Split data into training (80%) and test (20%) sets
- Apply feature scaling using StandardScaler
- Use stratified sampling to maintain class balance
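
A minimal sketch of the preprocessing steps above. The `random_state=42` seed is an assumption added for reproducibility; the scaler is fitted on the training set only so that no information from the test set leaks into training.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Check for missing values
print("Missing values:", int(np.isnan(X).sum()))

# 80/20 split; stratify=y keeps the class proportions equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # seed is an assumption
)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Test class counts:", np.bincount(y_test))
```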
Train and compare three classifiers:
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Decision Tree Classifier
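
One way to train and compare the three classifiers is sketched below. The hyperparameters (`max_iter=1000`, `n_neighbors=5`, `random_state=42`) are assumptions, not values taken from the project script, and wrapping the scaler in a pipeline is a design choice made here for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling matters for Logistic Regression and KNN; trees are scale-invariant
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "K-Nearest Neighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Fit each model and record its accuracy on the held-out test set
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)
    print(f"{name}: {accuracies[name]:.3f}")
```

With only 30 test samples, each misclassification changes accuracy by about 3.3 percentage points, so small differences between models should not be over-interpreted.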
Evaluate models using multiple metrics:
- Accuracy: Fraction of all predictions that are correct
- Precision: Fraction of samples predicted as a class that truly belong to it
- Recall: Fraction of samples of a class that are correctly identified
- F1-Score: Harmonic mean of precision and recall
- Confusion Matrix: Detailed prediction breakdown
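
The metrics above can be computed with `sklearn.metrics`, as in this sketch for a single model. Logistic Regression and macro averaging (an unweighted mean over the three classes) are assumptions made for illustration; the project script may average differently.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=42
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Macro average: compute each metric per class, then take the unweighted mean
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro"
)
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

print(f"Accuracy={accuracy:.3f} Precision={precision:.3f} "
      f"Recall={recall:.3f} F1={f1:.3f}")
print(cm)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```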
Analyze which features are most important for classification using the Decision Tree model.
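
A fitted `DecisionTreeClassifier` exposes impurity-based importances through its `feature_importances_` attribute. The sketch below ranks the four features; the seed and the use of unscaled inputs are assumptions (trees do not depend on feature scale).

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=42
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Importances sum to 1; higher means the feature reduced impurity more
importances = pd.Series(
    tree.feature_importances_, index=iris.feature_names
).sort_values(ascending=False)
print(importances)
```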
Install required libraries:

    pip install pandas numpy matplotlib seaborn scikit-learn

Run the main script:

    python iris_classification.py

All three models typically achieve high accuracy (95%+) on this dataset:
- Logistic Regression: ~97-100%
- K-Nearest Neighbors: ~97-100%
- Decision Tree: ~97-100%
The script generates the following visualization files:
- iris_pairplot.png - Scatter plots of all feature combinations
- iris_distributions.png - Histograms showing feature distributions
- iris_boxplots.png - Box plots for each feature by species
- iris_correlation_heatmap.png - Feature correlation matrix
- iris_confusion_matrices.png - Confusion matrices for all models
- iris_model_comparison.png - Performance metrics comparison
- iris_feature_importance.png - Feature importance ranking
- ✅ Loading and exploring datasets
- ✅ Data visualization techniques (scatter plots, histograms, heatmaps)
- ✅ Data preprocessing and scaling
- ✅ Train-test split methodology
- ✅ Classification modeling with multiple algorithms
- ✅ Model evaluation using various metrics
- ✅ Confusion matrix interpretation
- ✅ Feature importance analysis
- ✅ Model comparison and selection
- Petal measurements (length and width) are typically more discriminative than sepal measurements
- Setosa is linearly separable from the other two species
- Versicolor and Virginica have some overlap, making them slightly harder to distinguish
- All three simple classifiers perform excellently on this dataset
- The dataset is well-balanced with no missing values
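
The separability of Setosa can be checked with a single-feature rule. The 2.5 cm threshold below is an illustrative value chosen for this sketch, not something taken from the project script.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame  # feature columns plus an integer "target" column (0 = setosa)

THRESHOLD_CM = 2.5  # illustrative cut-off (assumption)
is_setosa = df["target"] == 0
below_threshold = df["petal length (cm)"] < THRESHOLD_CM

# Count samples where the rule disagrees with the true Setosa label
mismatches = int((is_setosa != below_threshold).sum())
print(f"Samples misclassified by the single-feature rule: {mismatches}")
```

In scikit-learn's copy of the data, the longest Setosa petal is 1.9 cm and the shortest petal among the other two species is 3.0 cm, so any threshold between those values separates Setosa without error.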
- UCI Machine Learning Repository - Iris Dataset
- Scikit-learn Documentation
- Original Paper by R.A. Fisher (1936)
Created as a beginner-friendly machine learning project to demonstrate classification techniques.
This project is open source and available for educational purposes.