Feature/refacto dataset #84
Pull request overview
This PR refactors SmartDrift’s pre-processing/consistency checks into a new DatasetAnalysis component, and changes the train/test split strategy to address issue #60 (AUC instability on small datasets) by splitting baseline/current separately before concatenation.
Changes:
- Added `DatasetAnalysis` (schema/type/modalities/precision checks + dataset cleaning) and integrated it into `SmartDrift.compile()`.
- Introduced `train_test_split_concat()` and `cat_features_indices()` utilities and updated drift training pipeline accordingly.
- Updated report generation and tests to use the new `DatasetAnalysis` fields; pinned `pandas<3` in dependencies.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `eurybia/core/dataset_analysis.py` | New dataset comparison/cleaning class used before drift model training. |
| `eurybia/core/smartdrift.py` | Refactors compile flow to use DatasetAnalysis and new split/concat logic; updates save/load for DA. |
| `eurybia/utils/utils.py` | Adds categorical feature index helper + split/concat helper used by SmartDrift. |
| `eurybia/report/generation.py` | Updates “Consistency Analysis” panel to read from DatasetAnalysis and gates modalities display. |
| `tests/unit_tests/core/test_dataset_analysis.py` | Adds unit tests for DatasetAnalysis. |
| `tests/unit_tests/core/test_smartdrift.py` | Updates save/load assertions and datetime error message expectation. |
| `tests/integration_tests/test_integration_smartdrift.py` | Adjusts feature-importance threshold to reflect refactor effects. |
| `pyproject.toml` | Pins pandas>=2,<3 due to failing tests on pandas 3.x. |
Pull request overview
This PR refactors SmartDrift’s pre-compile dataset consistency checks into a new DatasetAnalysis class, and updates the drift training split logic to avoid AUC < 0.5 on small identical datasets (issue #60).
Changes:
- Introduces `DatasetAnalysis` to detect schema/type/modalities/float-precision differences and to produce cleaned datasets for drift computation.
- Adds utility helpers for categorical feature indexing and for splitting baseline/current separately before concatenation.
- Updates report generation + tests to use the new `DatasetAnalysis` fields, and constrains `pandas` to `<3`.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `eurybia/core/dataset_analysis.py` | New DatasetAnalysis implementation (consistency checks + dataset cleaning). |
| `eurybia/core/smartdrift.py` | Uses DatasetAnalysis + new split strategy; adds report flag; adjusts persistence. |
| `eurybia/utils/utils.py` | Adds cat_features_indices + train_test_split_concat; updates docstrings. |
| `eurybia/report/generation.py` | Report consistency panel now reads from SmartDrift.da and can optionally show modalities analysis. |
| `pyproject.toml` | Pins pandas>=2,<3. |
| `tests/unit_tests/core/test_dataset_analysis.py` | Adds unit tests for DatasetAnalysis. |
| `tests/unit_tests/core/test_smartdrift.py` | Updates save/load assertions to reflect DatasetAnalysis; updates datetime error message expectation. |
| `tests/unit_tests/utils/test_utils.py` | Adds a regression-ish test around the AUC behavior. |
| `tests/integration_tests/test_integration_smartdrift.py` | Relaxes an integration assertion threshold. |
Pull request overview
This PR refactors dataset consistency checks into a new DatasetAnalysis class and updates SmartDrift.compile() to use it, including a new split strategy to address issue #60 (AUC < 0.5 on small identical datasets).
Changes:
- Added `DatasetAnalysis` (schema/type/modality/precision checks + dataset cleaning) and integrated it into `SmartDrift`.
- Added utilities `cat_features_indices()` and `train_test_split_concat()` and switched SmartDrift training/test creation to the new split approach.
- Updated report generation and tests accordingly; constrained dependencies to `pandas<3`.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `eurybia/core/dataset_analysis.py` | New DatasetAnalysis implementation and cleaning helpers. |
| `eurybia/core/smartdrift.py` | Uses DatasetAnalysis, new split/concat workflow, persistence updates, report flag wiring. |
| `eurybia/utils/utils.py` | Adds cat_features_indices and train_test_split_concat; refreshes docstrings. |
| `eurybia/report/generation.py` | Switches consistency panel to read from sd.da and gates modalities section. |
| `pyproject.toml` | Pins pandas to <3. |
| `tests/unit_tests/core/test_dataset_analysis.py` | New unit tests for DatasetAnalysis. |
| `tests/unit_tests/core/test_smartdrift.py` | Removes legacy consistency tests; updates save/load assertions to use da. |
| `tests/unit_tests/utils/test_utils.py` | Adds regression test covering AUC behavior; updates imports. |
| `tests/integration_tests/test_integration_smartdrift.py` | Adjusts expected drift classifier threshold. |
Pull request overview
This PR refactors dataset consistency checking/cleaning into a new DatasetAnalysis class and updates SmartDrift.compile() to use it before computing drift metrics, addressing issue #60 by changing how train/test splits are performed on baseline/current datasets.
Changes:
- Added `DatasetAnalysis` to detect schema/type/modality/precision differences and to produce cleaned datasets via `clean_datasets()`.
- Updated `SmartDrift` to rely on `DatasetAnalysis`, and changed splitting logic to split baseline/current separately before concatenation (fixing low-AUC edge cases).
- Added utility helpers (`cat_features_indices`, `train_test_split_concat`), updated tests accordingly, and constrained dependencies to `pandas<3`.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `eurybia/core/dataset_analysis.py` | New dataset comparison + cleaning implementation used by SmartDrift. |
| `eurybia/core/smartdrift.py` | Integrates DatasetAnalysis, changes split/concat logic, updates save/load behavior. |
| `eurybia/utils/utils.py` | Adds categorical index detection and split-then-concat helper. |
| `eurybia/report/generation.py` | Updates consistency panel to read from SmartDrift.da and gates modalities output. |
| `pyproject.toml` | Pins pandas to <3. |
| `tests/unit_tests/core/test_dataset_analysis.py` | New unit coverage for DatasetAnalysis. |
| `tests/unit_tests/core/test_smartdrift.py` | Updates save/load assertions and adds target-column error coverage. |
| `tests/unit_tests/utils/test_utils.py` | Adds a regression test around AUC behavior on small datasets. |
| `tests/integration_tests/test_integration_smartdrift.py` | Adjusts an integration assertion threshold. |
Description
This PR introduces a `DatasetAnalysis` class providing attributes and methods to identify inconsistencies between the two datasets before they are evaluated in the `compile()` method: added or removed columns, type mismatches, and differences in modalities or floating-point precision.

While this class does little more than what already existed in the `SmartDrift` class, it is meant to be extended with more data quality functions in the future.
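
To give reviewers a rough picture of the kind of check involved, here is a minimal sketch of a column and dtype comparison between two dataframes. It is not the code from this PR: `compare_schemas` and the keys of the returned dict are hypothetical names chosen for illustration, and the modalities and float-precision checks are left out.

```python
import pandas as pd


def compare_schemas(df_baseline: pd.DataFrame, df_current: pd.DataFrame) -> dict:
    """Hypothetical helper: report column and dtype differences between two dataframes."""
    baseline_cols = set(df_baseline.columns)
    current_cols = set(df_current.columns)
    common = sorted(baseline_cols & current_cols)
    return {
        # Columns present only in the current dataset
        "added_columns": sorted(current_cols - baseline_cols),
        # Columns present only in the baseline dataset
        "removed_columns": sorted(baseline_cols - current_cols),
        # Shared columns whose dtype differs: (baseline dtype, current dtype)
        "type_mismatches": {
            col: (str(df_baseline[col].dtype), str(df_current[col].dtype))
            for col in common
            if df_baseline[col].dtype != df_current[col].dtype
        },
    }


df_baseline = pd.DataFrame({"age": [20, 30], "city": ["Paris", "Lyon"], "old": [1, 2]})
df_current = pd.DataFrame({"age": [20.5, 31.0], "city": ["Paris", "Nice"], "new": [0, 1]})
report = compare_schemas(df_baseline, df_current)
```

In this toy example `age` is an integer column in the baseline and a float column in the current dataset, so it is the only shared column reported as a type mismatch.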
From an instance of `DatasetAnalysis`, a cleaned set of datasets, ready to use by the compiler, is obtained by calling the `clean_datasets()` instance method.

Two functions were added to utils:
- `cat_features_indices`: returns the indices of the object-typed columns of a dataframe
- `train_test_split_concat`: splits both datasets into train and test (using the sklearn method) and concatenates both results, as suggested in issue #60 (AUC below 0.5)

A limit to `pandas<3` is also proposed in the dependencies, as the 3.x branch makes some unit tests fail.

Fixes #60
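
For readers who have not followed issue #60, the split-then-concat idea and the object-column lookup described above can be sketched roughly as follows. This is an illustration under assumptions, not the implementation in `eurybia/utils/utils.py`: the function names `split_then_concat` and `object_column_indices`, the `is_current` label column, and the default `test_size` are all hypothetical. The only library call relied on for the split is scikit-learn's `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def object_column_indices(df: pd.DataFrame) -> list[int]:
    """Return positional indices of object-typed columns (hypothetical stand-in)."""
    return [i for i, dtype in enumerate(df.dtypes) if pd.api.types.is_object_dtype(dtype)]


def split_then_concat(df_baseline, df_current, test_size=0.25, random_state=0):
    """Split each dataset separately, label its rows, then concatenate the parts."""
    train_parts, test_parts = [], []
    for label, df in ((0, df_baseline), (1, df_current)):
        part_train, part_test = train_test_split(
            df, test_size=test_size, random_state=random_state
        )
        # Label rows with their origin: 0 = baseline, 1 = current (assumed convention)
        train_parts.append(part_train.assign(is_current=label))
        test_parts.append(part_test.assign(is_current=label))
    train = pd.concat(train_parts, ignore_index=True)
    test = pd.concat(test_parts, ignore_index=True)
    return train, test


df_baseline = pd.DataFrame(
    {"amount": range(8), "city": pd.Series(list("abababab"), dtype=object)}
)
df_current = df_baseline.copy()
train, test = split_then_concat(df_baseline, df_current)
cat_idx = object_column_indices(df_baseline)
```

Because each dataset is split on its own, the train and test sets each contain the same share of baseline and current rows, even on very small inputs.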
Type of change
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.