
Feature/refacto dataset #84

Merged
guerinclement merged 22 commits into MAIF:master from guerinclement:feature/refacto_dataset
Mar 23, 2026


@guerinclement guerinclement commented Mar 18, 2026

Collaborator

Description

This PR introduces a DatasetAnalysis class that provides attributes and methods to identify inconsistencies between the two datasets before they are evaluated in the compile() method: added or removed columns, type mismatches, and differences in modalities or floating-point precision.

While this class does little more than what already existed in the SmartDrift class, it is meant to be extended with more data quality functions in the future.

Calling the clean_datasets() method on a DatasetAnalysis instance returns a cleaned pair of datasets, ready to be used by the compiler.
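As an illustration of the kind of checks described above, here is a minimal sketch in plain pandas. It is not the DatasetAnalysis implementation: the function names `compare_schemas` and `keep_consistent_columns` are hypothetical, and dropping mismatched columns during cleaning is an assumption made for the example.

```python
import pandas as pd


def compare_schemas(df_baseline: pd.DataFrame, df_current: pd.DataFrame) -> dict:
    """Hypothetical helper: report column and dtype differences between two dataframes."""
    base_cols = set(df_baseline.columns)
    cur_cols = set(df_current.columns)
    common = [c for c in df_baseline.columns if c in cur_cols]
    return {
        # Columns present only in the current dataset
        "added": sorted(cur_cols - base_cols),
        # Columns present only in the baseline dataset
        "removed": sorted(base_cols - cur_cols),
        # Shared columns whose dtypes differ
        "type_mismatch": [
            c for c in common if df_baseline[c].dtype != df_current[c].dtype
        ],
    }


def keep_consistent_columns(df_baseline: pd.DataFrame, df_current: pd.DataFrame):
    """Hypothetical cleaning step (assumption): keep shared columns with matching dtypes."""
    report = compare_schemas(df_baseline, df_current)
    cur_cols = set(df_current.columns)
    kept = [
        c
        for c in df_baseline.columns
        if c in cur_cols and c not in report["type_mismatch"]
    ]
    return df_baseline[kept], df_current[kept]


baseline = pd.DataFrame(
    {"age": [30, 41], "city": pd.Series(["Paris", "Lyon"], dtype=object), "old": [1, 2]}
)
current = pd.DataFrame(
    {"age": [30.5, 41.0], "city": pd.Series(["Paris", "Nice"], dtype=object), "new": [0, 1]}
)

report = compare_schemas(baseline, current)
clean_baseline, clean_current = keep_consistent_columns(baseline, current)
```

In this toy case, `old` is reported as removed, `new` as added, and `age` as a type mismatch (integer versus float), so only `city` survives the cleaning step.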

Two functions were added to utils:

  • cat_features_indices: returns the indices of the object-typed columns of a dataframe
  • train_test_split_concat: splits each dataset into train and test sets (using the sklearn method) and concatenates the results, as suggested in issue AUC below 0.5 #60
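For the first helper, the behaviour described above (indices of object-typed columns) could look roughly like the sketch below. The name `cat_features_indices_sketch` is hypothetical and the body is an assumption based only on the one-line description, not the code merged in eurybia/utils/utils.py.

```python
import pandas as pd


def cat_features_indices_sketch(df: pd.DataFrame) -> list:
    """Return the positional indices of object-typed columns (illustrative only)."""
    return [i for i, dtype in enumerate(df.dtypes) if dtype == object]


df = pd.DataFrame(
    {
        "age": [30, 41],
        # dtype=object is set explicitly so the example does not depend on
        # how a given pandas version infers string columns
        "city": pd.Series(["Paris", "Lyon"], dtype=object),
        "income": [2500.0, 3100.0],
        "job": pd.Series(["nurse", "baker"], dtype=object),
    }
)

indices = cat_features_indices_sketch(df)
```

Here `indices` is `[1, 3]`, the positions of `city` and `job`. Positional indices of this kind are what gradient-boosting libraries typically expect when told which features are categorical.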

This PR also proposes limiting pandas to <3 in the dependencies, as the 3.x branch makes some unit tests fail.

Fixes #60

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • see core/test_dataset_analysis.py file

Copilot AI left a comment


Pull request overview

This PR refactors SmartDrift’s pre-processing/consistency checks into a new DatasetAnalysis component, and changes the train/test split strategy to address issue #60 (AUC instability on small datasets) by splitting baseline/current separately before concatenation.

Changes:

  • Added DatasetAnalysis (schema/type/modalities/precision checks + dataset cleaning) and integrated it into SmartDrift.compile().
  • Introduced train_test_split_concat() and cat_features_indices() utilities and updated drift training pipeline accordingly.
  • Updated report generation and tests to use the new DatasetAnalysis fields; pinned pandas<3 in dependencies.
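The split-then-concatenate strategy mentioned above can be sketched as follows. This is an assumption-laden illustration, not the merged `train_test_split_concat()`: the function name `split_then_concat`, the `target` column name, and the default parameters are all hypothetical. Only `sklearn.model_selection.train_test_split` and `pandas.concat` are real library calls.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_then_concat(df_baseline, df_current, test_size=0.25, random_state=0):
    """Split each dataset separately, then concatenate the train and test parts."""
    # Label rows by origin: 0 = baseline, 1 = current (the drift classifier target)
    base = df_baseline.assign(target=0)
    cur = df_current.assign(target=1)

    # Each dataset contributes the same share of its rows to train and to test
    base_train, base_test = train_test_split(
        base, test_size=test_size, random_state=random_state
    )
    cur_train, cur_test = train_test_split(
        cur, test_size=test_size, random_state=random_state
    )

    train = pd.concat([base_train, cur_train], ignore_index=True)
    test = pd.concat([base_test, cur_test], ignore_index=True)
    return train, test


baseline = pd.DataFrame({"x": range(8)})
current = pd.DataFrame({"x": range(8)})
train, test = split_then_concat(baseline, current)

train_counts = train["target"].value_counts().to_dict()
test_counts = test["target"].value_counts().to_dict()
```

With eight rows per dataset and `test_size=0.25`, both train and test end up with equal numbers of baseline and current rows. Splitting after concatenation gives no such guarantee on small datasets, where the class proportions in train and test can diverge by chance.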

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| eurybia/core/dataset_analysis.py | New dataset comparison/cleaning class used before drift model training. |
| eurybia/core/smartdrift.py | Refactors compile flow to use DatasetAnalysis and new split/concat logic; updates save/load for DA. |
| eurybia/utils/utils.py | Adds categorical feature index helper + split/concat helper used by SmartDrift. |
| eurybia/report/generation.py | Updates “Consistency Analysis” panel to read from DatasetAnalysis and gates modalities display. |
| tests/unit_tests/core/test_dataset_analysis.py | Adds unit tests for DatasetAnalysis. |
| tests/unit_tests/core/test_smartdrift.py | Updates save/load assertions and datetime error message expectation. |
| tests/integration_tests/test_integration_smartdrift.py | Adjusts feature-importance threshold to reflect refactor effects. |
| pyproject.toml | Pins pandas>=2,<3 due to failing tests on pandas 3.x. |


Comment thread eurybia/core/dataset_analysis.py Outdated
Comment thread eurybia/core/smartdrift.py Outdated
Comment thread eurybia/utils/utils.py Outdated
Comment thread eurybia/report/generation.py Outdated
Comment thread eurybia/utils/utils.py
Comment thread eurybia/core/smartdrift.py
Comment thread eurybia/core/smartdrift.py Outdated
Comment thread eurybia/utils/utils.py
Comment thread eurybia/utils/utils.py Outdated
Comment thread eurybia/core/dataset_analysis.py

Copilot AI left a comment


Pull request overview

This PR refactors SmartDrift’s pre-compile dataset consistency checks into a new DatasetAnalysis class, and updates the drift training split logic to avoid AUC < 0.5 on small identical datasets (issue #60).

Changes:

  • Introduces DatasetAnalysis to detect schema/type/modalities/float-precision differences and to produce cleaned datasets for drift computation.
  • Adds utility helpers for categorical feature indexing and for splitting baseline/current separately before concatenation.
  • Updates report generation + tests to use the new DatasetAnalysis fields, and constrains pandas to <3.
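The modality and float-precision checks listed above are not shown in this conversation, so the following is only a sketch of what such checks might look like. Both function names are hypothetical, and measuring precision as the number of decimal digits in each value's shortest representation is an assumption chosen for the example.

```python
from decimal import Decimal

import pandas as pd


def modality_differences(base: pd.Series, cur: pd.Series) -> dict:
    """Hypothetical check: categories seen in only one of the two columns."""
    base_values = set(base.dropna().unique())
    cur_values = set(cur.dropna().unique())
    return {
        "only_in_baseline": sorted(base_values - cur_values),
        "only_in_current": sorted(cur_values - base_values),
    }


def max_decimals(series: pd.Series) -> int:
    """Hypothetical check: largest number of decimal digits among finite values."""

    def decimals(value) -> int:
        # normalize() strips trailing zeros, so 3.0 counts as 0 decimals
        exponent = Decimal(repr(float(value))).normalize().as_tuple().exponent
        return max(0, -exponent)

    # Missing values are dropped first; infinite values are not handled here
    return max((decimals(v) for v in series.dropna()), default=0)


base_city = pd.Series(["Paris", "Lyon", "Paris"])
cur_city = pd.Series(["Paris", "Nice"])
city_diff = modality_differences(base_city, cur_city)

base_price = pd.Series([1.25, 2.5])
cur_price = pd.Series([1.2, 2.5])
base_precision = max_decimals(base_price)
cur_precision = max_decimals(cur_price)
```

In this toy data, `Lyon` appears only in the baseline and `Nice` only in the current dataset, while the price column carries two decimals in the baseline and one in the current dataset. A difference of this kind can let a drift classifier separate the datasets on rounding alone, which is presumably why it is worth flagging before training.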

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| eurybia/core/dataset_analysis.py | New DatasetAnalysis implementation (consistency checks + dataset cleaning). |
| eurybia/core/smartdrift.py | Uses DatasetAnalysis + new split strategy; adds report flag; adjusts persistence. |
| eurybia/utils/utils.py | Adds cat_features_indices + train_test_split_concat; updates docstrings. |
| eurybia/report/generation.py | Report consistency panel now reads from SmartDrift.da and can optionally show modalities analysis. |
| pyproject.toml | Pins pandas>=2,<3. |
| tests/unit_tests/core/test_dataset_analysis.py | Adds unit tests for DatasetAnalysis. |
| tests/unit_tests/core/test_smartdrift.py | Updates save/load assertions to reflect DatasetAnalysis; updates datetime error message expectation. |
| tests/unit_tests/utils/test_utils.py | Adds a regression-ish test around the AUC behavior. |
| tests/integration_tests/test_integration_smartdrift.py | Relaxes an integration assertion threshold. |


Comment thread eurybia/utils/utils.py
Comment thread eurybia/utils/utils.py Outdated
Comment thread eurybia/core/dataset_analysis.py Outdated
Comment thread eurybia/core/dataset_analysis.py Outdated
Comment thread eurybia/core/smartdrift.py Outdated
Comment thread eurybia/utils/utils.py
Comment thread tests/unit_tests/core/test_smartdrift.py Outdated
Comment thread tests/unit_tests/utils/test_utils.py Outdated

Copilot AI left a comment


Pull request overview

This PR refactors dataset consistency checks into a new DatasetAnalysis class and updates SmartDrift.compile() to use it, including a new split strategy to address issue #60 (AUC < 0.5 on small identical datasets).

Changes:

  • Added DatasetAnalysis (schema/type/modality/precision checks + dataset cleaning) and integrated it into SmartDrift.
  • Added utilities cat_features_indices() and train_test_split_concat() and switched SmartDrift training/test creation to the new split approach.
  • Updated report generation and tests accordingly; constrained dependencies to pandas<3.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| eurybia/core/dataset_analysis.py | New DatasetAnalysis implementation and cleaning helpers. |
| eurybia/core/smartdrift.py | Uses DatasetAnalysis, new split/concat workflow, persistence updates, report flag wiring. |
| eurybia/utils/utils.py | Adds cat_features_indices and train_test_split_concat; refreshes docstrings. |
| eurybia/report/generation.py | Switches consistency panel to read from sd.da and gates modalities section. |
| pyproject.toml | Pins pandas to <3. |
| tests/unit_tests/core/test_dataset_analysis.py | New unit tests for DatasetAnalysis. |
| tests/unit_tests/core/test_smartdrift.py | Removes legacy consistency tests; updates save/load assertions to use da. |
| tests/unit_tests/utils/test_utils.py | Adds regression test covering AUC behavior; updates imports. |
| tests/integration_tests/test_integration_smartdrift.py | Adjusts expected drift classifier threshold. |


Comment thread tests/unit_tests/utils/test_utils.py Outdated
Comment thread tests/unit_tests/utils/test_utils.py Outdated
Comment thread tests/unit_tests/core/test_dataset_analysis.py Outdated
Comment thread eurybia/core/dataset_analysis.py Outdated
Comment thread eurybia/core/smartdrift.py Outdated
Comment thread eurybia/core/smartdrift.py Outdated
Comment thread tests/unit_tests/core/test_smartdrift.py Outdated
@guerinclement guerinclement requested a review from Copilot March 19, 2026 08:57

Copilot AI left a comment


Pull request overview

This PR refactors dataset consistency checking/cleaning into a new DatasetAnalysis class and updates SmartDrift.compile() to use it before computing drift metrics, addressing issue #60 by changing how train/test splits are performed on baseline/current datasets.

Changes:

  • Added DatasetAnalysis to detect schema/type/modality/precision differences and to produce cleaned datasets via clean_datasets().
  • Updated SmartDrift to rely on DatasetAnalysis, and changed splitting logic to split baseline/current separately before concatenation (fixing low-AUC edge cases).
  • Added utility helpers (cat_features_indices, train_test_split_concat), updated tests accordingly, and constrained dependencies to pandas<3.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| eurybia/core/dataset_analysis.py | New dataset comparison + cleaning implementation used by SmartDrift. |
| eurybia/core/smartdrift.py | Integrates DatasetAnalysis, changes split/concat logic, updates save/load behavior. |
| eurybia/utils/utils.py | Adds categorical index detection and split-then-concat helper. |
| eurybia/report/generation.py | Updates consistency panel to read from SmartDrift.da and gates modalities output. |
| pyproject.toml | Pins pandas to <3. |
| tests/unit_tests/core/test_dataset_analysis.py | New unit coverage for DatasetAnalysis. |
| tests/unit_tests/core/test_smartdrift.py | Updates save/load assertions and adds target-column error coverage. |
| tests/unit_tests/utils/test_utils.py | Adds a regression test around AUC behavior on small datasets. |
| tests/integration_tests/test_integration_smartdrift.py | Adjusts an integration assertion threshold. |


Comment thread eurybia/core/dataset_analysis.py Outdated
Comment thread eurybia/utils/utils.py
Comment thread eurybia/report/generation.py
Comment thread eurybia/report/generation.py Outdated
Comment thread eurybia/core/smartdrift.py Outdated
Comment thread tests/unit_tests/utils/test_utils.py Outdated
@guerinclement guerinclement merged commit a28e65a into MAIF:master Mar 23, 2026
4 checks passed

Development

Successfully merging this pull request may close these issues.

AUC below 0.5

3 participants