Feature/refacto dataset by guerinclement · Pull Request #84 · MAIF/eurybia

guerinclement · 2026-03-18T15:11:36Z

Description

This PR introduces a DatasetAnalysis class whose attributes and methods identify inconsistencies between the two datasets before they are evaluated in the compile() method: added or removed columns, type mismatches, differing modalities, and floating-point precision differences.

Although this class does little more than what already existed in the SmartDrift class, it is meant to be extended with more data quality functions in the future.

From an instance of DatasetAnalysis, a cleaned pair of datasets, ready for use by the compiler, is obtained by calling the clean_datasets() instance method.
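As an illustration of the kind of checks described above, here is a minimal sketch in plain pandas. It is not the DatasetAnalysis implementation: the function name `compare_datasets`, the keys of the returned dictionary, and the rule used to decide which columns are categorical are all assumptions made for this example.

```python
import pandas as pd


def _is_categorical(series: pd.Series) -> bool:
    # Assumption for this sketch: object- or string-typed columns are categorical.
    return pd.api.types.is_object_dtype(series) or pd.api.types.is_string_dtype(series)


def compare_datasets(df_baseline: pd.DataFrame, df_current: pd.DataFrame) -> dict:
    """Hypothetical helper: report schema, dtype and modality differences."""
    base_cols, cur_cols = set(df_baseline.columns), set(df_current.columns)
    common = [c for c in df_baseline.columns if c in cur_cols]

    # Columns present in both datasets whose dtype differs.
    dtype_mismatches = {
        c: (str(df_baseline[c].dtype), str(df_current[c].dtype))
        for c in common
        if df_baseline[c].dtype != df_current[c].dtype
    }

    # Categorical columns whose sets of modalities differ.
    modality_differences = {}
    for c in common:
        if _is_categorical(df_baseline[c]) and _is_categorical(df_current[c]):
            base_vals = set(df_baseline[c].dropna().unique())
            cur_vals = set(df_current[c].dropna().unique())
            if base_vals != cur_vals:
                modality_differences[c] = {
                    "only_in_baseline": sorted(base_vals - cur_vals),
                    "only_in_current": sorted(cur_vals - base_vals),
                }

    return {
        "removed_columns": sorted(base_cols - cur_cols),
        "added_columns": sorted(cur_cols - base_cols),
        "dtype_mismatches": dtype_mismatches,
        "modality_differences": modality_differences,
    }


baseline = pd.DataFrame({"a": [1, 2], "b": ["x", "y"], "c": [1.0, 2.0]})
current = pd.DataFrame({"a": [1.5, 2.5], "b": ["x", "z"], "d": [0, 1]})
report = compare_datasets(baseline, current)
```

Here `report` flags column `c` as removed, `d` as added, `a` as a dtype mismatch, and `b` as having one modality unique to each dataset.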

Two functions were added to utils:

cat_features_indices: returns the indices of the object-typed columns of a dataframe
train_test_split_concat: splits each dataset into train and test sets (using scikit-learn) and concatenates the results, as suggested in issue #60 (AUC below 0.5)
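The two utilities above could look roughly like the following. This is a sketch under stated assumptions, not the code added to `eurybia/utils/utils.py`: the `_sketch` function names, the signatures, the `target` column name and the 0/1 labelling of baseline and current rows are hypothetical, and scikit-learn's `train_test_split` is assumed to be the "sklearn method" mentioned.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def cat_features_indices_sketch(df: pd.DataFrame) -> list[int]:
    """Positions of the object-typed columns of df."""
    return [i for i, dtype in enumerate(df.dtypes) if dtype == object]


def train_test_split_concat_sketch(
    df_baseline: pd.DataFrame,
    df_current: pd.DataFrame,
    test_size: float = 0.25,
    random_state: int = 0,
    target: str = "target",  # hypothetical label column name
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split each dataset separately, then concatenate the train parts and the test parts."""
    # Label the origin of each row: 0 for baseline, 1 for current (assumption).
    base = df_baseline.assign(**{target: 0})
    cur = df_current.assign(**{target: 1})

    base_train, base_test = train_test_split(base, test_size=test_size, random_state=random_state)
    cur_train, cur_test = train_test_split(cur, test_size=test_size, random_state=random_state)

    train = pd.concat([base_train, cur_train], ignore_index=True)
    test = pd.concat([base_test, cur_test], ignore_index=True)
    return train, test


baseline = pd.DataFrame({"num": range(8), "cat": pd.Series(list("abababab"), dtype=object)})
current = baseline.copy()

cat_idx = cat_features_indices_sketch(baseline)
train, test = train_test_split_concat_sketch(baseline, current)
```

Splitting each dataset on its own means both the train and the test set contain baseline and current rows in the same proportion, whatever the sample size. A single split of the concatenated data gives no such guarantee on small datasets, which is presumably what issue #60 ran into.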

A pandas<3 constraint is also proposed in the dependencies, as the 3.x branch makes some unit tests fail.

Fixes #60

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

see core/test_dataset_analysis.py file

Copilot

Pull request overview

This PR refactors SmartDrift’s pre-processing/consistency checks into a new DatasetAnalysis component, and changes the train/test split strategy to address issue #60 (AUC instability on small datasets) by splitting baseline/current separately before concatenation.

Changes:

Added DatasetAnalysis (schema/type/modalities/precision checks + dataset cleaning) and integrated it into SmartDrift.compile().
Introduced train_test_split_concat() and cat_features_indices() utilities and updated drift training pipeline accordingly.
Updated report generation and tests to use the new DatasetAnalysis fields; pinned pandas<3 in dependencies.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.

Show a summary per file

File	Description
`eurybia/core/dataset_analysis.py`	New dataset comparison/cleaning class used before drift model training.
`eurybia/core/smartdrift.py`	Refactors compile flow to use `DatasetAnalysis` and new split/concat logic; updates save/load for DA.
`eurybia/utils/utils.py`	Adds categorical feature index helper + split/concat helper used by `SmartDrift`.
`eurybia/report/generation.py`	Updates “Consistency Analysis” panel to read from `DatasetAnalysis` and gates modalities display.
`tests/unit_tests/core/test_dataset_analysis.py`	Adds unit tests for `DatasetAnalysis`.
`tests/unit_tests/core/test_smartdrift.py`	Updates save/load assertions and datetime error message expectation.
`tests/integration_tests/test_integration_smartdrift.py`	Adjusts feature-importance threshold to reflect refactor effects.
`pyproject.toml`	Pins `pandas>=2,<3` due to failing tests on pandas 3.x.

Copilot

Pull request overview

This PR refactors SmartDrift’s pre-compile dataset consistency checks into a new DatasetAnalysis class, and updates the drift training split logic to avoid AUC < 0.5 on small identical datasets (issue #60).

Changes:

Introduces DatasetAnalysis to detect schema/type/modalities/float-precision differences and to produce cleaned datasets for drift computation.
Adds utility helpers for categorical feature indexing and for splitting baseline/current separately before concatenation.
Updates report generation + tests to use the new DatasetAnalysis fields, and constrains pandas to <3.
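The review summary does not say how the float-precision differences listed above are measured. One plausible approach, shown here purely as an assumption, is to compare the largest number of decimal digits found in each float column shared by the two datasets. The helper names `max_decimals` and `precision_differences` are hypothetical.

```python
import pandas as pd


def max_decimals(series: pd.Series) -> int:
    """Largest number of decimal digits among the values of a float column."""

    def decimals(value: float) -> int:
        text = repr(float(value))
        # Values printed in scientific notation, inf and nan are ignored in this sketch.
        if "e" in text or "." not in text:
            return 0
        frac = text.split(".")[1]
        return 0 if frac == "0" else len(frac)

    return max((decimals(v) for v in series.dropna()), default=0)


def precision_differences(df_baseline: pd.DataFrame, df_current: pd.DataFrame) -> dict:
    """Map each shared float column to (baseline, current) decimals when they differ."""
    diffs = {}
    for col in df_baseline.columns.intersection(df_current.columns):
        both_float = pd.api.types.is_float_dtype(df_baseline[col]) and pd.api.types.is_float_dtype(
            df_current[col]
        )
        if both_float:
            base_dec = max_decimals(df_baseline[col])
            cur_dec = max_decimals(df_current[col])
            if base_dec != cur_dec:
                diffs[col] = (base_dec, cur_dec)
    return diffs


baseline = pd.DataFrame({"x": [1.25, 2.5], "y": [1.0, 2.0]})
current = pd.DataFrame({"x": [1.3, 2.5], "y": [3.0, 4.0]})
diffs = precision_differences(baseline, current)
```

A difference of this kind usually means one dataset was rounded during export. A drift classifier can pick up on that even when the underlying distribution has not changed.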

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
`eurybia/core/dataset_analysis.py`	New DatasetAnalysis implementation (consistency checks + dataset cleaning).
`eurybia/core/smartdrift.py`	Uses DatasetAnalysis + new split strategy; adds report flag; adjusts persistence.
`eurybia/utils/utils.py`	Adds `cat_features_indices` + `train_test_split_concat`; updates docstrings.
`eurybia/report/generation.py`	Report consistency panel now reads from `SmartDrift.da` and can optionally show modalities analysis.
`pyproject.toml`	Pins `pandas>=2,<3`.
`tests/unit_tests/core/test_dataset_analysis.py`	Adds unit tests for DatasetAnalysis.
`tests/unit_tests/core/test_smartdrift.py`	Updates save/load assertions to reflect DatasetAnalysis; updates datetime error message expectation.
`tests/unit_tests/utils/test_utils.py`	Adds a regression-ish test around the AUC behavior.
`tests/integration_tests/test_integration_smartdrift.py`	Relaxes an integration assertion threshold.

Copilot

Pull request overview

This PR refactors dataset consistency checks into a new DatasetAnalysis class and updates SmartDrift.compile() to use it, including a new split strategy to address issue #60 (AUC < 0.5 on small identical datasets).

Changes:

Added DatasetAnalysis (schema/type/modality/precision checks + dataset cleaning) and integrated it into SmartDrift.
Added utilities cat_features_indices() and train_test_split_concat() and switched SmartDrift training/test creation to the new split approach.
Updated report generation and tests accordingly; constrained dependencies to pandas<3.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
`eurybia/core/dataset_analysis.py`	New DatasetAnalysis implementation and cleaning helpers.
`eurybia/core/smartdrift.py`	Uses DatasetAnalysis, new split/concat workflow, persistence updates, report flag wiring.
`eurybia/utils/utils.py`	Adds `cat_features_indices` and `train_test_split_concat`; refreshes docstrings.
`eurybia/report/generation.py`	Switches consistency panel to read from `sd.da` and gates modalities section.
`pyproject.toml`	Pins pandas to `<3`.
`tests/unit_tests/core/test_dataset_analysis.py`	New unit tests for DatasetAnalysis.
`tests/unit_tests/core/test_smartdrift.py`	Removes legacy consistency tests; updates save/load assertions to use `da`.
`tests/unit_tests/utils/test_utils.py`	Adds regression test covering AUC behavior; updates imports.
`tests/integration_tests/test_integration_smartdrift.py`	Adjusts expected drift classifier threshold.

Copilot

Pull request overview

This PR refactors dataset consistency checking/cleaning into a new DatasetAnalysis class and updates SmartDrift.compile() to use it before computing drift metrics, addressing issue #60 by changing how train/test splits are performed on baseline/current datasets.

Changes:

Added DatasetAnalysis to detect schema/type/modality/precision differences and to produce cleaned datasets via clean_datasets().
Updated SmartDrift to rely on DatasetAnalysis, and changed splitting logic to split baseline/current separately before concatenation (fixing low-AUC edge cases).
Added utility helpers (cat_features_indices, train_test_split_concat), updated tests accordingly, and constrained dependencies to pandas<3.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`eurybia/core/dataset_analysis.py`	New dataset comparison + cleaning implementation used by `SmartDrift`.
`eurybia/core/smartdrift.py`	Integrates `DatasetAnalysis`, changes split/concat logic, updates save/load behavior.
`eurybia/utils/utils.py`	Adds categorical index detection and split-then-concat helper.
`eurybia/report/generation.py`	Updates consistency panel to read from `SmartDrift.da` and gates modalities output.
`pyproject.toml`	Pins pandas to `<3`.
`tests/unit_tests/core/test_dataset_analysis.py`	New unit coverage for `DatasetAnalysis`.
`tests/unit_tests/core/test_smartdrift.py`	Updates save/load assertions and adds target-column error coverage.
`tests/unit_tests/utils/test_utils.py`	Adds a regression test around AUC behavior on small datasets.
`tests/integration_tests/test_integration_smartdrift.py`	Adjusts an integration assertion threshold.

guerinclement and others added 7 commits November 20, 2025 12:12

DatasetAnalysis class

57892cd

DatasetAnalysis doc

461b38e

pandas 3.x constraint

9534bfb

dataset analysis

602e180

Merge branch 'master' into feature/refacto_dataset

d2420e3

removed commented out code

21e7797

test_compile_smartdrift_2

5aa8339

guerinclement requested review from Copilot and guillaume-vignal March 18, 2026 15:11

Copilot started reviewing on behalf of guerinclement March 18, 2026 15:12

Copilot AI reviewed Mar 18, 2026

guerinclement added 8 commits March 18, 2026 16:41

doc train_test_split_concat

54ad538

test train_test_split_concat

44cf91e

random_state for sampling

18f2997

remove useless rename

e540ac4

normalize docstrings

8d532df

target col

47b90a0

cat_features_indices simplification

9ded7ea

fix ambiguous dtype_mismatches doc

fcb3be0

guerinclement requested a review from Copilot March 18, 2026 16:21

Copilot started reviewing on behalf of guerinclement March 18, 2026 16:22

Copilot AI reviewed Mar 18, 2026

guerinclement added 5 commits March 18, 2026 17:40

format docstring

d659006

fix categorical_value_differences

c4a314f

fix test

fcbf706

target

3db804e

fix test

1c69c33

guerinclement requested a review from Copilot March 18, 2026 16:52

Copilot started reviewing on behalf of guerinclement March 18, 2026 16:52

Copilot AI reviewed Mar 18, 2026

fix review

9f95091

guerinclement requested a review from Copilot March 19, 2026 08:57

Copilot AI reviewed Mar 19, 2026

fix review

97283df

guillaume-vignal approved these changes Mar 23, 2026

guerinclement merged commit a28e65a into MAIF:master Mar 23, 2026
4 checks passed

3 participants
