pairwise difference outlier detection#7

Open
Kadrinizar wants to merge 3 commits into Karim-53:main from Kadrinizar:feature/PairwiseDifferenceOutlierDetection

Conversation

@Kadrinizar Kadrinizar commented Aug 27, 2025

Summary by Sourcery

Add a pairwise-difference-based outlier detection estimator with threshold optimization, update dependency constraints, and include a test script comparing it to IsolationForest.

New Features:

  • Implement PairwiseDifferenceOutlierDetection estimator employing pairwise feature differences and automatic F1-based threshold selection.

Build:

  • Loosen numpy version requirement, pin tqdm, and remove fastparquet from dependencies.

Tests:

  • Add unsupervised_outlier_detection script to benchmark PairwiseDifferenceOutlierDetection against IsolationForest on real datasets.

sourcery-ai Bot commented Aug 27, 2025

Reviewer's Guide

This PR adds a custom PairwiseDifferenceOutlierDetection estimator that learns on pairwise feature differences with an automatic percentile-based threshold selection optimizing F1-macro. It also adjusts dependency versions and introduces a new evaluation script to benchmark the new detector against IsolationForest.

Sequence diagram for training and threshold selection in PairwiseDifferenceOutlierDetection

sequenceDiagram
    participant User
    participant PairwiseDifferenceOutlierDetection
    participant IsolationForest
    User->>PairwiseDifferenceOutlierDetection: fit(X, y)
    PairwiseDifferenceOutlierDetection->>PairwiseDifferenceOutlierDetection: __pair_input(X, X)
    PairwiseDifferenceOutlierDetection->>IsolationForest: fit(pairwise differences)
    PairwiseDifferenceOutlierDetection->>IsolationForest: score_samples(pairwise differences)
    PairwiseDifferenceOutlierDetection->>PairwiseDifferenceOutlierDetection: Select threshold maximizing F1-macro
    PairwiseDifferenceOutlierDetection-->>User: return self

Sequence diagram for anomaly prediction in PairwiseDifferenceOutlierDetection

sequenceDiagram
    participant User
    participant PairwiseDifferenceOutlierDetection
    participant IsolationForest
    User->>PairwiseDifferenceOutlierDetection: predict(X)
    PairwiseDifferenceOutlierDetection->>PairwiseDifferenceOutlierDetection: score_samples(X)
    PairwiseDifferenceOutlierDetection->>IsolationForest: score_samples(pairwise differences to train set)
    PairwiseDifferenceOutlierDetection->>PairwiseDifferenceOutlierDetection: Compare scores to threshold_
    PairwiseDifferenceOutlierDetection-->>User: return predicted labels

Class diagram for PairwiseDifferenceOutlierDetection estimator

classDiagram
    class PairwiseDifferenceOutlierDetection {
        +estimator
        +classifier
        +X_train
        +threshold_
        +percentile_
        +__init__(estimator=None)
        +fit(X, y)
        +predict(X)
        +score_samples(X)
        +decision_function(X)
        +__pair_input(X1, X2)
    }
    PairwiseDifferenceOutlierDetection --|> BaseEstimator
    PairwiseDifferenceOutlierDetection --|> OutlierMixin
    PairwiseDifferenceOutlierDetection o-- IsolationForest : uses as default estimator

File-Level Changes

Change: Implement PairwiseDifferenceOutlierDetection estimator
Details:
  • Add new estimator class with fit, predict, score_samples, and decision_function methods following sklearn API
  • Generate pairwise feature differences via cross-join and compute mean anomaly scores
  • Automatically search percentiles (1–99) on training data to maximize F1-macro and store optimal threshold
  • Include private __pair_input method to build and return both raw and symmetric difference features
Files: pdll/_pairwise.py

Change: Adjust dependency version specifications
Details:
  • Change numpy specifier from exact (==1.26.4) to flexible (~=1.26.4)
  • Pin tqdm to version ~=4.66.2
  • Remove fastparquet from requirements
Files: requirements.txt

Change: Add evaluation script for unsupervised outlier detection
Details:
  • Create tests/unsupervised_outlier_detection.py to benchmark IsolationForest vs PairwiseDifferenceOutlierDetection
  • Implement dataset loading, preprocessing, train/test splits, and class balancing
  • Compute and aggregate metrics (F1-macro, AUC-ROC, AUC-PR) for both methods
Files: tests/unsupervised_outlier_detection.py

Comment thread pdll/_pairwise.py Outdated

@sourcery-ai sourcery-ai Bot left a comment

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `pdll/_pairwise.py:1037` </location>
<code_context>
+    self.threshold_ = best_threshold
+    self.percentile_ = best_percentile
+
+    print(f" Best percentile selected: {self.percentile_} with F1-macro: {best_f1:.4f}")
+
+    return self
</code_context>

<issue_to_address>
Use of print statement for logging may not be appropriate in production code.

Replace the print statement with a logging call for improved output management and integration.

Suggested implementation:

```python
    import logging
    logging.info(f"Best percentile selected: {self.percentile_} with F1-macro: {best_f1:.4f}")

```

If the file already imports `logging` at the top, you can omit the `import logging` line in the replacement. For best practice, ensure logging is configured somewhere in your codebase (e.g., with `logging.basicConfig(level=logging.INFO)` in your main entrypoint).
</issue_to_address>
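
One possible shape for the logging change in Comment 1 is a module-level logger rather than a call on the root logger. This is a minimal sketch, not the PR's code: `report_best_percentile` is a hypothetical helper, and the values passed in the demo are illustrative placeholders, not results from the PR.

```python
import logging

# Module-level logger; its name follows the importing module's path.
logger = logging.getLogger(__name__)


def report_best_percentile(percentile, f1_macro):
    """Hypothetical helper: log the selected percentile instead of printing it."""
    # %-style arguments defer string formatting until a handler emits the record.
    logger.info("Best percentile selected: %s with F1-macro: %.4f", percentile, f1_macro)


if __name__ == "__main__":
    # Configure logging once, in the entry point, not inside the library code.
    logging.basicConfig(level=logging.INFO)
    report_best_percentile(37, 0.7123)  # placeholder values for illustration
```

With this layout the library stays silent unless the caller configures logging, which is usually what a scikit-learn style estimator wants.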

### Comment 2
<location> `pdll/_pairwise.py:1093` </location>
<code_context>
+    Returns:
+    - tuple: (paired feature differences, symmetric feature differences)
+    """
+    X_pair = X1.merge(X2, how="cross")
+
+    # Extract and rename columns for difference calculation
</code_context>

<issue_to_address>
Cross-join for pairwise differences may have high memory and performance cost.

For large inputs, this approach may cause memory issues or slowdowns. Consider implementing checks for input size or using a more efficient method.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    X_pair = X1.merge(X2, how="cross")

    # Extract and rename columns for difference calculation
=======
    # Check for large input sizes before performing cross-join
    max_cross_rows = 1_000_000  # threshold for warning/error
    expected_rows = len(X1) * len(X2)
    if expected_rows > max_cross_rows:
        raise ValueError(
            f"PairwiseDifference: Cross-join would result in {expected_rows} rows, "
            "which may cause memory or performance issues. "
            "Consider using smaller inputs or a more efficient method."
        )

    X_pair = X1.merge(X2, how="cross")

    # Extract and rename columns for difference calculation
>>>>>>> REPLACE

</suggested_fix>
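
As an alternative to rejecting large inputs, the "more efficient method" mentioned in Comment 2 could be sketched with NumPy broadcasting, processed in chunks so that only part of the cross product is in memory at a time. This is an assumption-laden sketch: `pairwise_differences` and `chunk_size` are hypothetical names, and the helper produces only the difference columns, not the `_x`/`_y` columns that the PR's cross-join also keeps.

```python
import numpy as np


def pairwise_differences(X1, X2, chunk_size=1024):
    """Yield blocks of row-wise differences X1[i] - X2[j] without a full cross-join.

    Hypothetical helper, not part of the PR. Each yielded block has shape
    (rows_in_chunk * len(X2), n_features) and is ordered like a cross join:
    all rows of X2 for the first row of X1, then for the second, and so on.
    """
    # Conversion to float raises ValueError for non-numeric input.
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    for start in range(0, len(X1), chunk_size):
        block = X1[start:start + chunk_size]
        # Broadcasting: (m, 1, d) - (1, n, d) -> (m, n, d)
        diff = block[:, None, :] - X2[None, :, :]
        yield diff.reshape(-1, X2.shape[1])


if __name__ == "__main__":
    X = np.array([[0.0, 1.0], [2.0, 5.0], [4.0, 4.0]])
    for block in pairwise_differences(X, X, chunk_size=2):
        print(block.shape)
```

Each block could be scored separately and the scores concatenated, so peak memory is bounded by `chunk_size * len(X2)` rows instead of `len(X1) * len(X2)`.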

### Comment 3
<location> `pdll/_pairwise.py:1099` </location>
<code_context>
+    x1_pair = X_pair[[f'{col}_x' for col in X1.columns]].rename(columns={f'{col}_x': f'{col}_diff' for col in X1.columns})
+    x2_pair = X_pair[[f'{col}_y' for col in X1.columns]].rename(columns={f'{col}_y': f'{col}_diff' for col in X1.columns})
+
+    try:
+        calculate_difference = x1_pair - x2_pair
+    except Exception as e:
</code_context>

<issue_to_address>
Exception handling for non-numeric data is broad.

Catching only relevant exceptions like TypeError will help ensure other errors are not inadvertently ignored.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    try:
        calculate_difference = x1_pair - x2_pair
    except Exception as e:
        raise ValueError("PairwiseDifference: Non-numeric data found.") from e
=======
    try:
        calculate_difference = x1_pair - x2_pair
    except TypeError as e:
        raise ValueError("PairwiseDifference: Non-numeric data found.") from e
>>>>>>> REPLACE

</suggested_fix>

### Comment 4
<location> `pdll/_pairwise.py:1105` </location>
<code_context>
+        raise ValueError("PairwiseDifference: Non-numeric data found.") from e
+
+    # Concatenate the original cross-join and calculated differences
+    X_pair = pd.concat([X_pair, calculate_difference], axis='columns')
+
+
</code_context>

<issue_to_address>
Concatenation of DataFrames may result in duplicate columns or misalignment.

Check for overlapping column names before concatenation and validate the output DataFrame's structure and columns.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    # Concatenate the original cross-join and calculated differences
    X_pair = pd.concat([X_pair, calculate_difference], axis='columns')
=======
    # Check for overlapping column names before concatenation
    overlap = set(X_pair.columns) & set(calculate_difference.columns)
    if overlap:
        raise ValueError(f"PairwiseDifference: Overlapping column names detected during concatenation: {overlap}")

    # Concatenate the original cross-join and calculated differences
    X_pair = pd.concat([X_pair, calculate_difference], axis='columns')

    # Validate that the concatenated DataFrame has no duplicate column names
    if X_pair.columns.duplicated().any():
        raise ValueError("PairwiseDifference: Duplicate columns found in the output DataFrame after concatenation.")
>>>>>>> REPLACE

</suggested_fix>

### Comment 5
<location> `pdll/_pairwise.py:1113` </location>
<code_context>
+    x1_pair_sym = X_pair[[f'{col}_y' for col in X1.columns]].rename(columns={f'{col}_y': f'{col}_x' for col in X1.columns})
+    X_pair_sym = pd.concat([x1_pair_sym, x2_pair_sym, x2_pair - x1_pair], axis='columns')
+
+    return X_pair, X_pair_sym
+
+    
</code_context>

<issue_to_address>
Returned symmetric feature differences are not used elsewhere in the class.

Consider removing symmetric feature differences from the return value if they are not needed, to keep the interface clean.

Suggested implementation:

```python
    return X_pair
```
</issue_to_address>

### Comment 6
<location> `tests/unsupervised_outlier_detection.py:72` </location>
<code_context>
+    X_sampled = df_sampled.drop(columns=['class']).values
+    y_sampled = df_sampled['class'].values
+
+    # Train/Test Split (without random_state, fully random split)
+    X_train, X_test, y_train, y_test = train_test_split(X_sampled, y_sampled, test_size=0.4)
+    assert len(np.unique(y_train)) == 2
+    assert len(np.unique(y_test)) == 2
</code_context>

<issue_to_address>
Test split does not use a fixed random_state, which may lead to non-reproducible results.

Set random_state in train_test_split for consistent and reproducible test splits.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    # Train/Test Split (without random_state, fully random split)
    X_train, X_test, y_train, y_test = train_test_split(X_sampled, y_sampled, test_size=0.4)
=======
    # Train/Test Split (with fixed random_state for reproducibility)
    X_train, X_test, y_train, y_test = train_test_split(X_sampled, y_sampled, test_size=0.4, random_state=42)
>>>>>>> REPLACE

</suggested_fix>

### Comment 7
<location> `tests/unsupervised_outlier_detection.py:77` </location>
<code_context>
+    assert len(np.unique(y_train)) == 2
+    assert len(np.unique(y_test)) == 2
+
+    # Baseline: IsolationForest
+    iso_forest = IsolationForest()
+    iso_forest.fit(X_train)
+    y_pred_iso = np.where(iso_forest.predict(X_test) == -1, 1, 0)
+
</code_context>

<issue_to_address>
No test for error handling when input data contains non-numeric values.

Add a test with non-numeric input to confirm that ValueError is raised by PairwiseDifferenceOutlierDetection.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    assert len(np.unique(y_train)) == 2
    assert len(np.unique(y_test)) == 2
=======
    assert len(np.unique(y_train)) == 2
    assert len(np.unique(y_test)) == 2

    # Error handling test: non-numeric input for PairwiseDifferenceOutlierDetection
    import pytest

    # The estimator pairs rows with DataFrame.merge, so pass a DataFrame
    X_non_numeric = pd.DataFrame([['a', 'b'], ['c', 'd']])
    y_dummy = [0, 1]

    # The class is defined in pdll/_pairwise.py
    from pdll._pairwise import PairwiseDifferenceOutlierDetection

    with pytest.raises(ValueError):
        model = PairwiseDifferenceOutlierDetection()
        model.fit(X_non_numeric, y_dummy)
>>>>>>> REPLACE

</suggested_fix>

### Comment 8
<location> `tests/unsupervised_outlier_detection.py:96` </location>
<code_context>
+        "AUC_PR": auc_pr_iso
+    })
+
+    # Train PairwiseDifferenceOutlierDetection
+    model = PairwiseDifferenceOutlierDetection()
+
+    # >>> Threshold is optimized inside 'fit()' based on mean anomaly scores
+    # >>> Threshold is found on X_train_pair with best F1 macro (based on y_train)
+    model.fit(pd.DataFrame(X_train), y_train)   # Important "i did it already in PairwiseDifferenceOutlierDetection": Pass X_train_Pair as DataFrame
+
+    # Predict using the best found threshold 
</code_context>

<issue_to_address>
No test for estimator behavior when all samples are identical.

Add a test with identical input samples to verify the estimator's robustness and correctness in this edge case.
</issue_to_address>
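
To see why the identical-samples case in Comment 8 deserves a test, consider what the percentile search receives. The sketch below uses a constant array as a stand-in for the estimator's mean scores (identical rows produce identical pair rows, so any deterministic scorer returns a single value) and applies the labelling rule `np.where(mean_scores < threshold, 0, 1)` from the PR's threshold search. The function name and the score value are hypothetical.

```python
import numpy as np


def labels_for_constant_scores(n_samples=20, score=-0.5):
    """Apply every candidate percentile threshold to a constant score vector."""
    mean_scores = np.full(n_samples, score)  # stand-in for identical samples
    thresholds = np.percentile(mean_scores, np.arange(1, 100))
    # One row of labels per candidate threshold
    labels = np.array([np.where(mean_scores < t, 0, 1) for t in thresholds])
    return thresholds, labels


if __name__ == "__main__":
    thresholds, labels = labels_for_constant_scores()
    print(np.unique(thresholds).size)  # number of distinct thresholds
    print(np.unique(labels))           # distinct labels across all thresholds
```

Every percentile collapses to the same threshold and no score is strictly below it, so all samples receive the same label and the F1 search cannot separate candidates. A unit test could assert that `fit` still completes and that `predict` returns a constant vector in this situation.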

### Comment 9
<location> `tests/unsupervised_outlier_detection.py:107` </location>
<code_context>
+    # >>> Prediction is made on X_test by applying the selected best threshold from training
+    y_pred_pdl = model.predict(pd.DataFrame(X_test))
+
+    # Metrics for PairwiseDifferenceOutlierDetection
+    f1_pdl = f1_score(y_test, y_pred_pdl, average='macro')
+    auc_roc_pdl = roc_auc_score(y_test, y_pred_pdl)
+    auc_pr_pdl = average_precision_score(y_test, y_pred_pdl)
+
+    all_results.append({
</code_context>

<issue_to_address>
No test for estimator behavior when input contains NaN or infinite values.

Add a test with NaN or infinite values in the input to ensure the estimator responds correctly.

Suggested implementation:

```python
    model = PairwiseDifferenceOutlierDetection()

    # >>> Threshold is optimized inside 'fit()' based on mean anomaly scores
    # >>> Threshold is found on X_train_pair with best F1 macro (based on y_train)
    model.fit(pd.DataFrame(X_train), y_train)   # Important "i did it already in PairwiseDifferenceOutlierDetection": Pass X_train_Pair as DataFrame

    # Predict using the best found threshold 
    # >>> Prediction is made on X_test by applying the selected best threshold from training
    y_pred_pdl = model.predict(pd.DataFrame(X_test))

    # Test estimator behavior with NaN and infinite values in input
    import numpy as np
    import pytest

    X_test_nan = pd.DataFrame(X_test.copy())
    X_test_nan.iloc[0, 0] = np.nan
    X_test_inf = pd.DataFrame(X_test.copy())
    X_test_inf.iloc[0, 0] = np.inf

    # Check for NaN
    with pytest.raises(ValueError):
        model.predict(X_test_nan)

    # Check for infinite
    with pytest.raises(ValueError):
        model.predict(X_test_inf)

```

- Ensure that `pytest` is available in your test environment.
- If the estimator does not currently raise a `ValueError` for NaN or infinite values, you may need to update the estimator implementation to do so.
- If you use a different test framework, adjust the exception assertion accordingly.
</issue_to_address>


Co-authored-by: Karim-53 <mohamedkarim.belaid@supcom.tn>
Comment thread pdll/_pairwise.py
Comment on lines +50 to +52
# PairwiseDifferenceOutlierDetection class: Author Nizar Kadri <nizar.kadri@campus.lmu.de>
# License: Apache-2.0 clause

Owner

move to line 2

Comment thread pdll/_pairwise.py
Comment on lines +7 to +8
import seaborn as sns

Owner

Don't import modules that are not used -.-'
The same applies to the other .py file.
You can check this with VS Code or with Copilot.

Comment thread pdll/_pairwise.py
self.estimator = estimator if estimator is not None else IsolationForest()
self.classifier = self.estimator

def fit(self, X, y):

Owner

Since this is unsupervised outlier detection, y=None should be the default option.

Author

Since this is unsupervised outlier detection, y=None should be the default option:

    if y is not None:
        # Supervised mode: F1 optimization
        candidate_percentiles = np.arange(1, 100, 1)
        best_f1, best_threshold, best_percentile = -np.inf, None, None

        for p in candidate_percentiles:
            threshold = np.percentile(mean_scores, p)
            y_pred = np.where(mean_scores < threshold, 0, 1)
            f1 = f1_score(y, y_pred, average='macro')
            if f1 > best_f1:
                best_f1, best_threshold, best_percentile = f1, threshold, p

        self.threshold_ = best_threshold
        self.percentile_ = best_percentile
        print(f"[supervised] Best percentile={best_percentile}, F1={best_f1:.4f}")
    else:
        # Unsupervised mode: threshold derived from contamination
        perc = 100 * contamination
        self.threshold_ = np.percentile(mean_scores, perc)
        self.percentile_ = perc
        print(f"[unsupervised] Threshold set at {perc:.1f}th percentile")

    return self

I have extended the method so that it also works in a fully unsupervised manner —
that is, without using any label information.
However, it turned out that the results (F1, AUC) became worse,
since F1-based threshold tuning is no longer possible.
I am currently working on testing more adaptive thresholding strategies
to improve the performance of the unsupervised approach.
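
The contamination-based branch can be illustrated in isolation. This is a sketch under assumptions: `contamination_threshold` is a hypothetical helper, the scores are synthetic, and it assumes that lower scores mean "more abnormal", as with scikit-learn's `score_samples`. Where `contamination` comes from in the actual estimator is not shown in the PR excerpt.

```python
import numpy as np


def contamination_threshold(mean_scores, contamination=0.1):
    """Return the score below which roughly `contamination` of samples fall.

    Assumes lower scores mean "more abnormal".
    """
    if not 0.0 < contamination <= 0.5:
        raise ValueError("contamination must be in (0, 0.5]")
    return np.percentile(mean_scores, 100.0 * contamination)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for the estimator's mean pairwise scores
    scores = rng.normal(loc=-0.5, scale=0.05, size=1000)
    threshold = contamination_threshold(scores, contamination=0.1)
    flagged = scores < threshold
    print(f"flagged fraction: {flagged.mean():.3f}")
```

Note that this rule flags a fixed share of the training set regardless of how many true outliers exist, and it only behaves as intended if the score direction matches the assumption above.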

Dataset 1: celeba_baldvsnonbald_normalised.csv
[unsupervised] Threshold set at 10.0th percentile
Dataset 2: bank-additional-full_normalised.csv
[unsupervised] Threshold set at 10.0th percentile
Summary:

|   | Dataset | Method | Percentile | F1_macro | AUC_ROC | AUC_PR |
|---|---|---|---|---|---|---|
| 0 | Dataset 1 | IsolationForest | baseline | 0.498930 | 0.601691 | 0.074599 |
| 1 | Dataset 1 | PairwiseDifferenceOutlierDetection | 10.0 | 0.106228 | 0.350866 | 0.045190 |
| 2 | Dataset 2 | IsolationForest | baseline | 0.620101 | 0.631384 | 0.133820 |
| 3 | Dataset 2 | PairwiseDifferenceOutlierDetection | 10.0 | 0.114761 | 0.353317 | 0.054382 |

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import IsolationForest
from sklearn.datasets import fetch_openml

Owner

try to shorten the imports to what you really need

# Dictionary of dataset file paths
file_paths = {
1: "/datasets/celeba_baldvsnonbald_normalised.csv",
2: "/datasets/census-income-full-mixed-binarized.csv",

Owner

this is still missing in the repo (you did not upload it)

Comment thread requirements.txt
Comment on lines -1 to +10
- numpy==1.26.4
+ numpy~=1.26.4
  pandas~=2.2.0
  scikit-learn~=1.3.2
- tqdm
+ tqdm~=4.66.2
  openml~=0.14.2
  psutil~=5.9.8
  joblib~=1.2.0
  scipy~=1.11.4
- fastparquet
+ fastparquet~=2024.2.0

Owner

no. revert changes

Comment thread pdll/_pairwise.py
self.estimator.fit(X_train_pair)

# Calculate anomaly scores on the training pairs
scores = abs(self.estimator.score_samples(X_train_pair))

Owner

don’t take abs, keep raw scores, and treat lower as more abnormal. Also, since you subclass OutlierMixin, align with scikit-learn convention: predict should return 1 for inliers and -1 for outliers.

@Karim-53 Karim-53 Aug 27, 2025

Owner

- scores = abs(self.estimator.score_samples(X_train_pair))
+ scores = self.estimator.score_samples(X_train_pair)
  score_df = pd.DataFrame(scores.reshape((-1, len(self.X_train))))
  mean_scores = score_df.mean(axis=1).to_numpy()

- y_pred = np.where(mean_scores < threshold, 0, 1)
+ # 1 = inlier, -1 = outlier to match sklearn
+ y_pred = np.where(mean_scores < threshold, -1, 1)

Owner

def predict(self, X):
-    scores = self.score_samples(X)
-    return np.where(scores < self.threshold_, 0, 1)
+    scores = self.score_samples(X)
+    return np.where(scores < self.threshold_, -1, 1)
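
The two points in this thread (keep raw scores, return -1/1) can be checked on a few synthetic numbers. The sketch below is illustrative only: the helper names are hypothetical and the scores are made up, chosen to be negative with lower meaning more abnormal, which is how `IsolationForest.score_samples` behaves.

```python
import numpy as np


def to_sklearn_labels(scores, threshold):
    """Map scores to scikit-learn outlier labels: -1 = outlier, 1 = inlier."""
    return np.where(np.asarray(scores) < threshold, -1, 1)


def to_binary_targets(labels):
    """Convert -1/1 labels to 1/0 (1 = outlier) for metrics on 0/1 ground truth."""
    return np.where(np.asarray(labels) == -1, 1, 0)


if __name__ == "__main__":
    raw = np.array([-0.70, -0.45, -0.42, -0.40])  # synthetic; lower = more abnormal
    print(to_sklearn_labels(raw, threshold=-0.5))  # [-1  1  1  1]
    # abs() reverses the ordering: the most abnormal sample now has the largest value
    print(np.argmin(raw), np.argmin(np.abs(raw)))  # 0 3
```

The second helper mirrors the conversion the benchmark script already applies to the IsolationForest baseline, so both methods can be scored against the same 0/1 labels.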

Comment thread pdll/_pairwise.py
- Pairs all training samples (X_train × X_train)
- Trains the Isolation Forest on pairwise differences
- Searches for the best percentile threshold (1% to 99%) that maximizes F1-macro score
"""

Owner

set common sklearn attrs:

+ from sklearn.utils.validation import check_array, check_is_fitted
+ # Capture column names first: check_array returns a plain ndarray
+ self.feature_names_in_ = getattr(X, "columns", None)
+ X = check_array(X, force_all_finite="allow-nan")
+ self.n_features_in_ = X.shape[1]

Comment thread pdll/_pairwise.py
Comment on lines +1004 to +1009
self.estimator.fit(X)
self.X_train = X

# Create pairwise differences for training
X_train_pair, _ = self.__pair_input(X, X)
self.estimator.fit(X_train_pair)

Owner

why two .fit() ?????

Comment on lines +31 to +32
import warnings
warnings.filterwarnings('ignore')

Owner

Suggested change
- import warnings
- warnings.filterwarnings('ignore')

Comment thread pdll/_pairwise.py
- estimator: A scikit-learn compatible estimator (default is IsolationForest).
"""
self.estimator = estimator if estimator is not None else IsolationForest()
self.classifier = self.estimator

Owner

This is never used; delete it.

Suggested change
- self.classifier = self.estimator

Comment thread pdll/_pairwise.py
score_df = pd.DataFrame(scores.reshape((-1, len(self.X_train))))
return score_df.mean(axis=1).to_numpy()

def decision_function(self, X):

Owner

Make it consistent with scikit-learn: positive = more normal.
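
A common way to get "positive = more normal" is to shift the raw scores by the fitted threshold, which is how scikit-learn's IsolationForest relates `decision_function` to `score_samples`. The class below is a hypothetical stand-in that shows only the sign convention; its `score_samples` takes precomputed scores, whereas the real method would compute mean pairwise scores from `X`.

```python
import numpy as np


class ThresholdedScorer:
    """Hypothetical stand-in illustrating the sign convention only."""

    def __init__(self, threshold):
        self.threshold_ = threshold

    def score_samples(self, scores):
        # Stand-in: pass precomputed scores through (lower = more abnormal)
        return np.asarray(scores, dtype=float)

    def decision_function(self, scores):
        # Shift so that 0 is the decision boundary; positive = inlier side
        return self.score_samples(scores) - self.threshold_

    def predict(self, scores):
        # -1 = outlier, 1 = inlier
        return np.where(self.decision_function(scores) < 0, -1, 1)


if __name__ == "__main__":
    scorer = ThresholdedScorer(threshold=-0.5)
    print(scorer.decision_function([-0.7, -0.4]))
    print(scorer.predict([-0.7, -0.4]))
```

Defining `predict` through `decision_function` keeps the two methods from drifting apart when the threshold logic changes.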

Comment on lines +121 to +124
# Save all results
df_results = pd.DataFrame(all_results)
print("Summary:")
print(df_results)

Owner

Add an assert on the expected performance, for example:
assert F1_macro > 0.7 # the exact value should be around 0.71

@@ -0,0 +1,124 @@
# Data manipulation and analysis

Owner

Rename this file to run_benchmark_outlier_detection.py and move it to the repository root.

Then add a simple example in the examples/ folder and, if you want, some unit tests in tests/.

@Karim-53

Owner

@codex propose a fix to the failing github actions

@chatgpt-codex-connector


@codex propose a fix to the failing github actions

For now, I can only help with PRs you've created.
