pairwise difference outlier detection by Kadrinizar · Pull Request #7 · Karim-53/pdll

Kadrinizar · 2025-08-27T14:22:50Z

Summary by Sourcery

Add a pairwise-difference-based outlier detection estimator with threshold optimization, update dependency constraints, and include a test script comparing it to IsolationForest.

New Features:

Implement PairwiseDifferenceOutlierDetection estimator employing pairwise feature differences and automatic F1-based threshold selection.

Build:

Loosen numpy version requirement, pin tqdm, and remove fastparquet from dependencies.

Tests:

Add unsupervised_outlier_detection script to benchmark PairwiseDifferenceOutlierDetection against IsolationForest on real datasets.

sourcery-ai · 2025-08-27T14:23:03Z

Reviewer's Guide

This PR adds a custom PairwiseDifferenceOutlierDetection estimator that learns on pairwise feature differences with an automatic percentile-based threshold selection optimizing F1-macro. It also adjusts dependency versions and introduces a new evaluation script to benchmark the new detector against IsolationForest.

Sequence diagram for training and threshold selection in PairwiseDifferenceOutlierDetection

sequenceDiagram
    participant User
    participant PairwiseDifferenceOutlierDetection
    participant IsolationForest
    User->>PairwiseDifferenceOutlierDetection: fit(X, y)
    PairwiseDifferenceOutlierDetection->>PairwiseDifferenceOutlierDetection: __pair_input(X, X)
    PairwiseDifferenceOutlierDetection->>IsolationForest: fit(pairwise differences)
    PairwiseDifferenceOutlierDetection->>IsolationForest: score_samples(pairwise differences)
    PairwiseDifferenceOutlierDetection->>PairwiseDifferenceOutlierDetection: Select threshold maximizing F1-macro
    PairwiseDifferenceOutlierDetection-->>User: return self

Sequence diagram for anomaly prediction in PairwiseDifferenceOutlierDetection

sequenceDiagram
    participant User
    participant PairwiseDifferenceOutlierDetection
    participant IsolationForest
    User->>PairwiseDifferenceOutlierDetection: predict(X)
    PairwiseDifferenceOutlierDetection->>PairwiseDifferenceOutlierDetection: score_samples(X)
    PairwiseDifferenceOutlierDetection->>IsolationForest: score_samples(pairwise differences to train set)
    PairwiseDifferenceOutlierDetection->>PairwiseDifferenceOutlierDetection: Compare scores to threshold_
    PairwiseDifferenceOutlierDetection-->>User: return predicted labels
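
The fit and predict flows in the two diagrams above can be approximated in a few lines. This is a minimal sketch under stated assumptions, not the PR's implementation: it uses NumPy broadcasting instead of the pandas cross-join, keeps only the difference features (the PR's `__pair_input` also keeps the original paired columns), and the helper names `pairwise_differences` and `mean_pair_scores` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def pairwise_differences(A, B):
    """Return an array of shape (len(A) * len(B), n_features) holding A[i] - B[j]."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    diffs = A[:, None, :] - B[None, :, :]  # shape (len(A), len(B), n_features)
    return diffs.reshape(-1, A.shape[1])


def mean_pair_scores(estimator, X, X_train):
    """Average the estimator's score over all pairs (x, x_train) for each row x of X."""
    scores = estimator.score_samples(pairwise_differences(X, X_train))
    return scores.reshape(len(X), len(X_train)).mean(axis=1)


# Toy data, for illustration only.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 3))

# "fit": train the base estimator on all training pairs.
forest = IsolationForest(random_state=0).fit(pairwise_differences(X_train, X_train))

# "score_samples": one mean score per sample, lower meaning more abnormal.
train_scores = mean_pair_scores(forest, X_train, X_train)
```

A threshold on `train_scores` then turns the mean scores into labels, which is the step the second diagram calls "Compare scores to threshold_".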

Class diagram for PairwiseDifferenceOutlierDetection estimator

classDiagram
    class PairwiseDifferenceOutlierDetection {
        +estimator
        +classifier
        +X_train
        +threshold_
        +percentile_
        +__init__(estimator=None)
        +fit(X, y)
        +predict(X)
        +score_samples(X)
        +decision_function(X)
        +__pair_input(X1, X2)
    }
    PairwiseDifferenceOutlierDetection --|> BaseEstimator
    PairwiseDifferenceOutlierDetection --|> OutlierMixin
    PairwiseDifferenceOutlierDetection o-- IsolationForest : uses as default estimator

File-Level Changes

Change	Details	Files
Implement PairwiseDifferenceOutlierDetection estimator	Add new estimator class with fit, predict, score_samples, and decision_function methods following sklearn API; generate pairwise feature differences via cross-join and compute mean anomaly scores; automatically search percentiles (1–99) on training data to maximize F1-macro and store optimal threshold; include private __pair_input method to build and return both raw and symmetric difference features	`pdll/_pairwise.py`
Adjust dependency version specifications	Change numpy specifier from exact (==1.26.4) to flexible (~=1.26.4); pin tqdm to version ~=4.66.2; remove fastparquet from requirements	`requirements.txt`
Add evaluation script for unsupervised outlier detection	Create tests/unsupervised_outlier_detection.py to benchmark IsolationForest vs PairwiseDifferenceOutlierDetection; implement dataset loading, preprocessing, train/test splits, and class balancing; compute and aggregate metrics (F1-macro, AUC-ROC, AUC-PR) for both methods	`tests/unsupervised_outlier_detection.py`

sourcery-ai

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents

Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `pdll/_pairwise.py:1037` </location>
<code_context>
+    self.threshold_ = best_threshold
+    self.percentile_ = best_percentile
+
+    print(f" Best percentile selected: {self.percentile_} with F1-macro: {best_f1:.4f}")
+
+    return self
</code_context>

<issue_to_address>
Use of print statement for logging may not be appropriate in production code.

Replace the print statement with a logging call for improved output management and integration.

Suggested implementation:

```python
    import logging
    logging.info(f"Best percentile selected: {self.percentile_} with F1-macro: {best_f1:.4f}")

```

If the file already imports `logging` at the top, you can omit the `import logging` line in the replacement. For best practice, ensure logging is configured somewhere in your codebase (e.g., with `logging.basicConfig(level=logging.INFO)` in your main entrypoint).
</issue_to_address>
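
One common way to apply this suggestion is a module-level logger rather than calls on the root `logging` module. The snippet below is a small sketch of that pattern; the logger name `pdll.pairwise` and the helper `report_best_percentile` are assumptions made for illustration and do not appear in the PR.

```python
import logging

# Hypothetical logger name; a module would normally use logging.getLogger(__name__).
logger = logging.getLogger("pdll.pairwise")


def report_best_percentile(percentile, best_f1):
    """Log the selected percentile instead of printing it."""
    # Lazy %-style formatting: the message is only built if INFO is enabled.
    logger.info("Best percentile selected: %s with F1-macro: %.4f", percentile, best_f1)
```

With this pattern the library never configures handlers itself; the calling application decides whether the message is shown.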

### Comment 2
<location> `pdll/_pairwise.py:1093` </location>
<code_context>
+    Returns:
+    - tuple: (paired feature differences, symmetric feature differences)
+    """
+    X_pair = X1.merge(X2, how="cross")
+
+    # Extract and rename columns for difference calculation
</code_context>

<issue_to_address>
Cross-join for pairwise differences may have high memory and performance cost.

For large inputs, this approach may cause memory issues or slowdowns. Consider implementing checks for input size or using a more efficient method.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    X_pair = X1.merge(X2, how="cross")

    # Extract and rename columns for difference calculation
=======
    # Check for large input sizes before performing cross-join
    max_cross_rows = 1_000_000  # threshold for warning/error
    expected_rows = len(X1) * len(X2)
    if expected_rows > max_cross_rows:
        raise ValueError(
            f"PairwiseDifference: Cross-join would result in {expected_rows} rows, "
            "which may cause memory or performance issues. "
            "Consider using smaller inputs or a more efficient method."
        )

    X_pair = X1.merge(X2, how="cross")

    # Extract and rename columns for difference calculation
>>>>>>> REPLACE

</suggested_fix>

### Comment 3
<location> `pdll/_pairwise.py:1099` </location>
<code_context>
+    x1_pair = X_pair[[f'{col}_x' for col in X1.columns]].rename(columns={f'{col}_x': f'{col}_diff' for col in X1.columns})
+    x2_pair = X_pair[[f'{col}_y' for col in X1.columns]].rename(columns={f'{col}_y': f'{col}_diff' for col in X1.columns})
+
+    try:
+        calculate_difference = x1_pair - x2_pair
+    except Exception as e:
</code_context>

<issue_to_address>
Exception handling for non-numeric data is broad.

Catching only relevant exceptions like TypeError will help ensure other errors are not inadvertently ignored.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    try:
        calculate_difference = x1_pair - x2_pair
    except Exception as e:
        raise ValueError("PairwiseDifference: Non-numeric data found.") from e
=======
    try:
        calculate_difference = x1_pair - x2_pair
    except TypeError as e:
        raise ValueError("PairwiseDifference: Non-numeric data found.") from e
>>>>>>> REPLACE

</suggested_fix>

### Comment 4
<location> `pdll/_pairwise.py:1105` </location>
<code_context>
+        raise ValueError("PairwiseDifference: Non-numeric data found.") from e
+
+    # Concatenate the original cross-join and calculated differences
+    X_pair = pd.concat([X_pair, calculate_difference], axis='columns')
+
+
</code_context>

<issue_to_address>
Concatenation of DataFrames may result in duplicate columns or misalignment.

Check for overlapping column names before concatenation and validate the output DataFrame's structure and columns.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    # Concatenate the original cross-join and calculated differences
=======
    # Check for overlapping column names before concatenation
    overlap = set(X_pair.columns) & set(calculate_difference.columns)
    if overlap:
        raise ValueError(f"PairwiseDifference: Overlapping column names detected during concatenation: {overlap}")

    # Concatenate the original cross-join and calculated differences
    X_pair = pd.concat([X_pair, calculate_difference], axis='columns')

    # Validate the output DataFrame's structure and columns
    expected_columns = list(X_pair.columns)  # Adjust this as needed for your use case
    if X_pair.columns.duplicated().any():
        raise ValueError("PairwiseDifference: Duplicate columns found in the output DataFrame after concatenation.")
>>>>>>> REPLACE

</suggested_fix>

### Comment 5
<location> `pdll/_pairwise.py:1113` </location>
<code_context>
+    x1_pair_sym = X_pair[[f'{col}_y' for col in X1.columns]].rename(columns={f'{col}_y': f'{col}_x' for col in X1.columns})
+    X_pair_sym = pd.concat([x1_pair_sym, x2_pair_sym, x2_pair - x1_pair], axis='columns')
+
+    return X_pair, X_pair_sym
+
+    
</code_context>

<issue_to_address>
Returned symmetric feature differences are not used elsewhere in the class.

Consider removing symmetric feature differences from the return value if they are not needed, to keep the interface clean.

Suggested implementation:

```python

```

```python
    return X_pair

```
</issue_to_address>

### Comment 6
<location> `tests/unsupervised_outlier_detection.py:72` </location>
<code_context>
+    X_sampled = df_sampled.drop(columns=['class']).values
+    y_sampled = df_sampled['class'].values
+
+    # Train/Test Split (without random_state, fully random split)
+    X_train, X_test, y_train, y_test = train_test_split(X_sampled, y_sampled, test_size=0.4)
+    assert len(np.unique(y_train)) == 2
+    assert len(np.unique(y_test)) == 2
</code_context>

<issue_to_address>
Test split does not use a fixed random_state, which may lead to non-reproducible results.

Set random_state in train_test_split for consistent and reproducible test splits.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    # Train/Test Split (without random_state, fully random split)
    X_train, X_test, y_train, y_test = train_test_split(X_sampled, y_sampled, test_size=0.4)
=======
    # Train/Test Split (with fixed random_state for reproducibility)
    X_train, X_test, y_train, y_test = train_test_split(X_sampled, y_sampled, test_size=0.4, random_state=42)
>>>>>>> REPLACE

</suggested_fix>

### Comment 7
<location> `tests/unsupervised_outlier_detection.py:77` </location>
<code_context>
+    assert len(np.unique(y_train)) == 2
+    assert len(np.unique(y_test)) == 2
+
+    # Baseline: IsolationForest
+    iso_forest = IsolationForest()
+    iso_forest.fit(X_train)
+    y_pred_iso = np.where(iso_forest.predict(X_test) == -1, 1, 0)
+
</code_context>

<issue_to_address>
No test for error handling when input data contains non-numeric values.

Add a test with non-numeric input to confirm that ValueError is raised by PairwiseDifferenceOutlierDetection.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    assert len(np.unique(y_train)) == 2
    assert len(np.unique(y_test)) == 2
=======
    assert len(np.unique(y_train)) == 2
    assert len(np.unique(y_test)) == 2

    # Error handling test: non-numeric input for PairwiseDifferenceOutlierDetection
    import pytest

    X_non_numeric = [['a', 'b'], ['c', 'd']]
    y_dummy = [0, 1]

    from unsupervised_outlier_detection import PairwiseDifferenceOutlierDetection

    with pytest.raises(ValueError):
        model = PairwiseDifferenceOutlierDetection()
        model.fit(X_non_numeric, y_dummy)
>>>>>>> REPLACE

</suggested_fix>

### Comment 8
<location> `tests/unsupervised_outlier_detection.py:96` </location>
<code_context>
+        "AUC_PR": auc_pr_iso
+    })
+
+    # Train PairwiseDifferenceOutlierDetection
+    model = PairwiseDifferenceOutlierDetection()
+
+    # >>> Threshold is optimized inside 'fit()' based on mean anomaly scores
+    # >>> Threshold is found on X_train_pair with best F1 macro (based on y_train)
+    model.fit(pd.DataFrame(X_train), y_train)   # Important "i did it already in PairwiseDifferenceOutlierDetection": Pass X_train_Pair as DataFrame
+
+    # Predict using the best found threshold 
</code_context>

<issue_to_address>
No test for estimator behavior when all samples are identical.

Add a test with identical input samples to verify the estimator's robustness and correctness in this edge case.
</issue_to_address>

### Comment 9
<location> `tests/unsupervised_outlier_detection.py:107` </location>
<code_context>
+    # >>> Prediction is made on X_test by applying the selected best threshold from training
+    y_pred_pdl = model.predict(pd.DataFrame(X_test))
+
+    # Metrics for PairwiseDifferenceOutlierDetection
+    f1_pdl = f1_score(y_test, y_pred_pdl, average='macro')
+    auc_roc_pdl = roc_auc_score(y_test, y_pred_pdl)
+    auc_pr_pdl = average_precision_score(y_test, y_pred_pdl)
+
+    all_results.append({
</code_context>

<issue_to_address>
No test for estimator behavior when input contains NaN or infinite values.

Add a test with NaN or infinite values in the input to ensure the estimator responds correctly.

Suggested implementation:

```python
    model = PairwiseDifferenceOutlierDetection()

    # >>> Threshold is optimized inside 'fit()' based on mean anomaly scores
    # >>> Threshold is found on X_train_pair with best F1 macro (based on y_train)
    model.fit(pd.DataFrame(X_train), y_train)   # Important "i did it already in PairwiseDifferenceOutlierDetection": Pass X_train_Pair as DataFrame

    # Predict using the best found threshold 
    # >>> Prediction is made on X_test by applying the selected best threshold from training
    y_pred_pdl = model.predict(pd.DataFrame(X_test))

    # Test estimator behavior with NaN and infinite values in input
    import numpy as np
    import pytest

    X_test_nan = pd.DataFrame(X_test.copy())
    X_test_nan.iloc[0, 0] = np.nan
    X_test_inf = pd.DataFrame(X_test.copy())
    X_test_inf.iloc[0, 0] = np.inf

    # Check for NaN
    with pytest.raises(ValueError):
        model.predict(X_test_nan)

    # Check for infinite
    with pytest.raises(ValueError):
        model.predict(X_test_inf)

```

- Ensure that `pytest` is available in your test environment.
- If the estimator does not currently raise a `ValueError` for NaN or infinite values, you may need to update the estimator implementation to do so.
- If you use a different test framework, adjust the exception assertion accordingly.
</issue_to_address>

sourcery-ai · 2025-08-27T14:28:04Z

+    X_pair = X1.merge(X2, how="cross")
+
+    # Extract and rename columns for difference calculation


suggestion (performance): Cross-join for pairwise differences may have high memory and performance cost.

For large inputs, this approach may cause memory issues or slowdowns. Consider implementing checks for input size or using a more efficient method.

Suggested change

-    X_pair = X1.merge(X2, how="cross")
-
-    # Extract and rename columns for difference calculation
+    # Check for large input sizes before performing cross-join
+    max_cross_rows = 1_000_000  # threshold for warning/error
+    expected_rows = len(X1) * len(X2)
+    if expected_rows > max_cross_rows:
+        raise ValueError(
+            f"PairwiseDifference: Cross-join would result in {expected_rows} rows, "
+            "which may cause memory or performance issues. "
+            "Consider using smaller inputs or a more efficient method."
+        )
+
+    X_pair = X1.merge(X2, how="cross")
+
+    # Extract and rename columns for difference calculation
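
Besides rejecting large inputs, the "more efficient method" mentioned in this comment could be batching, so that only a slice of the pair matrix exists in memory at any time. The following is a rough sketch of that idea and not code from the PR: it works on difference features only, and `chunked_mean_pair_scores` and `batch_rows` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def chunked_mean_pair_scores(estimator, X, X_train, batch_rows=256):
    """Mean pairwise score per row of X, building at most batch_rows * len(X_train) pairs at once."""
    X = np.asarray(X, dtype=float)
    X_train = np.asarray(X_train, dtype=float)
    n_train, n_features = X_train.shape
    mean_scores = np.empty(len(X))
    for start in range(0, len(X), batch_rows):
        block = X[start:start + batch_rows]
        # Differences between every row of the block and every training row.
        diffs = (block[:, None, :] - X_train[None, :, :]).reshape(-1, n_features)
        scores = estimator.score_samples(diffs)
        mean_scores[start:start + len(block)] = scores.reshape(len(block), n_train).mean(axis=1)
    return mean_scores


# Toy data, for illustration only.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 4))
all_pairs = (X_train[:, None, :] - X_train[None, :, :]).reshape(-1, 4)
forest = IsolationForest(random_state=0).fit(all_pairs)

scores_small_batches = chunked_mean_pair_scores(forest, X_train, X_train, batch_rows=7)
scores_one_batch = chunked_mean_pair_scores(forest, X_train, X_train, batch_rows=30)
```

Batching only bounds memory during scoring; fitting on all training pairs still grows quadratically with the number of training rows, so a size check or subsampling may still be needed there.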

sourcery-ai · 2025-08-27T14:28:04Z

+    try:
+        calculate_difference = x1_pair - x2_pair
+    except Exception as e:
+        raise ValueError("PairwiseDifference: Non-numeric data found.") from e


suggestion: Exception handling for non-numeric data is broad.

Catching only relevant exceptions like TypeError will help ensure other errors are not inadvertently ignored.

Suggested change

-    try:
-        calculate_difference = x1_pair - x2_pair
-    except Exception as e:
-        raise ValueError("PairwiseDifference: Non-numeric data found.") from e
+    try:
+        calculate_difference = x1_pair - x2_pair
+    except TypeError as e:
+        raise ValueError("PairwiseDifference: Non-numeric data found.") from e
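
An alternative to catching exceptions from the subtraction is to check column dtypes up front, which also lets the error name the offending columns. This is only a sketch of that alternative; `require_numeric` is a hypothetical helper, and it relies on `pandas.api.types.is_numeric_dtype`, which treats boolean columns as numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def require_numeric(df):
    """Raise ValueError listing any non-numeric columns; return df unchanged otherwise."""
    bad_columns = [col for col in df.columns if not is_numeric_dtype(df[col])]
    if bad_columns:
        raise ValueError(
            f"PairwiseDifference: Non-numeric data found in columns {bad_columns}."
        )
    return df
```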

sourcery-ai · 2025-08-27T14:28:04Z

+    except Exception as e:
+        raise ValueError("PairwiseDifference: Non-numeric data found.") from e
+
+    # Concatenate the original cross-join and calculated differences


suggestion: Concatenation of DataFrames may result in duplicate columns or misalignment.

Check for overlapping column names before concatenation and validate the output DataFrame's structure and columns.

Suggested change

-    # Concatenate the original cross-join and calculated differences
+    # Check for overlapping column names before concatenation
+    overlap = set(X_pair.columns) & set(calculate_difference.columns)
+    if overlap:
+        raise ValueError(f"PairwiseDifference: Overlapping column names detected during concatenation: {overlap}")
+
+    # Concatenate the original cross-join and calculated differences
+    X_pair = pd.concat([X_pair, calculate_difference], axis='columns')
+
+    # Validate the output DataFrame's structure and columns
+    expected_columns = list(X_pair.columns)  # Adjust this as needed for your use case
+    if X_pair.columns.duplicated().any():
+        raise ValueError("PairwiseDifference: Duplicate columns found in the output DataFrame after concatenation.")

sourcery-ai · 2025-08-27T14:28:04Z

+    x1_pair_sym = X_pair[[f'{col}_y' for col in X1.columns]].rename(columns={f'{col}_y': f'{col}_x' for col in X1.columns})
+    X_pair_sym = pd.concat([x1_pair_sym, x2_pair_sym, x2_pair - x1_pair], axis='columns')
+
+    return X_pair, X_pair_sym


suggestion: Returned symmetric feature differences are not used elsewhere in the class.

Consider removing symmetric feature differences from the return value if they are not needed, to keep the interface clean.

Suggested implementation:

return X_pair

sourcery-ai · 2025-08-27T14:28:04Z

+    # Train/Test Split (without random_state, fully random split)
+    X_train, X_test, y_train, y_test = train_test_split(X_sampled, y_sampled, test_size=0.4)


suggestion (testing): Test split does not use a fixed random_state, which may lead to non-reproducible results.

Set random_state in train_test_split for consistent and reproducible test splits.

Suggested change

-    # Train/Test Split (without random_state, fully random split)
-    X_train, X_test, y_train, y_test = train_test_split(X_sampled, y_sampled, test_size=0.4)
+    # Train/Test Split (with fixed random_state for reproducibility)
+    X_train, X_test, y_train, y_test = train_test_split(X_sampled, y_sampled, test_size=0.4, random_state=42)
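
To illustrate the effect of a fixed seed, the sketch below splits a toy dataset twice with the same `random_state` and gets identical partitions. The `stratify=y` argument is an extra assumption not raised in the review; it keeps both classes in each split, which is what the script's two `assert` lines check after the fact.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data, for illustration only: 16 inliers and 4 outliers.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

# Same seed, same split; stratify keeps the class ratio in train and test.
split_a = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
split_b = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
```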

sourcery-ai · 2025-08-27T14:28:04Z

+    assert len(np.unique(y_train)) == 2
+    assert len(np.unique(y_test)) == 2


suggestion (testing): No test for error handling when input data contains non-numeric values.

Add a test with non-numeric input to confirm that ValueError is raised by PairwiseDifferenceOutlierDetection.

Suggested change

-    assert len(np.unique(y_train)) == 2
-    assert len(np.unique(y_test)) == 2
+    assert len(np.unique(y_train)) == 2
+    assert len(np.unique(y_test)) == 2
+
+    # Error handling test: non-numeric input for PairwiseDifferenceOutlierDetection
+    import pytest
+
+    X_non_numeric = [['a', 'b'], ['c', 'd']]
+    y_dummy = [0, 1]
+
+    from unsupervised_outlier_detection import PairwiseDifferenceOutlierDetection
+
+    with pytest.raises(ValueError):
+        model = PairwiseDifferenceOutlierDetection()
+        model.fit(X_non_numeric, y_dummy)

sourcery-ai · 2025-08-27T14:28:04Z

+    # Train PairwiseDifferenceOutlierDetection
+    model = PairwiseDifferenceOutlierDetection()
+
+    # >>> Threshold is optimized inside 'fit()' based on mean anomaly scores
+    # >>> Threshold is found on X_train_pair with best F1 macro (based on y_train)
+    model.fit(pd.DataFrame(X_train), y_train)   # Important "i did it already in PairwiseDifferenceOutlierDetection": Pass X_train_Pair as DataFrame


suggestion (testing): No test for estimator behavior when all samples are identical.

Add a test with identical input samples to verify the estimator's robustness and correctness in this edge case.

sourcery-ai · 2025-08-27T14:28:04Z

+    # Metrics for PairwiseDifferenceOutlierDetection
+    f1_pdl = f1_score(y_test, y_pred_pdl, average='macro')
+    auc_roc_pdl = roc_auc_score(y_test, y_pred_pdl)
+    auc_pr_pdl = average_precision_score(y_test, y_pred_pdl)


suggestion (testing): No test for estimator behavior when input contains NaN or infinite values.

Add a test with NaN or infinite values in the input to ensure the estimator responds correctly.

Suggested implementation:

    model = PairwiseDifferenceOutlierDetection()

    # >>> Threshold is optimized inside 'fit()' based on mean anomaly scores
    # >>> Threshold is found on X_train_pair with best F1 macro (based on y_train)
    model.fit(pd.DataFrame(X_train), y_train)   # Important "i did it already in PairwiseDifferenceOutlierDetection": Pass X_train_Pair as DataFrame

    # Predict using the best found threshold
    # >>> Prediction is made on X_test by applying the selected best threshold from training
    y_pred_pdl = model.predict(pd.DataFrame(X_test))

    # Test estimator behavior with NaN and infinite values in input
    import numpy as np
    import pytest

    X_test_nan = pd.DataFrame(X_test.copy())
    X_test_nan.iloc[0, 0] = np.nan
    X_test_inf = pd.DataFrame(X_test.copy())
    X_test_inf.iloc[0, 0] = np.inf

    # Check for NaN
    with pytest.raises(ValueError):
        model.predict(X_test_nan)

    # Check for infinite
    with pytest.raises(ValueError):
        model.predict(X_test_inf)

Ensure that pytest is available in your test environment.

If the estimator does not currently raise a ValueError for NaN or infinite values, you may need to update the estimator implementation to do so.

If you use a different test framework, adjust the exception assertion accordingly.
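
If the estimator is updated to reject such inputs, one way to get the `ValueError` is scikit-learn's `check_array`, which refuses NaN and infinite values by default. The sketch below shows that behavior in isolation; `validated_input` is a hypothetical helper, and whether the PR's estimator should reject or tolerate NaN is left open in this thread.

```python
import numpy as np
from sklearn.utils.validation import check_array


def validated_input(X):
    """Return X as a 2-D float array; check_array raises ValueError on NaN or infinity."""
    return check_array(X, dtype=float)
```

Calling such a helper at the top of `fit` and `score_samples` would make the two `pytest.raises(ValueError)` checks above pass without any custom validation code.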

Co-authored-by: Karim-53 <mohamedkarim.belaid@supcom.tn>

Karim-53 · 2025-08-27T14:35:44Z

+# PairwiseDifferenceOutlierDetection class: Author Nizar Kadri <nizar.kadri@campus.lmu.de>
+# License: Apache-2.0 clause
+


move to line 2

Karim-53 · 2025-08-27T14:37:15Z

+import seaborn as sns
+


don't add imports that are not used -.-'
same in the other .py file
you can check that with VS Code or with Copilot

Karim-53 · 2025-08-27T14:48:03Z

+    self.estimator = estimator if estimator is not None else IsolationForest()
+    self.classifier = self.estimator
+
+def fit(self, X, y):


Since this is an unsupervised outlier detection, y=None should be the default option:

    if y is not None:
        # supervised mode → F1 optimization
        candidate_percentiles = np.arange(1, 100, 1)
        best_f1, best_threshold, best_percentile = -np.inf, None, None

        for p in candidate_percentiles:
            threshold = np.percentile(mean_scores, p)
            y_pred = np.where(mean_scores < threshold, 0, 1)
            f1 = f1_score(y, y_pred, average='macro')
            if f1 > best_f1:
                best_f1, best_threshold, best_percentile = f1, threshold, p
        self.threshold_ = best_threshold
        self.percentile_ = best_percentile
        print(f"[supervised] Best percentile={best_percentile}, F1={best_f1:.4f}")
    else:
        # unsupervised mode → threshold from contamination
        perc = 100 * contamination
        self.threshold_ = np.percentile(mean_scores, perc)
        self.percentile_ = perc
        print(f"[unsupervised] Threshold set at {perc:.1f}th percentile")
    return self

I have extended the method so that it also works in a fully unsupervised manner —
that is, without using any label information.
However, it turned out that the results (F1, AUC) became worse,
since F1-based threshold tuning is no longer possible.
I am currently working on testing more adaptive thresholding strategies
to improve the performance of the unsupervised approach.

Dataset 1: celeba_baldvsnonbald_normalised.csv
[unsupervised] Threshold set at 10.0th percentile
Dataset 2: bank-additional-full_normalised.csv
[unsupervised] Threshold set at 10.0th percentile
Summary:
   Dataset    Method                              Percentile  F1_macro  AUC_ROC   AUC_PR
0  Dataset 1  IsolationForest                     baseline    0.498930  0.601691  0.074599
1  Dataset 1  PairwiseDifferenceOutlierDetection  10.0        0.106228  0.350866  0.045190
2  Dataset 2  IsolationForest                     baseline    0.620101  0.631384  0.133820
3  Dataset 2  PairwiseDifferenceOutlierDetection  10.0        0.114761  0.353317  0.054382
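
The two thresholding modes discussed in this thread can be pulled out into a standalone function, which makes them easier to test separately from the estimator. This is a sketch under assumptions rather than the PR's code: `select_threshold` is a hypothetical name, the default `contamination=0.1` mirrors the 10th percentile in the output above, and labels follow scikit-learn's convention (1 for inliers, -1 for outliers) instead of the 0/1 encoding in the snippet above.

```python
import numpy as np
from sklearn.metrics import f1_score


def select_threshold(mean_scores, y=None, contamination=0.1):
    """Return (threshold, percentile); scores below the threshold count as outliers.

    y, if given, uses 1 for inliers and -1 for outliers.
    """
    mean_scores = np.asarray(mean_scores, dtype=float)
    if y is None:
        # Unsupervised mode: flag the lowest `contamination` fraction of scores.
        percentile = 100.0 * contamination
        return np.percentile(mean_scores, percentile), percentile

    # Supervised mode: pick the percentile (1 to 99) that maximizes F1-macro.
    best_f1, best_threshold, best_percentile = -np.inf, None, None
    for p in range(1, 100):
        threshold = np.percentile(mean_scores, p)
        y_pred = np.where(mean_scores < threshold, -1, 1)
        f1 = f1_score(y, y_pred, average="macro")
        if f1 > best_f1:
            best_f1, best_threshold, best_percentile = f1, threshold, p
    return best_threshold, best_percentile
```

Note that in supervised mode the threshold is tuned on training labels, so scores reported on the same data are optimistic; the unsupervised numbers above are not directly comparable to F1-tuned ones.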

Karim-53 · 2025-08-27T14:48:33Z

+from sklearn.linear_model import LogisticRegression
+from sklearn.neighbors import NearestNeighbors
+from sklearn.ensemble import IsolationForest
+from sklearn.datasets import fetch_openml


try to shorten the imports to what you really need

Karim-53 · 2025-08-27T14:49:11Z

+#  Dictionary of dataset file paths
+file_paths = {
+    1: "/datasets/celeba_baldvsnonbald_normalised.csv",
+    2: "/datasets/census-income-full-mixed-binarized.csv",


this is still missing in the repo (you did not upload it)

Karim-53 · 2025-08-27T14:50:16Z

-numpy==1.26.4
+numpy~=1.26.4
 pandas~=2.2.0
 scikit-learn~=1.3.2
-tqdm
+tqdm~=4.66.2
 openml~=0.14.2
 psutil~=5.9.8
 joblib~=1.2.0
 scipy~=1.11.4
-fastparquet
+fastparquet~=2024.2.0


no. revert changes

Karim-53 · 2025-08-27T15:10:29Z

+    self.estimator.fit(X_train_pair)
+
+    # Calculate anomaly scores on the training pairs
+    scores = abs(self.estimator.score_samples(X_train_pair))


don’t take abs, keep raw scores, and treat lower as more abnormal. Also, since you subclass OutlierMixin, align with scikit-learn convention: predict should return 1 for inliers and -1 for outliers.

-    scores = abs(self.estimator.score_samples(X_train_pair))
+    scores = self.estimator.score_samples(X_train_pair)
     score_df = pd.DataFrame(scores.reshape((-1, len(self.X_train))))
     mean_scores = score_df.mean(axis=1).to_numpy()
-    y_pred = np.where(mean_scores < threshold, 0, 1)
+    # 1 = inlier, -1 = outlier to match sklearn
+    y_pred = np.where(mean_scores < threshold, -1, 1)

 def predict(self, X):
-    scores = self.score_samples(X)
-    return np.where(scores < self.threshold_, 0, 1)
+    scores = self.score_samples(X)
+    return np.where(scores < self.threshold_, -1, 1)

Karim-53 · 2025-08-27T15:14:28Z

+    - Pairs all training samples (X_train × X_train)
+    - Trains the Isolation Forest on pairwise differences
+    - Searches for the best percentile threshold (1% to 99%) that maximizes F1-macro score
+    """


set common sklearn attrs:

+    from sklearn.utils.validation import check_array, check_is_fitted
+    X = check_array(X, force_all_finite="allow-nan")
+    self.n_features_in_ = X.shape[1]
+    self.feature_names_in_ = getattr(X, "columns", None)
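
One detail worth noting when applying this suggestion: `check_array` returns a NumPy array, so column names have to be read before the conversion if they are to be kept. The sketch below shows one possible ordering. `validate_and_describe` is a hypothetical helper, and it uses `check_array`'s default finiteness check rather than the `force_all_finite="allow-nan"` argument above, since that keyword is named differently in newer scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.utils.validation import check_array


def validate_and_describe(X):
    """Return (validated array, n_features_in, feature_names_in or None)."""
    # Read column names first: after check_array, X is a plain ndarray.
    feature_names = np.asarray(X.columns, dtype=object) if hasattr(X, "columns") else None
    X_checked = check_array(X)
    return X_checked, X_checked.shape[1], feature_names


# Toy frame, for illustration only.
frame = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
X_checked, n_features_in, feature_names_in = validate_and_describe(frame)
```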

Karim-53 · 2025-08-27T15:14:59Z

+    self.estimator.fit(X)
+    self.X_train = X
+
+    # Create pairwise differences for training
+    X_train_pair, _ = self.__pair_input(X, X)
+    self.estimator.fit(X_train_pair)


why two .fit() ?????

Karim-53 · 2025-08-27T15:16:03Z

+import warnings
+warnings.filterwarnings('ignore')


Suggested change

-import warnings
-warnings.filterwarnings('ignore')

Karim-53 · 2025-08-27T15:17:58Z

+    - estimator: A scikit-learn compatible estimator (default is IsolationForest).
+    """
+    self.estimator = estimator if estimator is not None else IsolationForest()
+    self.classifier = self.estimator


this is never used, delete it

Suggested change

-    self.classifier = self.estimator

Karim-53 · 2025-08-27T15:18:35Z

+    score_df = pd.DataFrame(scores.reshape((-1, len(self.X_train))))
+    return score_df.mean(axis=1).to_numpy()
+
+def decision_function(self, X):


Make it consistent with scikit-learn: positive = more normal.
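
In scikit-learn's outlier detectors, `decision_function` is the raw score shifted by a threshold, so that negative values mean outlier and positive values mean inlier. A minimal sketch of that convention follows; the function names are hypothetical, and `threshold` stands in for the estimator's fitted `threshold_`.

```python
import numpy as np


def decision_from_scores(scores, threshold):
    """Shift raw scores so that 0 is the decision boundary (positive = more normal)."""
    return np.asarray(scores, dtype=float) - threshold


def labels_from_decision(decision):
    """Map decision values to scikit-learn labels: -1 for outliers, 1 for inliers."""
    return np.where(np.asarray(decision) < 0, -1, 1)
```

Defined this way, `predict` and `decision_function` cannot disagree, because the labels are derived from the sign of the decision values.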

Karim-53 · 2025-08-27T15:26:28Z

+# Save all results
+df_results = pd.DataFrame(all_results)
+print("Summary:")
+print(df_results)


add an assert on the expected performance
for example
assert F1_macro > 0.7 # the exact value should be around 0.71

Karim-53 · 2025-08-27T15:30:46Z

@@ -0,0 +1,124 @@
+# Data manipulation and analysis


rename this file to run_benchmark_outlier_detection.py and move it to root

then add a simple example in the folder examples/ and, if you want, some unit tests in tests

Karim-53 · 2025-08-30T05:55:42Z

@codex propose a fix to the failing github actions

chatgpt-codex-connector · 2025-08-30T05:55:49Z

@codex propose a fix to the failing github actions

For now, I can only help with PRs you've created.

Nizar Kadri and others added 2 commits August 26, 2025 20:23

added PairwiseDifferenceOutlierDetection

b23a55e

add Outlier detection Exp

f777fef

Karim-53 reviewed Aug 27, 2025

sourcery-ai Bot reviewed Aug 27, 2025

Update pdll/_pairwise.py

bac16f2

Co-authored-by: Karim-53 <mohamedkarim.belaid@supcom.tn>
