pairwise difference outlier detection#7
Reviewer's Guide
This PR adds a custom PairwiseDifferenceOutlierDetection estimator that learns on pairwise feature differences, with automatic percentile-based threshold selection that optimizes F1-macro. It also adjusts dependency versions and introduces a new evaluation script to benchmark the new detector against IsolationForest.
Sequence diagram for training and threshold selection in PairwiseDifferenceOutlierDetection
sequenceDiagram
participant User
participant PairwiseDifferenceOutlierDetection
participant IsolationForest
User->>PairwiseDifferenceOutlierDetection: fit(X, y)
PairwiseDifferenceOutlierDetection->>PairwiseDifferenceOutlierDetection: __pair_input(X, X)
PairwiseDifferenceOutlierDetection->>IsolationForest: fit(pairwise differences)
PairwiseDifferenceOutlierDetection->>IsolationForest: score_samples(pairwise differences)
PairwiseDifferenceOutlierDetection->>PairwiseDifferenceOutlierDetection: Select threshold maximizing F1-macro
PairwiseDifferenceOutlierDetection-->>User: return self
Sequence diagram for anomaly prediction in PairwiseDifferenceOutlierDetection
sequenceDiagram
participant User
participant PairwiseDifferenceOutlierDetection
participant IsolationForest
User->>PairwiseDifferenceOutlierDetection: predict(X)
PairwiseDifferenceOutlierDetection->>PairwiseDifferenceOutlierDetection: score_samples(X)
PairwiseDifferenceOutlierDetection->>IsolationForest: score_samples(pairwise differences to train set)
PairwiseDifferenceOutlierDetection->>PairwiseDifferenceOutlierDetection: Compare scores to threshold_
PairwiseDifferenceOutlierDetection-->>User: return predicted labels
Class diagram for PairwiseDifferenceOutlierDetection estimator
classDiagram
class PairwiseDifferenceOutlierDetection {
+estimator
+classifier
+X_train
+threshold_
+percentile_
+__init__(estimator=None)
+fit(X, y)
+predict(X)
+score_samples(X)
+decision_function(X)
+__pair_input(X1, X2)
}
PairwiseDifferenceOutlierDetection --|> BaseEstimator
PairwiseDifferenceOutlierDetection --|> OutlierMixin
PairwiseDifferenceOutlierDetection o-- IsolationForest : uses as default estimator
There was a problem hiding this comment.
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `pdll/_pairwise.py:1037` </location>
<code_context>
+ self.threshold_ = best_threshold
+ self.percentile_ = best_percentile
+
+ print(f" Best percentile selected: {self.percentile_} with F1-macro: {best_f1:.4f}")
+
+ return self
</code_context>
<issue_to_address>
Use of print statement for logging may not be appropriate in production code.
Replace the print statement with a logging call for improved output management and integration.
Suggested implementation:
```python
import logging
logging.info(f"Best percentile selected: {self.percentile_} with F1-macro: {best_f1:.4f}")
```
If the file already imports `logging` at the top, you can omit the `import logging` line in the replacement. For best practice, ensure logging is configured somewhere in your codebase (e.g., with `logging.basicConfig(level=logging.INFO)` in your main entrypoint).
</issue_to_address>
### Comment 2
<location> `pdll/_pairwise.py:1093` </location>
<code_context>
+ Returns:
+ - tuple: (paired feature differences, symmetric feature differences)
+ """
+ X_pair = X1.merge(X2, how="cross")
+
+ # Extract and rename columns for difference calculation
</code_context>
<issue_to_address>
Cross-join for pairwise differences may have high memory and performance cost.
For large inputs, this approach may cause memory issues or slowdowns. Consider implementing checks for input size or using a more efficient method.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
X_pair = X1.merge(X2, how="cross")
# Extract and rename columns for difference calculation
=======
# Check for large input sizes before performing cross-join
max_cross_rows = 1_000_000  # threshold above which an error is raised
expected_rows = len(X1) * len(X2)
if expected_rows > max_cross_rows:
    raise ValueError(
        f"PairwiseDifference: Cross-join would result in {expected_rows} rows, "
        "which may cause memory or performance issues. "
        "Consider using smaller inputs or a more efficient method."
    )
X_pair = X1.merge(X2, how="cross")
# Extract and rename columns for difference calculation
>>>>>>> REPLACE
</suggested_fix>
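One possible "more efficient method" is to compute the differences with NumPy broadcasting instead of a DataFrame cross join. The sketch below is only an illustration under assumptions: the function names `pairwise_differences` and `pairwise_differences_chunked` are hypothetical and not part of this PR, and the sketch returns only the difference columns, whereas `__pair_input` also keeps both sides of each pair.

```python
import numpy as np


def pairwise_differences(X1, X2):
    """Return X1[i] - X2[j] for every pair (i, j).

    The result has shape (len(X1) * len(X2), n_features), and the rows are in
    cross-join order: each row of X1 is paired with every row of X2 in turn.
    """
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    # Broadcasting: (n1, 1, d) - (1, n2, d) -> (n1, n2, d)
    diffs = X1[:, None, :] - X2[None, :, :]
    return diffs.reshape(-1, X1.shape[1])


def pairwise_differences_chunked(X1, X2, chunk_size=1024):
    """Yield difference blocks for chunk_size rows of X1 at a time.

    This bounds peak memory to roughly chunk_size * len(X2) rows per block.
    """
    X1 = np.asarray(X1, dtype=float)
    for start in range(0, len(X1), chunk_size):
        yield pairwise_differences(X1[start:start + chunk_size], X2)
```

The chunked variant does not reduce the total amount of work, which is still quadratic in the number of samples; it only avoids materializing all pairs at once.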
### Comment 3
<location> `pdll/_pairwise.py:1099` </location>
<code_context>
+ x1_pair = X_pair[[f'{col}_x' for col in X1.columns]].rename(columns={f'{col}_x': f'{col}_diff' for col in X1.columns})
+ x2_pair = X_pair[[f'{col}_y' for col in X1.columns]].rename(columns={f'{col}_y': f'{col}_diff' for col in X1.columns})
+
+ try:
+ calculate_difference = x1_pair - x2_pair
+ except Exception as e:
</code_context>
<issue_to_address>
Exception handling for non-numeric data is broad.
Catching only relevant exceptions like TypeError will help ensure other errors are not inadvertently ignored.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
try:
    calculate_difference = x1_pair - x2_pair
except Exception as e:
    raise ValueError("PairwiseDifference: Non-numeric data found.") from e
=======
try:
    calculate_difference = x1_pair - x2_pair
except TypeError as e:
    raise ValueError("PairwiseDifference: Non-numeric data found.") from e
>>>>>>> REPLACE
</suggested_fix>
### Comment 4
<location> `pdll/_pairwise.py:1105` </location>
<code_context>
+ raise ValueError("PairwiseDifference: Non-numeric data found.") from e
+
+ # Concatenate the original cross-join and calculated differences
+ X_pair = pd.concat([X_pair, calculate_difference], axis='columns')
+
+
</code_context>
<issue_to_address>
Concatenation of DataFrames may result in duplicate columns or misalignment.
Check for overlapping column names before concatenation and validate the output DataFrame's structure and columns.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
# Concatenate the original cross-join and calculated differences
X_pair = pd.concat([X_pair, calculate_difference], axis='columns')
=======
# Check for overlapping column names before concatenation
overlap = set(X_pair.columns) & set(calculate_difference.columns)
if overlap:
    raise ValueError(f"PairwiseDifference: Overlapping column names detected during concatenation: {overlap}")
# Concatenate the original cross-join and calculated differences
X_pair = pd.concat([X_pair, calculate_difference], axis='columns')
# Validate that the concatenation did not produce duplicate column names
if X_pair.columns.duplicated().any():
    raise ValueError("PairwiseDifference: Duplicate columns found in the output DataFrame after concatenation.")
>>>>>>> REPLACE
</suggested_fix>
### Comment 5
<location> `pdll/_pairwise.py:1113` </location>
<code_context>
+ x1_pair_sym = X_pair[[f'{col}_y' for col in X1.columns]].rename(columns={f'{col}_y': f'{col}_x' for col in X1.columns})
+ X_pair_sym = pd.concat([x1_pair_sym, x2_pair_sym, x2_pair - x1_pair], axis='columns')
+
+ return X_pair, X_pair_sym
+
+
</code_context>
<issue_to_address>
Returned symmetric feature differences are not used elsewhere in the class.
Consider removing symmetric feature differences from the return value if they are not needed, to keep the interface clean.
Suggested implementation:
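```python
return X_pair
```
Callers that unpack two values, such as `X_train_pair, _ = self.__pair_input(X, X)`, need to be updated to match.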
</issue_to_address>
### Comment 6
<location> `tests/unsupervised_outlier_detection.py:72` </location>
<code_context>
+ X_sampled = df_sampled.drop(columns=['class']).values
+ y_sampled = df_sampled['class'].values
+
+ # Train/Test Split (without random_state, fully random split)
+ X_train, X_test, y_train, y_test = train_test_split(X_sampled, y_sampled, test_size=0.4)
+ assert len(np.unique(y_train)) == 2
+ assert len(np.unique(y_test)) == 2
</code_context>
<issue_to_address>
Test split does not use a fixed random_state, which may lead to non-reproducible results.
Set random_state in train_test_split for consistent and reproducible test splits.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
# Train/Test Split (without random_state, fully random split)
X_train, X_test, y_train, y_test = train_test_split(X_sampled, y_sampled, test_size=0.4)
=======
# Train/Test Split (with fixed random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X_sampled, y_sampled, test_size=0.4, random_state=42)
>>>>>>> REPLACE
</suggested_fix>
### Comment 7
<location> `tests/unsupervised_outlier_detection.py:77` </location>
<code_context>
+ assert len(np.unique(y_train)) == 2
+ assert len(np.unique(y_test)) == 2
+
+ # Baseline: IsolationForest
+ iso_forest = IsolationForest()
+ iso_forest.fit(X_train)
+ y_pred_iso = np.where(iso_forest.predict(X_test) == -1, 1, 0)
+
</code_context>
<issue_to_address>
No test for error handling when input data contains non-numeric values.
Add a test with non-numeric input to confirm that ValueError is raised by PairwiseDifferenceOutlierDetection.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
assert len(np.unique(y_train)) == 2
assert len(np.unique(y_test)) == 2
=======
assert len(np.unique(y_train)) == 2
assert len(np.unique(y_test)) == 2
# Error handling test: non-numeric input for PairwiseDifferenceOutlierDetection
import pytest
from unsupervised_outlier_detection import PairwiseDifferenceOutlierDetection
# Wrap the input in a DataFrame: the estimator pairs samples with DataFrame.merge
X_non_numeric = pd.DataFrame([['a', 'b'], ['c', 'd']])
y_dummy = [0, 1]
model = PairwiseDifferenceOutlierDetection()
with pytest.raises(ValueError):
    model.fit(X_non_numeric, y_dummy)
>>>>>>> REPLACE
</suggested_fix>
### Comment 8
<location> `tests/unsupervised_outlier_detection.py:96` </location>
<code_context>
+ "AUC_PR": auc_pr_iso
+ })
+
+ # Train PairwiseDifferenceOutlierDetection
+ model = PairwiseDifferenceOutlierDetection()
+
+ # >>> Threshold is optimized inside 'fit()' based on mean anomaly scores
+ # >>> Threshold is found on X_train_pair with best F1 macro (based on y_train)
+ model.fit(pd.DataFrame(X_train), y_train) # Important "i did it already in PairwiseDifferenceOutlierDetection": Pass X_train_Pair as DataFrame
+
+ # Predict using the best found threshold
</code_context>
<issue_to_address>
No test for estimator behavior when all samples are identical.
Add a test with identical input samples to verify the estimator's robustness and correctness in this edge case.
</issue_to_address>
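To make the edge case concrete, here is a minimal sketch of what happens at the thresholding step when all samples are identical. It does not use the PR's estimator; it assumes a plain `IsolationForest` fitted on NumPy pairwise differences and a hypothetical 10th-percentile threshold, so the names and the percentile are illustrative only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# 20 identical samples with 3 features
X_identical = np.ones((20, 3))

# All pairwise differences between identical rows are zero vectors
diffs = (X_identical[:, None, :] - X_identical[None, :, :]).reshape(-1, 3)

forest = IsolationForest(random_state=0).fit(diffs)
scores = forest.score_samples(diffs)

# Average the pair scores per sample, as the estimator does
mean_scores = scores.reshape(len(X_identical), -1).mean(axis=1)

# Assumed percentile threshold, for illustration only
threshold = np.percentile(mean_scores, 10)
flagged = mean_scores < threshold
```

Because every mean score is the same, any percentile equals that value and a strict `<` comparison flags nothing. A test for this case could assert that behavior, or whatever behavior the estimator is meant to have for degenerate input.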
### Comment 9
<location> `tests/unsupervised_outlier_detection.py:107` </location>
<code_context>
+ # >>> Prediction is made on X_test by applying the selected best threshold from training
+ y_pred_pdl = model.predict(pd.DataFrame(X_test))
+
+ # Metrics for PairwiseDifferenceOutlierDetection
+ f1_pdl = f1_score(y_test, y_pred_pdl, average='macro')
+ auc_roc_pdl = roc_auc_score(y_test, y_pred_pdl)
+ auc_pr_pdl = average_precision_score(y_test, y_pred_pdl)
+
+ all_results.append({
</code_context>
<issue_to_address>
No test for estimator behavior when input contains NaN or infinite values.
Add a test with NaN or infinite values in the input to ensure the estimator responds correctly.
Suggested implementation:
```python
model = PairwiseDifferenceOutlierDetection()
# >>> Threshold is optimized inside 'fit()' based on mean anomaly scores
# >>> Threshold is found on X_train_pair with best F1 macro (based on y_train)
model.fit(pd.DataFrame(X_train), y_train) # Important "i did it already in PairwiseDifferenceOutlierDetection": Pass X_train_Pair as DataFrame
# Predict using the best found threshold
# >>> Prediction is made on X_test by applying the selected best threshold from training
y_pred_pdl = model.predict(pd.DataFrame(X_test))
# Test estimator behavior with NaN and infinite values in input
import numpy as np
import pytest
X_test_nan = pd.DataFrame(X_test.copy())
X_test_nan.iloc[0, 0] = np.nan
X_test_inf = pd.DataFrame(X_test.copy())
X_test_inf.iloc[0, 0] = np.inf
# Check for NaN
with pytest.raises(ValueError):
    model.predict(X_test_nan)
# Check for infinite
with pytest.raises(ValueError):
    model.predict(X_test_inf)
```
- Ensure that `pytest` is available in your test environment.
- If the estimator does not currently raise a `ValueError` for NaN or infinite values, you may need to update the estimator implementation to do so.
- If you use a different test framework, adjust the exception assertion accordingly.
</issue_to_address>
Co-authored-by: Karim-53 <mohamedkarim.belaid@supcom.tn>
| # PairwiseDifferenceOutlierDetection class: Author Nizar Kadri <nizar.kadri@campus.lmu.de> | ||
| # License: Apache-2.0 clause | ||
|
|
| import seaborn as sns | ||
|
|
There was a problem hiding this comment.
Don't add imports that are not used -.-'
Same in the other .py file.
You can check that with VS Code or with Copilot.
| self.estimator = estimator if estimator is not None else IsolationForest() | ||
| self.classifier = self.estimator | ||
|
|
||
| def fit(self, X, y): |
There was a problem hiding this comment.
Since this is an unsupervised outlier detection y=None should be the default option
There was a problem hiding this comment.
Since this is an unsupervised outlier detection y=None should be the default option:
if y is not None:
    # supervised mode → F1 optimization
    candidate_percentiles = np.arange(1, 100, 1)
    best_f1, best_threshold, best_percentile = -np.inf, None, None
    for p in candidate_percentiles:
        threshold = np.percentile(mean_scores, p)
        y_pred = np.where(mean_scores < threshold, 0, 1)
        f1 = f1_score(y, y_pred, average='macro')
        if f1 > best_f1:
            best_f1, best_threshold, best_percentile = f1, threshold, p
    self.threshold_ = best_threshold
    self.percentile_ = best_percentile
    print(f"[supervised] Best percentile={best_percentile}, F1={best_f1:.4f}")
else:
    # unsupervised mode → threshold derived from contamination
    perc = 100 * contamination
    self.threshold_ = np.percentile(mean_scores, perc)
    self.percentile_ = perc
    print(f"[unsupervised] Threshold set at {perc:.1f}th percentile")
return self
I have extended the method so that it also works in a fully unsupervised manner —
that is, without using any label information.
However, it turned out that the results (F1, AUC) became worse,
since F1-based threshold tuning is no longer possible.
I am currently working on testing more adaptive thresholding strategies
to improve the performance of the unsupervised approach.
Dataset 1: celeba_baldvsnonbald_normalised.csv
[unsupervised] Threshold set at 10.0th percentile
Dataset 2: bank-additional-full_normalised.csv
[unsupervised] Threshold set at 10.0th percentile
Summary:
     Dataset  Method                              Percentile  F1_macro  AUC_ROC   AUC_PR
0  Dataset 1  IsolationForest                     baseline    0.498930  0.601691  0.074599
1  Dataset 1  PairwiseDifferenceOutlierDetection  10.0        0.106228  0.350866  0.045190
2  Dataset 2  IsolationForest                     baseline    0.620101  0.631384  0.133820
3  Dataset 2  PairwiseDifferenceOutlierDetection  10.0        0.114761  0.353317  0.054382
| from sklearn.linear_model import LogisticRegression | ||
| from sklearn.neighbors import NearestNeighbors | ||
| from sklearn.ensemble import IsolationForest | ||
| from sklearn.datasets import fetch_openml |
There was a problem hiding this comment.
try to shorten the imports to what you really need
| # Dictionary of dataset file paths | ||
| file_paths = { | ||
| 1: "/datasets/celeba_baldvsnonbald_normalised.csv", | ||
| 2: "/datasets/census-income-full-mixed-binarized.csv", |
There was a problem hiding this comment.
this is still missing in the repo (you did not upload it)
| numpy==1.26.4 | ||
| numpy~=1.26.4 | ||
| pandas~=2.2.0 | ||
| scikit-learn~=1.3.2 | ||
| tqdm | ||
| tqdm~=4.66.2 | ||
| openml~=0.14.2 | ||
| psutil~=5.9.8 | ||
| joblib~=1.2.0 | ||
| scipy~=1.11.4 | ||
| fastparquet No newline at end of file | ||
| fastparquet~=2024.2.0 No newline at end of file |
| self.estimator.fit(X_train_pair) | ||
|
|
||
| # Calculate anomaly scores on the training pairs | ||
| scores = abs(self.estimator.score_samples(X_train_pair)) |
There was a problem hiding this comment.
don’t take abs, keep raw scores, and treat lower as more abnormal. Also, since you subclass OutlierMixin, align with scikit-learn convention: predict should return 1 for inliers and -1 for outliers.
There was a problem hiding this comment.
- scores = abs(self.estimator.score_samples(X_train_pair))
+ scores = self.estimator.score_samples(X_train_pair)
score_df = pd.DataFrame(scores.reshape((-1, len(self.X_train))))
mean_scores = score_df.mean(axis=1).to_numpy()
- y_pred = np.where(mean_scores < threshold, 0, 1)
+ # 1 = inlier, -1 = outlier to match sklearn
+ y_pred = np.where(mean_scores < threshold, -1, 1)
There was a problem hiding this comment.
def predict(self, X):
- scores = self.score_samples(X)
- return np.where(scores < self.threshold_, 0, 1)
+ scores = self.score_samples(X)
+ return np.where(scores < self.threshold_, -1, 1)
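The reason `abs` matters can be seen in a small sketch. It assumes synthetic Gaussian data and a hypothetical threshold chosen halfway between two scores; none of these values come from the PR. `IsolationForest.score_samples` returns negative values where lower means more abnormal, so taking the absolute value reverses the ordering.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # synthetic inliers
X_query = np.array([[0.0, 0.0],   # typical point
                    [8.0, 8.0]])  # obvious outlier

forest = IsolationForest(random_state=0).fit(X_train)

raw = forest.score_samples(X_query)  # negative; lower = more abnormal
flipped = np.abs(raw)                # abs() makes the outlier the LARGER value

# Hypothetical threshold halfway between the two scores, for illustration only
threshold = raw.mean()
labels = np.where(raw < threshold, -1, 1)  # scikit-learn convention: -1 = outlier
```

With the raw scores, `scores < threshold` selects the outlier; after `abs`, the same comparison would select the typical point instead.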
| - Pairs all training samples (X_train × X_train) | ||
| - Trains the Isolation Forest on pairwise differences | ||
| - Searches for the best percentile threshold (1% to 99%) that maximizes F1-macro score | ||
| """ |
There was a problem hiding this comment.
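set common sklearn attrs (read the column names before check_array, which returns a plain ndarray):
+ from sklearn.utils.validation import check_array, check_is_fitted
+ feature_names = getattr(X, "columns", None)
+ X_checked = check_array(X, force_all_finite="allow-nan")
+ self.n_features_in_ = X_checked.shape[1]
+ self.feature_names_in_ = feature_names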
| self.estimator.fit(X) | ||
| self.X_train = X | ||
|
|
||
| # Create pairwise differences for training | ||
| X_train_pair, _ = self.__pair_input(X, X) | ||
| self.estimator.fit(X_train_pair) |
| import warnings | ||
| warnings.filterwarnings('ignore') |
There was a problem hiding this comment.
| import warnings | |
| warnings.filterwarnings('ignore') |
| - estimator: A scikit-learn compatible estimator (default is IsolationForest). | ||
| """ | ||
| self.estimator = estimator if estimator is not None else IsolationForest() | ||
| self.classifier = self.estimator |
There was a problem hiding this comment.
This is never used; delete it.
| self.classifier = self.estimator |
| score_df = pd.DataFrame(scores.reshape((-1, len(self.X_train)))) | ||
| return score_df.mean(axis=1).to_numpy() | ||
|
|
||
| def decision_function(self, X): |
There was a problem hiding this comment.
Make it consistent with scikit-learn: positive = more normal.
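A minimal sketch of that convention, written as standalone functions rather than methods of the PR's class: the names and the `threshold` argument are illustrative assumptions. It mirrors how scikit-learn's `IsolationForest` relates `decision_function` to `score_samples` through an offset.

```python
import numpy as np


def decision_function(scores, threshold):
    """Shift raw scores so that 0 is the decision boundary.

    Positive values mean "more normal", negative values mean "more abnormal".
    """
    return np.asarray(scores, dtype=float) - threshold


def predict(scores, threshold):
    """Return 1 for inliers and -1 for outliers."""
    return np.where(decision_function(scores, threshold) < 0, -1, 1)
```

With this layout `predict` never compares against the threshold directly; it only looks at the sign of `decision_function`, which keeps the two methods consistent by construction.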
| # Save all results | ||
| df_results = pd.DataFrame(all_results) | ||
| print("Summary:") | ||
| print(df_results) No newline at end of file |
There was a problem hiding this comment.
Add an assert on the expected performance, for example:
assert F1_macro > 0.7 # the exact value should be around 0.71
| @@ -0,0 +1,124 @@ | |||
| # Data manipulation and analysis | |||
There was a problem hiding this comment.
Rename this file to run_benchmark_outlier_detection.py and move it to the repository root.
Then add a simple example in the examples/ folder and, if you want, some unit tests in tests/.
@codex propose a fix to the failing GitHub Actions
For now, I can only help with PRs you've created.
Summary by Sourcery
Add a pairwise-difference-based outlier detection estimator with threshold optimization, update dependency constraints, and include a test script comparing it to IsolationForest.