Add one-hot-encoding for category features by c3p0-upgini · Pull Request #64 · upgini/upgini

c3p0-upgini · 2022-05-04T10:45:51Z

No description provided.

JiwaniZakir

The per-value loop in _one_hot_encode (for other in value_counts.index[MAX_OHE_CAT_FEATURES:]) calling replace(..., inplace=True) on a column slice is both deprecated in recent pandas (chained assignment) and O(n) in the number of excess unique values. A single vectorized replacement is cleaner and significantly faster: df[cat_feature] = df[cat_feature].where(df[cat_feature].isin(value_counts.index[:MAX_OHE_CAT_FEATURES]), "other").

In _prepare_to_fit, the threshold 10 (line if len(cat_features) < 10) is a magic number that should be extracted as a named constant alongside MAX_OHE_CAT_FEATURES = 100; it's not obvious why 10 was chosen or how it relates to the cardinality cap.

The call pd.get_dummies(df) without specifying columns=cat_features will silently encode any other object-dtype columns present in X beyond the explicitly identified cat_features, which could produce unexpected columns and break shape assumptions downstream. Passing columns=cat_features explicitly would make the behavior deterministic.

The test in test_one_hot_encoding constructs cat_feature_1 with range(100) * 2 + range(100, 200), giving 200 unique values, and relies on value_counts ordering to determine which 100 are kept—but value_counts sorts by frequency (descending) then by value for ties, so value_0 through value_99 each appear twice while value_100–value_199 appear once. The test assertion encoded.loc[250, "cat_feature_1_other"] == 1 implicitly depends on this ordering; a comment explaining the assumption would prevent future confusion.

Add one-hot-encoding for category features

a145ee0

c3p0-upgini requested a review from PoUpgini May 4, 2022 13:30

Pass cv to calculate_metrics

bb45d21

JiwaniZakir reviewed Mar 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add one-hot-encoding for category features#64

Add one-hot-encoding for category features#64
c3p0-upgini wants to merge 2 commits into
mainfrom
one-hot-encoding

c3p0-upgini commented May 4, 2022

Uh oh!

JiwaniZakir left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

c3p0-upgini commented May 4, 2022

Uh oh!

JiwaniZakir left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants