Skip to content

Add one-hot-encoding for category features#64

Open
c3p0-upgini wants to merge 2 commits into
mainfrom
one-hot-encoding
Open

Add one-hot-encoding for category features#64
c3p0-upgini wants to merge 2 commits into
mainfrom
one-hot-encoding

Conversation

@c3p0-upgini

Copy link
Copy Markdown
Collaborator

No description provided.

@c3p0-upgini c3p0-upgini requested a review from PoUpgini May 4, 2022 13:30

@JiwaniZakir JiwaniZakir left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The per-value loop in _one_hot_encode (for other in value_counts.index[MAX_OHE_CAT_FEATURES:]) calling replace(..., inplace=True) on a column slice is both deprecated in recent pandas (chained assignment) and O(n) in the number of excess unique values. A single vectorized replacement is cleaner and significantly faster: df[cat_feature] = df[cat_feature].where(df[cat_feature].isin(value_counts.index[:MAX_OHE_CAT_FEATURES]), "other").

In _prepare_to_fit, the threshold 10 (line if len(cat_features) < 10) is a magic number that should be extracted as a named constant alongside MAX_OHE_CAT_FEATURES = 100; it's not obvious why 10 was chosen or how it relates to the cardinality cap.

The call pd.get_dummies(df) without specifying columns=cat_features will silently encode any other object-dtype columns present in X beyond the explicitly identified cat_features, which could produce unexpected columns and break shape assumptions downstream. Passing columns=cat_features explicitly would make the behavior deterministic.

The test in test_one_hot_encoding constructs cat_feature_1 with range(100) * 2 + range(100, 200), giving 200 unique values, and relies on value_counts ordering to determine which 100 are kept—but value_counts sorts by frequency (descending) then by value for ties, so value_0 through value_99 each appear twice while value_100value_199 appear once. The test assertion encoded.loc[250, "cat_feature_1_other"] == 1 implicitly depends on this ordering; a comment explaining the assumption would prevent future confusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants