Add one-hot-encoding for category features#64
Conversation
JiwaniZakir
left a comment
There was a problem hiding this comment.
The per-value loop in _one_hot_encode (for other in value_counts.index[MAX_OHE_CAT_FEATURES:]) calling replace(..., inplace=True) on a column slice is both deprecated in recent pandas (chained assignment) and O(n) in the number of excess unique values. A single vectorized replacement is cleaner and significantly faster: df[cat_feature] = df[cat_feature].where(df[cat_feature].isin(value_counts.index[:MAX_OHE_CAT_FEATURES]), "other").
In _prepare_to_fit, the threshold 10 (line if len(cat_features) < 10) is a magic number that should be extracted as a named constant alongside MAX_OHE_CAT_FEATURES = 100; it's not obvious why 10 was chosen or how it relates to the cardinality cap.
The call pd.get_dummies(df) without specifying columns=cat_features will silently encode any other object-dtype columns present in X beyond the explicitly identified cat_features, which could produce unexpected columns and break shape assumptions downstream. Passing columns=cat_features explicitly would make the behavior deterministic.
The test in test_one_hot_encoding constructs cat_feature_1 with range(100) * 2 + range(100, 200), giving 200 unique values, and relies on value_counts ordering to determine which 100 are kept—but value_counts sorts by frequency (descending) then by value for ties, so value_0 through value_99 each appear twice while value_100–value_199 appear once. The test assertion encoded.loc[250, "cat_feature_1_other"] == 1 implicitly depends on this ordering; a comment explaining the assumption would prevent future confusion.
No description provided.