This is based on a Gitter conversation with @adrinjalali @hildeweerts @MiroDudik, in which we agreed that it would be nice to have a synthetic dataset available for our examples. @adrinjalali suggested the following code, which uses scikit-learn's make_classification:
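import numpy as np
from numpy.random import RandomState
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# One shared RandomState so the three groups get different draws.
rng = RandomState(seed=42)

# Each group uses a different class_sep, so the classes are
# easier or harder to separate depending on the group.
X_women, y_women = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=4,
    n_classes=2,
    class_sep=1,
    random_state=rng,
)
X_men, y_men = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=4,
    n_classes=2,
    class_sep=2,
    random_state=rng,
)
X_unspecified, y_unspecified = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=4,
    n_classes=2,
    class_sep=0.5,
    random_state=rng,
)

# Stack the groups and build the matching sensitive feature column.
X = np.r_[X_women, X_men, X_unspecified]
y = np.r_[y_women, y_men, y_unspecified]
gender = np.r_[["Woman"] * 500, ["Man"] * 500, ["Unspecified"] * 500].reshape(
    -1,
)

X_train, X_test, y_train, y_test, gender_train, gender_test = train_test_split(
    X, y, gender, test_size=0.3, random_state=rng
)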
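To see what the per-group `class_sep` values are meant to buy us, here is a rough sketch that rebuilds the same data in a loop, fits a single classifier, and reports accuracy per gender group. The choice of `LogisticRegression` and the `accuracy_by_group` name are my assumptions for illustration, not part of the suggestion above, and the actual numbers will depend on the random draws.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# class_sep per group, copied from the snippet above.
groups = {"Woman": 1.0, "Man": 2.0, "Unspecified": 0.5}

X_parts, y_parts, gender_parts = [], [], []
for name, sep in groups.items():
    X_g, y_g = make_classification(
        n_samples=500,
        n_features=20,
        n_informative=4,
        n_classes=2,
        class_sep=sep,
        random_state=rng,
    )
    X_parts.append(X_g)
    y_parts.append(y_g)
    gender_parts.append(np.full(500, name))

X = np.vstack(X_parts)
y = np.concatenate(y_parts)
gender = np.concatenate(gender_parts)

X_train, X_test, y_train, y_test, gender_train, gender_test = train_test_split(
    X, y, gender, test_size=0.3, random_state=rng
)

# Assumption: any scikit-learn classifier would do here.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy computed separately for each gender group in the test set.
accuracy_by_group = {
    name: float(np.mean(y_pred[gender_test == name] == y_test[gender_test == name]))
    for name in groups
}
print(accuracy_by_group)
```

If the accuracies differ between groups, that gives the examples a disparity to measure and mitigate.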
@MiroDudik suggested extending this to have at least 2 sensitive features and 1 control feature to allow us to use it in basically all our examples.
@fairlearn/fairlearn-maintainers any objections to putting this in the fairlearn.datasets module?
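For discussion, the extension @MiroDudik suggested could look roughly like the sketch below. Everything specific in it is hypothetical: the function name `make_synthetic_dataset`, the second sensitive feature (`age_group`), the control feature values, and the `class_sep` chosen for each group are placeholders, not an agreed design or an existing fairlearn API.

```python
import numpy as np
from sklearn.datasets import make_classification


def make_synthetic_dataset(n_per_group=250, seed=42):
    """Hypothetical generator with two sensitive features and one control feature."""
    rng = np.random.RandomState(seed)

    # Placeholder class_sep for each (gender, age_group) combination.
    class_sep = {
        ("Woman", "Under 40"): 1.0,
        ("Woman", "40 and over"): 0.8,
        ("Man", "Under 40"): 2.0,
        ("Man", "40 and over"): 1.5,
        ("Unspecified", "Under 40"): 0.5,
        ("Unspecified", "40 and over"): 0.4,
    }

    X_parts, y_parts, gender, age_group = [], [], [], []
    for (g, a), sep in class_sep.items():
        X_g, y_g = make_classification(
            n_samples=n_per_group,
            n_features=20,
            n_informative=4,
            n_classes=2,
            class_sep=sep,
            random_state=rng,
        )
        X_parts.append(X_g)
        y_parts.append(y_g)
        gender.extend([g] * n_per_group)
        age_group.extend([a] * n_per_group)

    X = np.vstack(X_parts)
    y = np.concatenate(y_parts)

    # Control feature drawn independently of the sensitive features (an assumption).
    control = rng.choice(["Low", "High"], size=len(y))

    return X, y, np.array(gender), np.array(age_group), control


X, y, gender, age_group, control = make_synthetic_dataset()
print(X.shape, y.shape, np.unique(gender), np.unique(age_group), np.unique(control))
```

Returning the sensitive and control features as separate arrays keeps them out of `X`, so each example can decide whether to pass them to the model.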