ENH Creating a synthetic example dataset #793

@romanlutz

This is based on a Gitter conversation with @adrinjalali @hildeweerts @MiroDudik where we agreed that it would be nice to have a synthetic dataset available for our examples. @adrinjalali suggested the following code using sklearn's make_classification:

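import numpy as np
from numpy.random import RandomState
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# One shared random state so the whole dataset is reproducible.
rng = RandomState(seed=42)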

X_women, y_women = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=4,
    n_classes=2,
    class_sep=1,
    random_state=rng,
)

X_men, y_men = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=4,
    n_classes=2,
    class_sep=2,
    random_state=rng,
)

X_unspecified, y_unspecified = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=4,
    n_classes=2,
    class_sep=0.5,
    random_state=rng,
)

X = np.r_[X_women, X_men, X_unspecified]
y = np.r_[y_women, y_men, y_unspecified]
gender = np.repeat(["Woman", "Man", "Unspecified"], 500)

X_train, X_test, y_train, y_test, gender_train, gender_test = train_test_split(
    X, y, gender, test_size=0.3, random_state=rng
)
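
To check what the different class_sep values do in practice, one option is to fit a single model on the pooled data and compare accuracy per group. The following is only a sketch: the choice of LogisticRegression and the accuracy_by_group name are assumptions for illustration and were not part of the Gitter conversation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# Same class_sep values as in the snippet above, one per group.
class_sep_by_group = {"Woman": 1.0, "Man": 2.0, "Unspecified": 0.5}

X_parts, y_parts, gender_parts = [], [], []
for group, sep in class_sep_by_group.items():
    X_g, y_g = make_classification(
        n_samples=500,
        n_features=20,
        n_informative=4,
        n_classes=2,
        class_sep=sep,
        random_state=rng,
    )
    X_parts.append(X_g)
    y_parts.append(y_g)
    gender_parts.append(np.repeat(group, 500))

X = np.vstack(X_parts)
y = np.concatenate(y_parts)
gender = np.concatenate(gender_parts)

X_train, X_test, y_train, y_test, gender_train, gender_test = train_test_split(
    X, y, gender, test_size=0.3, random_state=rng
)

# Illustrative model choice (an assumption, any classifier would do).
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy on the test rows of each group separately.
accuracy_by_group = {
    group: accuracy_score(y_test[gender_test == group], y_pred[gender_test == group])
    for group in class_sep_by_group
}
for group, acc in accuracy_by_group.items():
    print(f"{group}: {acc:.3f}")
```

Any gap between the printed accuracies is the kind of disparity the examples could then analyze or mitigate.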

@MiroDudik suggested extending this to have at least 2 sensitive features and 1 control feature to allow us to use it in basically all our examples.
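
One way this extension could look is sketched below. Everything beyond make_classification is an assumption made for illustration: the function name make_synthetic_dataset, the second sensitive feature age_group, the class_sep values per subgroup, and the way the control feature is drawn are all hypothetical and not an agreed design.

```python
import numpy as np
from sklearn.datasets import make_classification


def make_synthetic_dataset(n_per_group=200, seed=42):
    """Hypothetical helper (not an existing fairlearn function).

    Returns features, labels, a two-column array of sensitive features
    (gender, age_group) and a one-dimensional control feature.
    """
    rng = np.random.RandomState(seed)

    # Illustrative class_sep per (gender, age_group) subgroup.
    class_sep = {
        ("Woman", "Under 40"): 1.0,
        ("Woman", "40 and over"): 0.8,
        ("Man", "Under 40"): 2.0,
        ("Man", "40 and over"): 1.5,
        ("Unspecified", "Under 40"): 0.5,
        ("Unspecified", "40 and over"): 0.5,
    }

    X_parts, y_parts, sensitive_rows = [], [], []
    for (gender, age_group), sep in class_sep.items():
        X_g, y_g = make_classification(
            n_samples=n_per_group,
            n_features=20,
            n_informative=4,
            n_classes=2,
            class_sep=sep,
            random_state=rng,
        )
        X_parts.append(X_g)
        y_parts.append(y_g)
        sensitive_rows.extend([(gender, age_group)] * n_per_group)

    X = np.vstack(X_parts)
    y = np.concatenate(y_parts)
    sensitive_features = np.array(sensitive_rows)  # columns: gender, age_group

    # Control feature drawn independently of subgroup and label (an assumption).
    control_feature = rng.choice(["Low", "High"], size=len(y))

    return X, y, sensitive_features, control_feature


X, y, sensitive_features, control_feature = make_synthetic_dataset()
print(X.shape, y.shape, sensitive_features.shape, control_feature.shape)
```

Returning the sensitive and control features separately from X keeps them out of the model inputs by default while leaving them available for metrics and mitigation.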

@fairlearn/fairlearn-maintainers any objection to putting this in the fairlearn.datasets module?
