ENH Add "adversarial debiasing" #973
Conversation
- implemented tensorflow part - fit, partial_fit framework implementation - input validation - added TODOs
- moved some stuff - thought about structure, now only predictor is variable - started predict()
worked on UCI adult example!
hildeweerts
left a comment
I have embarrassingly little experience with PyTorch/TensorFlow, so I've mostly added a few nitpicks re. naming and such.
    respectively. If none is specified, default is torch, else tensorflow,
    depending on which is installed.
predictor_model : torch.nn.Module, tensorflow.keras.Model
Should we change predictor_model to estimator? I guess it's not strictly an estimator in the scikit-learn sense because it specifically requires a neural network, but it would be more consistent with the reductions module and ThresholdOptimizer.
Co-authored-by: Hilde Weerts <24417440+hildeweerts@users.noreply.github.com>
…en/fairlearn into adversarial_debiasing
# Copyright (c) Microsoft Corporation and Fairlearn contributors.
# Licensed under the MIT License.

from tensorflow.keras import Model
torch and tensorflow are somewhat problematic imports. We can't add them to the default dependencies of fairlearn. However, you could check if they're installed before importing and otherwise surface an error message. We're doing that for matplotlib elsewhere:
raise RuntimeError(_MATPLOTLIB_IMPORT_ERROR_MESSAGE)
Another question is whether we should have a default way of installing tensorflow and torch, for example fairlearn[torch] or fairlearn[tensorflow]. Such "extras" would need to be defined in setup.py, or rather through another requirements-*.txt file.
Finally, we'd need another set of installation tests. You can check test/install for examples on how we do that for matplotlib. Basically, this is to check that everything works as expected in the case that we don't have these packages installed.
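As an illustration of that guard, here is a minimal sketch of an optional-import check; the message constant and helper name are made up for the example and are not fairlearn's actual internals:

```python
# Hypothetical names, for illustration only.
_TORCH_IMPORT_ERROR_MESSAGE = (
    "torch is required for this functionality; install it, e.g., via 'pip install torch'."
)


def _get_torch():
    """Import torch lazily and surface a clear error if it is missing."""
    try:
        import torch
    except ImportError:
        raise RuntimeError(_TORCH_IMPORT_ERROR_MESSAGE)
    return torch
```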
alpha : float, default = 0.1
    A small number $\alpha$ as specified in the paper.
cuda : bool, default = False
Would cuda require extra dependencies? If so, we'd need to test this in two configurations: with and without cuda.
Yes, you need a GPU, a special GPU driver (the NVIDIA CUDA Toolkit, I think), and an extra pip install of torch+cuda or something like that (https://pytorch.org/get-started/locally/). But torch.cuda.is_available() should only be True if the system supports CUDA.
Is this at all testable on the CI server? I wouldn't know where to start.
I believe TensorFlow models automatically run on a single GPU if the TensorFlow install is set up properly (with CUDA), and run on the CPU otherwise. So I was thinking about removing this argument and defaulting to using the GPU if available?
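For reference, a minimal sketch of the "default to the GPU when available" behaviour being discussed, assuming a torch backend (illustrative, not this PR's code):

```python
import torch

# Pick the GPU when torch reports CUDA support, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Models and tensors would then be moved to this device before training.
x = torch.zeros(4, 3, device=device)
```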
fairlearn/adversarial/__init__.py
Outdated
# Copyright (c) Microsoft Corporation and Fairlearn contributors.
# Licensed under the MIT License.

"""Adversarial techniques to help mitigate fairness disparities."""
Hmm, broadly I would say "to mitigate unfairness." For this particular one it's more about making the model lose the ability to distinguish between sensitive feature groups, right? That is intended to make it fairer, although there's no real guarantee associated with that. Does anyone else have thoughts about naming (including the class name)?
The technique does specifically optimize for a particular fairness constraint (demographic parity or equalized odds). I think the assumption is that if the model is penalized for learning the sensitive feature, the model's predictions are encouraged to be independent of the sensitive feature, which would satisfy demographic parity. For equalized odds the idea is similar, but then we condition on both the sensitive feature and the ground-truth target variable.
So maybe something like: "Adversarial techniques for learning neural networks under fairness constraints." Or something like that? I suppose in theory the approach could be extended beyond neural networks, so we could also say "for machine learning under fairness constraints".
FYI in fairlearn.reductions we currently have: "This module contains algorithms implementing the reductions approach to disparity mitigation." - we might want to reconsider that description.
Small note: the paper does show that under some typical assumptions (among them a sufficiently large adversarial model and that both models converge, which needn't hold in practice), the constraint (demographic parity or equalized odds) is satisfied at convergence. For a toy example I was able to reproduce this consistently, but not (yet) for the UCI Adult dataset.
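For reference, the predictor update used in the paper subtracts from the predictor gradient its projection onto the adversary gradient, plus an alpha-scaled adversary gradient. A minimal NumPy sketch of that step (illustrative only, not this PR's code):

```python
import numpy as np


def predictor_step(w, grad_pred, grad_adv, alpha, lr=0.01):
    """One predictor update in the style of Zhang et al. (2018):
    grad_pred - proj_{grad_adv}(grad_pred) - alpha * grad_adv."""
    proj = (grad_pred @ grad_adv) / (grad_adv @ grad_adv + 1e-12) * grad_adv
    return w - lr * (grad_pred - proj - alpha * grad_adv)


w = predictor_step(
    w=np.zeros(3),
    grad_pred=np.array([1.0, 0.5, 0.0]),
    grad_adv=np.array([0.0, 1.0, 0.0]),
    alpha=0.1,
)
```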
I mean, this is the __init__ file, not sure it even needs a comment :D
I don't like it either, but flake8 is telling me to
Agreed with others that this is not a biggie, but people copy-paste. So for the sake of consistency with what we say elsewhere, I'd just say "help mitigate unfairness" (fairness disparities is weird).
Co-authored-by: Roman Lutz <romanlutz13@gmail.com>
…en/fairlearn into adversarial_debiasing
@hildeweerts @romanlutz The author of the paper confirmed that the sensitive feature and prediction could be more than one-dimensional, so I am working hard to make this work. I want to model the API as follows:
However, I have two points of concern:
This is perhaps naive for reasons I haven't quite thought through yet, but how about reading the targets and deciding if it's binary, multiclass, or regression based on that? ExponentiatedGradient does this (without multiclass), AFAIK without a dedicated input variable.
For binary/multiclass there's a whole bunch of utils in scikit-learn that may be of help. For regression a separate approach may be needed; I bet @adrinjalali has thoughts on this as well.
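For example, scikit-learn's type_of_target utility can make exactly this kind of decision (a small sketch; not necessarily how this PR does it):

```python
from sklearn.utils.multiclass import type_of_target

print(type_of_target([0, 1, 1, 0]))      # 'binary'
print(type_of_target([0, 1, 2, 1]))      # 'multiclass'
print(type_of_target([0.3, 1.7, 2.5]))   # 'continuous'
```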
Yes, it makes much more sense to me to have two classes, which share most of the code in a parent class, and they do the specific parts for classification and regression there.
@romanlutz That seems like a good idea actually! Now thinking about it, even if the user wants to do regression while all labels are either 0 or 1, a multinomial distribution will fit better than a normal distribution anyway, so we might want to make this decision for the user.

@hildeweerts @adrinjalali Thanks, that structure makes sense! Virtually all code would be shared, but that is also done elsewhere. However, the problem remains for the sensitive_features.

I'm kind of tempted to only support binary and continuous sensitive_features, as these can be mixed freely and don't span multiple columns (like multiclass as one-hot encoding does), so this would be a clear and concise solution. Or is there a lot of use for also supporting multiclass features, and letting the users map various groups of columns of sensitive_features?
Tutorials like to pretend that everything is binary, but in practice there's hardly any sensitive feature that can truly be considered binary. So my first reaction would be to do things the other way around: assume none of the features are one-hot encoded and do one-hot encoding internally for multicategorical features (if necessary?). To distinguish categorical / continuous features I could imagine a dedicated argument.

The independence assumption should be described clearly in the documentation btw, because sensitive features may be statistically related even if they are not one-hot-encoded.
Okay, I think I was able to incorporate all of the comments now, so I will write them down here. Let's call X, Y, Z the input, prediction, and sensitive features from now on. All data are pd.DataFrame.

Training

Before training for the first time, we need to preprocess X, Y, Z.

Predicting

Preprocess the input using the previous mappings and pass it through the model.

Sklearn

From sklearn I can use OneHotEncoding.
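A minimal sketch of the "fit the mappings once, reuse them at predict time" idea, assuming scikit-learn's OneHotEncoder for the categorical parts of Y and Z (illustrative, not this PR's code):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Z: sensitive features as a DataFrame (the same idea applies to a categorical Y).
Z = pd.DataFrame({"sex": ["F", "M", "F"], "age_group": ["<40", ">=40", "<40"]})

encoder = OneHotEncoder()
Z_train = encoder.fit_transform(Z).toarray()   # float one-hot columns used during training

# At predict time the previously fitted mapping is re-applied to new data.
Z_new = pd.DataFrame({"sex": ["M"], "age_group": ["<40"]})
Z_pred = encoder.transform(Z_new).toarray()
```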
MiroDudik
left a comment
I'm not fully done yet with my pass. I'm focusing on API and documentation.
fairlearn.postprocessing
fairlearn.preprocessing
fairlearn.reductions
fairlearn.adversarial
Because of the existing conflicts the API docs webpage does not show yet. I'll review it once it renders.
Conflicts have been resolved, but it still seems that the webpage on CI doesn't process things correctly. Not sure what's going on.
the adversary will attain a loss equal to the entropy, so the adversary
can not predict the sensitive features from the predictions.
Moreover, this model can be trained for either *demographic parity* or
In the original paper, they simply suggest restricting training of the adversary to y=0 and y=1. I suggest we leave this for the future though, because the implied notion of fairness would be somewhat different from what we call TruePositiveRateParity and FalsePositiveRateParity.
from numpy import zeros, argmax, arange


class AdversarialFairness(BaseEstimator):
Actually, I think I'd be in favor of keeping this one as is--AdversarialFairness. Adding Estimator feels very redundant. I haven't found a single instance of the naming pattern ...Estimator for concrete estimators in sklearn. Also, we don't say things like ExponentiatedGradientEstimator just ExponentiatedGradient.
    one-hot encodings, and it maps strictly continuous-valued (possibly 2d) data
    to itself.
a_transform : sklearn.base.TransformerMixin, default = fairlearn.adversarial.FloatTransformer("auto")
Since the argument name is sensitive_features, I think this should be called sf_transform.
    Must be the same type as the
    :code:`predictor_model`.
predictor_loss : str, callable, default = 'auto'
I don't love the current keyword choices for predictor_loss, adversary_loss, because they seem to refer to the type of the target, rather than the loss. If we go that route, we should consider using something similar to sklearn's target types.
Alternatively, we could use keywords that describe the loss, like "square_loss", "logistic_loss"... but let me think a bit more about this.
You raise an excellent point. Ideally, we'd also provide something like Y_distribution_type and A_distribution_type. I've thought about this before, but I can't remember why I let go of the idea. We should not get rid of the loss parameters though, and we would have to be sure that the inferred distribution type of the preprocessor agrees.
I had some further thoughts on this.
I think that the very basic question is whether we want to represent binary classification problems via networks with a single output or two outputs... so that should be decided upfront.
With that said, what do you think about the following tweaks of the current API:
- base class (currently AdversarialFairness)
  - allow specifying y_transform, sf_transform, predictor_loss, adversary_loss, predictor_function
  - they can take values None (no transformation; this might still mean casting ints as floats?), 'auto' (default), a callable, and additionally:
    - for predictor/adversary_loss, we support 'logistic_loss', 'square_loss'
      - if needed, we could also distinguish 'multinomial_logistic_loss' (which acts on one-hot encoding), but this could be inferred from the number of outputs of the network
    - for y/sf_transform, we support 'one_hot_encoder'
    - for predictor_function, we support 'argmax' (for one-hot representation) and 'threshold' (for one-output representation of binary classifiers), with an additional argument threshold_value with the default value 0
- AdversarialFairnessClassifier
  - allow specifying sf_transform, adversary_loss
  - fill in y_transform, predictor_loss, adversary_loss, predictor_function so as to support (multinomial) logistic regression; I wouldn't allow overriding these (if that's what you want to do, you can just call the base class)
  - besides predict and decision_function, we should consider supporting predict_proba and predict_log_proba
- AdversarialFairnessRegressor
  - allow specifying sf_transform, predictor_loss, adversary_loss
  - fill in y_transform, predictor_function as None

And one more idea--not sure how much I like it, but something that occurred to me. We could change:
- predictor_loss -> y_loss
- adversary_loss -> sf_loss
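To make the proposal concrete, a hypothetical instantiation of the base class under these keywords; the class and parameter names are taken from the proposal above, not the final fairlearn API, and predictor_network/adversary_network stand in for user-provided models:

```python
mitigator = AdversarialFairness(
    predictor_model=predictor_network,
    adversary_model=adversary_network,
    predictor_loss="logistic_loss",    # single-output binary predictor
    adversary_loss="square_loss",      # continuous sensitive feature
    predictor_function="threshold",
    threshold_value=0,
    y_transform=None,
    sf_transform=None,
)
```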
I like your thoughts! I really appreciate you taking the time!
I just want to reiterate that we (it was a suggestion from adrin or hilde, I think) made the design choice that as much as possible is automatically inferred from the data, i.e., what kind of transform/loss/predictor function to use. Looking at this now, I am not so sure why I included y/sf_transform as parameters, because without this FloatTransformer class (FloatTransformer is the default y/sf_transform) we lose this automatic inference. The nice thing is that if the user has already applied a one-hot-encoding, the FloatTransformer class will still infer and tell AdversarialFairness that the data is categorical, which is required to infer what loss function to use. I agree that the current API needs changing, but I do not see yet how we can resolve this nicely.
- for y/sf_transform, we support 'one_hot_encoder'
I think I like this keyword-style better than how it is currently done, but I'm still on the fence. Currently, you can achieve this sf_transform='one_hot_encoder' with sf_transform=FloatEncoder('categorical'), so what you suggest is definitely cleaner. What do you think should happen if the user provides a custom transform? Should we then require the user to (1) also pass loss functions, because we can no longer infer them from FloatEncoder? Or (2) should we do the inferring outside of FloatEncoder? or (3) should we not even let the user pass a custom transform? I find this a difficult choice because all options feel bad in some sense. I've actually switched implementations from (2) to (1) in the past, but I am now tending back towards (2) and using the keywords you suggested. I'd love to hear your thoughts on this. @adrinjalali you were quite involved in this design choice regarding preprocessing in the past, so perhaps you can help us here.
- for predictor/adversary_loss, we support 'logistic_loss', 'square_loss'
Currently we accept 'binary', 'category', 'continuous'. I chose those because they are descriptive of the distribution that you assume, but I understand that you favor more precise names such as 'logistic_loss', 'square_loss', or 'argmax'? I might actually agree.
- fill in y_transform, predictor_loss, adversary_loss, predictor_function so as to support (multinomial) logistic regression
Then we'd still need to infer whether it is categorical or binary, I'd say, hence I'd still love to know the distribution type somehow (rather by inferring it from the data than through a keyword parameter).
And one more idea--not sure how much I like it, but something that occurred to me.
I like this!
- besides predict and decision_function, we should consider supporting predict_proba and predict_log_proba
Yeah sure!
Let me try to answer your questions--but also let me know if I left something unanswered!
What to do about 'auto' loss functions and predictor_function with custom y/sf_transform.
- The behavior should be the same as what we do when the transform is None. I'm not sure what you do currently, but a sensible option would be an automatic inference among the following (a sketch of this inference follows below):
  i. univariate {0.0, 1.0} (-> univariate logistic model with sign-based prediction function),
  ii. one-hot-encoded categorical (-> multinomial logistic with argmax prediction function),
  iii. continuous univariate or vectors (-> square loss/L2 norm with identity prediction function),
  iv. otherwise: throw an exception
Loss versus link function ("distribution assumption").
- I think by "distribution assumption" you mean the link function that is used in generalized linear models. I'd be in favor of keeping the idea of link (which is basically our prediction_function) separate from loss function. For example, it would be awkward to specify Huber loss (for robust regression) using the language of link functions.
Categorical vs. binary encoding in classification.
- For this one, I'm not sure that we have any option other than inferring from the provided y. My idea would be to support binary vs. multiclass classification similarly to sklearn's LogisticRegression.
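A rough sketch of the inference rules i.-iv. above, assuming NumPy arrays; purely illustrative, not fairlearn's implementation:

```python
import numpy as np


def infer_defaults(y):
    """Map a target array to (loss keyword, prediction-function keyword) per i.-iv."""
    y = np.asarray(y)
    values = set(np.unique(y).tolist())
    if y.ndim == 1 and values <= {0, 1}:
        return "logistic_loss", "threshold"              # i. univariate {0.0, 1.0}
    if y.ndim == 2 and values <= {0, 1} and np.all(y.sum(axis=1) == 1):
        return "multinomial_logistic_loss", "argmax"     # ii. one-hot-encoded categorical
    if np.issubdtype(y.dtype, np.floating):
        return "square_loss", None                       # iii. continuous, identity prediction
    raise ValueError("Cannot infer a sensible default for this target.")  # iv.


print(infer_defaults(np.array([0, 1, 1, 0])))        # ('logistic_loss', 'threshold')
print(infer_defaults(np.array([[1, 0], [0, 1]])))    # ('multinomial_logistic_loss', 'argmax')
print(infer_defaults(np.array([0.2, 3.4, -1.0])))    # ('square_loss', None)
```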
Thanks, this was really clarifying!
What to do about 'auto' loss functions and predictor_function with custom y/sf_transform.
Oh, so we do this inference if the transform is None, and otherwise just apply the transform and infer afterwards. That sounds good; basically, this decouples the inference from the transform. Okay! Apart from that, I make the same inferences (i. through iv.). (I initially chose to be a bit more general and accept mixed column types (so, for instance, binary and continuous columns), but we decided there was no use for this.)
Loss versus link function ("distribution assumption").
I must say I am learning a lot here; I was not familiar with these terms. I understand the reason for keeping these things separate now, thanks!
@MiroDudik Hey, I finally got around to this; it was a bit more work than I anticipated. What do you think? Internally, I still use "binary", "category", "continuous", because I feel this may be relevant to know later, but the API changed and accepts the better, more concrete terms now.
As for predict_(log_)proba, I am not sure how to interpret this in the case of a single output neuron. There is decision_function, which might give enough flexibility?
See my comment above -> if we specialize to logistic models, then predict_proba and predict_log_proba are obvious, right? These would be only provided for AdversarialMitigationClassifier.
Co-authored-by: Roman Lutz <romanlutz13@gmail.com> Co-authored-by: MiroDudik <mdudik@gmail.com>
self.warm_start = warm_start
self.random_state = random_state

def __setup(self, X, Y, A):
Suggested change: def __setup(self, X, Y, A): -> def _setup(self, X, Y, A):
The canonical way to indicate that a method is private is a single leading underscore.
# Numbers
check_scalar(self.threshold_value, "threshold_value", (int, float))

# Non-negative numbers
I'm agnostic on whether these should be a separate method or not. Either way is fine with me.
class FloatTransformer(BaseEstimator, TransformerMixin):
    """
    Transformer that maps dataframes to numpy arrays of floats.
What I meant was to have a ColumnTransformer that does exactly what this transformer does. The output would be an array of floats, and then nothing would be different. You'd apply your ColumnTransformer internally on your y.
I like those too, it's much simpler. I was aware of them, but there are some reasons why I am not using this:
Maybe you know of a better way to achieve this?
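For reference, a small sketch of the ColumnTransformer route described above, assuming the goal is a dense float array for y (illustrative only):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), make_column_selector(dtype_include=object))],
    remainder="passthrough",   # numeric columns are passed through unchanged
    sparse_threshold=0.0,      # force a dense output array
)

y = pd.DataFrame({"label": ["a", "b", "a"], "score": [0.1, 0.7, 0.3]})
y_float = ct.fit_transform(y)  # shape (3, 3): two one-hot columns plus the score column
```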
(choose :math:`\alpha` closer to zero) or increasing fairness
(choose larger :math:`\alpha`).
epochs : int, default = 1
What would be more common? epochs or num_epochs as here: https://aif360.readthedocs.io/en/latest/modules/generated/aif360.sklearn.inprocessing.AdversarialDebiasing.html
Also, I don't see documentation for max_iter.
epochs : int, default = 1
    Number of epochs to train for.
batch_size : int, default = -1
I would be in favor of changing the default for batch_size. I think that any default in the range 1-32 will work (the paper above suggests 2-32). The AIF360 implementation uses num_epochs=50, batch_size=128.
from numpy import zeros, argmax, arange


class AdversarialFairness(BaseEstimator):
I thought about all of this some more, and I'm still finding parts of the constructor API unnecessarily complicated--and there are various dependencies spread across multiple parameters. In particular, I'd like to suggest an alternative to what's currently handled by: predictor_loss, adversary_loss, predictor_function, y_transform, sf_transform
AdversarialMitigationClassifier

- can handle binary classification and multi-class classification
- for encoding of y, we follow sklearn's target types, which are described here
- Binary classification
  - y has shape (n_samples,) and it contains two values, typically {0,1}; if other values are provided they are transformed into {0,1} during fit and predict
  - the neural net is outputting a single scalar, corresponding to the logit of P(Y=1|X), and is trained by minimizing logistic loss
- Multiclass classification
  - y has either shape (n_samples, n_classes) with 0/1 values implementing one-hot-encoding, or it has shape (n_samples,) and contains 3 or more distinct values; in the latter case, it is transformed into one-hot-encoding during fit and predict
  - the neural net is outputting a vector of n_classes scalars, corresponding to the linear scores of a multinomial logistic model of P(Y|X), and is trained by minimizing log loss (this is called CrossEntropyLoss in PyTorch)
- Adversarial model
  - sensible defaults around fitting sensitive features are a bit more tricky, but I think that we should begin by covering the common case of binary or categorical sensitive features:
    - sensitive_features is shape (n_samples,) or (n_samples, n_sensitive_features) where every column has two distinct values -> columns are transformed into {0,1}, and the training loss is the sum of logistic losses (i.e., treated as the sum of binary problems)
    - sensitive_features is shape (n_samples,) or (n_samples, n_sensitive_features) where at least one column has three or more values -> columns are transformed using one-hot-encoding, and the training loss is the sum of log losses (i.e., treated as the sum of multiclass problems)
- This means that we could get rid of the parameters predictor_loss, adversary_loss, predictor_function, y_transform, sf_transform.
  - If we want to expose more generality (say continuous sensitive features), we could do that later.

AdversarialMitigationRegressor

- y has shape (n_samples,) and contains floats; the loss is square loss
- sensitive features and the adversarial model are treated the same way as in AdversarialMitigationClassifier
- again, this means we could get rid of predictor_loss, adversary_loss, predictor_function, y_transform, sf_transform, and expose more generality later

AdversarialMitigation

- I suggest making this private; in that case I don't care too much about the API.
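To illustrate, a hypothetical usage of the two simplified classes; the class names come from the proposal above (not necessarily the released API), and predictor_net/adversary_net stand in for user-supplied networks:

```python
clf = AdversarialMitigationClassifier(
    predictor_model=predictor_net, adversary_model=adversary_net, alpha=0.1
)
clf.fit(X, y, sensitive_features=sf)             # y handled like sklearn target types
y_proba = clf.predict_proba(X_test)

reg = AdversarialMitigationRegressor(
    predictor_model=predictor_net, adversary_model=adversary_net, alpha=0.1
)
reg.fit(X, y_continuous, sensitive_features=sf)  # floats, trained with square loss
y_pred = reg.predict(X_test)
```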
Meeting summary: have a separate PR from this one with the minimal implementation, as small as we're happy to have it released, and keep the main base class private, at least for now.
@adrinjalali @MiroDudik @romanlutz Moved to #1079 :)
Is this still active, or should it be closed in favour of #1079?
Close in favor of #1079 |
Add Adversarial Debiasing algorithm. Replaces PR #973
Solves issue #785 and implements the Adversarial Debiasing paper
Updated 24 March.
UPDATE
Should have responded to all feedback now
Old
To summarize important design choices:

- API: fit(X, y, sensitive_features), predict(X), decision_function(X).
- We preprocess y and sensitive_features, but the user needs to preprocess X. (The models are NNs, so we require numeric inputs everywhere.)
- BackendEngine provides PyTorch/TensorFlow-specific code.
- y and sensitive_features can be from arbitrary distributions. We try to infer the distribution of this data and use this to choose appropriate preprocessors (y_transform and a_transform), loss functions (predictor_loss and adversary_loss), and the decision function (predictor_function). Currently, we only infer whether y or sensitive_features is univariate binomial, univariate multinomial, or multivariate normal. If we can infer such a distribution, we know precisely what to choose as the aforementioned kwargs. Otherwise, the user must supply these kwargs explicitly.
- predictor_optimizer and adversary_optimizer are kwargs, because in practice we see many different optimizers used.
- callbacks: I found supporting callback functions is particularly useful (as is done in skorch, for instance).
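A hedged usage sketch of the design summarized above; the parameter names come from this summary, the callback signature is made up for illustration, and the released fairlearn API may differ:

```python
def log_progress(*args, **kwargs):
    # Illustrative callback; the summary only says callbacks are supported, not their signature.
    print("finished a training step")


mitigator = AdversarialFairness(
    predictor_model=predictor_net,    # user-supplied torch or keras model (placeholder name)
    adversary_model=adversary_net,
    predictor_optimizer="Adam",       # optimizers are plain keyword arguments; value is illustrative
    adversary_optimizer="Adam",
    callbacks=[log_progress],         # skorch-style callback support
)
mitigator.fit(X, y, sensitive_features=sf)   # y and sensitive_features are preprocessed internally
scores = mitigator.decision_function(X_test)
labels = mitigator.predict(X_test)
```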
fit(X, y, sensitive_feature),predict(X),decision_function(X).yandsensitive_features, but the user needs to preprocessX. (the models are NN's, so we require numeric inputs everywhere)BackendEngineprovides PyTorch/TensorFlow-specific code.yandsensitive_featurescan be from arbitrary distributions. We try to infer the distribution of this data and use this to choose appropriate preprocessors (y_transformanda_transform), loss functions (predictor_lossandadversary_loss), and the decision function (predictor_function). Currently, we only infer whetheryorsensitive_featuresis univariate binomial, univariate multinomial, or multivariate normal. If we can infer such a distribution, we know precisely what to choose as aforementioned kwargs. Otherwise, the user must supply these kwargs explicitly.predictor_optimizerandadversary_optimizerare kwargs, because in practice we see many different optimizers used.callbacks. I found supporting callback functions is particularly useful (as is done in skorch for instance)