Force `fp32` in `attention.MultiHeadDotProductAttention` for softmax operator

Hi,

While implementing mixed precision training with Flax for my project, I noticed that the `force_fp32_for_softmax` flag defined in `attention.MultiHeadDotProductAttention` is not passed to `dot_product_attention` (the default attention function).
<img width="464" alt="Screenshot 2024-06-18 at 11 18 38 PM" src="https://github.com/google/flax/assets/90777911/81ee2c74-2dc9-4fd7-a76c-2d9c921df8ff">
<img width="461" alt="Screenshot 2024-06-18 at 11 17 55 PM" src="https://github.com/google/flax/assets/90777911/93d2c348-3837-4fa4-b74d-30778c7e16f6">
<img width="441" alt="Screenshot 2024-06-18 at 11 19 21 PM" src="https://github.com/google/flax/assets/90777911/b2daa7c9-45ef-41ca-aa2d-db594484dd35">

I think this might lead to a loss of control over the precision of the softmax operator and cause stability issues under `bf16` or `fp16`, so I wonder if there is an alternative. Thanks!
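
For reference, here is a minimal sketch of the kind of workaround I have in mind. It is not Flax's implementation: `fp32_softmax_attention` is a hypothetical name, the `[batch..., length, num_heads, head_dim]` layout is assumed to match what `dot_product_attention` expects, and attention dropout is deliberately left out. It only upcasts the logits to `float32` for the softmax and casts the weights back afterwards.

```python
import jax
import jax.numpy as jnp


def fp32_softmax_attention(query, key, value, bias=None, mask=None,
                           dtype=None, **unused_kwargs):
    """Dot-product attention with the softmax computed in float32.

    Hypothetical helper, not part of Flax. Inputs are assumed to have shape
    [batch..., length, num_heads, head_dim]. Dropout is NOT implemented:
    any dropout-related keyword arguments are accepted and ignored.
    """
    out_dtype = dtype if dtype is not None else query.dtype
    depth = query.shape[-1]
    # Scale queries in the input precision.
    query = query / jnp.sqrt(depth).astype(query.dtype)
    # Logits have shape [batch..., num_heads, q_length, kv_length].
    logits = jnp.einsum('...qhd,...khd->...hqk', query, key)
    # Upcast before bias, mask, and softmax so the normalization runs in fp32.
    logits = logits.astype(jnp.float32)
    if bias is not None:
        logits = logits + bias.astype(jnp.float32)
    if mask is not None:
        # Masked-out positions get the most negative fp32 value.
        logits = jnp.where(mask, logits, jnp.finfo(jnp.float32).min)
    weights = jax.nn.softmax(logits, axis=-1)
    # Cast back so the weighted sum runs in the low-precision dtype.
    weights = weights.astype(out_dtype)
    return jnp.einsum('...hqk,...khd->...qhd', weights, value.astype(out_dtype))
```

If something like this is acceptable, it could presumably be supplied through the `attention_fn` argument of `MultiHeadDotProductAttention`, although I have not verified that the keyword arguments line up across Flax versions.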

Force `fp32` in `attention.MultiHeadDotProductAttention` for softmax operator #4008