Description
Bug report
It looks like the attention FLOPs calculation assumes the full QK and attention-times-V products must always be computed, but with causal masking each query only needs roughly half of the key/value positions. We noticed this because our local attention calculations count precisely how much of the key/value each query needs, which reduces the FLOP count substantially. As a result, the top-line TFLOP/s/device number looks significantly worse, but this is mostly a false signal: the global attention calculation is crediting FLOPs that never need to be done.
We weren't sure whether there is an accepted best practice for this sort of calculation, so we checked Megatron, and it does divide the attention FLOPs by 2 to account for causal masking. I think we should do the same here.
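To make the proposed accounting concrete, here is a minimal sketch of the standard attention matmul FLOP count with an optional causal halving factor. The function name and signature are hypothetical, not the repo's actual code; it counts only the two big matmuls (QK^T and the attention-weighted V product), ignoring softmax and projections.

```python
def attention_matmul_flops(batch: int, heads: int, seq_len: int,
                           head_dim: int, causal: bool = False) -> int:
    """Count FLOPs for the QK^T and attn@V matmuls (hypothetical helper)."""
    # QK^T: an (s x d) @ (d x s) matmul costs 2 * s * s * d FLOPs per head.
    qk_flops = 2 * batch * heads * seq_len * seq_len * head_dim
    # attn @ V: an (s x s) @ (s x d) matmul costs the same.
    av_flops = 2 * batch * heads * seq_len * seq_len * head_dim
    total = qk_flops + av_flops
    # With a causal mask, query i only attends to keys 0..i, so on
    # average half the score matrix is needed -- hence the divide-by-2.
    return total // 2 if causal else total
```

With this convention, the reported model FLOPs drop by half for the attention term, which is what brings the TFLOP/s/device number in line with the work actually performed.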
Logs/Output
No response
Environment Information
No response
Additional Context
No response