Conversation

@dvruette (Contributor) commented on Aug 2, 2025

Collection of various improvements:

  • Attention logits in model output: the model output now exposes the attention logits, which is useful for experiment tracking and debugging.
  • Attention softcap: tanh-based logit soft-capping, as used by Gemma, is now also supported in vanilla attention (see the sketch after this list).
  • DP sharding of the batch: the batch is now correctly sharded along the DP axis (also sketched below). Depends on a breaking change in eformer ([fix] Apply with_sharding_constraint recursively to pytrees eformer#4).
  • Minor change to the check_interval dtype for memory tracking.
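
For context, here is a minimal, hypothetical sketch of tanh-based logit soft-capping in vanilla attention; the function name, `softcap` argument, and `return_logits` flag are illustrative and not this repository's actual API:

```python
# Hypothetical sketch of tanh soft-capping on attention logits (as popularized
# by Gemma). Names are illustrative, not the actual API of this repo.
import jax
import jax.numpy as jnp


def softcap_attention(q, k, v, softcap=50.0, return_logits=False):
    """Vanilla dot-product attention with optional logit soft-capping.

    q, k, v: [batch, heads, seq, head_dim]
    softcap: logits are smoothly squashed into (-softcap, softcap)
    return_logits: also return the (capped) logits for tracking/debugging
    """
    scale = q.shape[-1] ** -0.5
    logits = jnp.einsum("bhqd,bhkd->bhqk", q, k) * scale
    if softcap is not None:
        # Soft cap: a smooth, bounded alternative to hard clipping.
        logits = softcap * jnp.tanh(logits / softcap)
    probs = jax.nn.softmax(logits, axis=-1)
    out = jnp.einsum("bhqk,bhkd->bhqd", probs, v)
    return (out, logits) if return_logits else out


# Example usage on random inputs (self-attention with q = k = v).
q = jax.random.normal(jax.random.PRNGKey(0), (2, 4, 16, 32))
out, logits = softcap_attention(q, q, q, softcap=30.0, return_logits=True)
print(out.shape, logits.shape)  # (2, 4, 16, 32) (2, 4, 16, 16)
```

And a hedged sketch of what sharding the batch pytree along the DP axis can look like, assuming a mesh with a `dp` axis and relying on the recursive `with_sharding_constraint` behavior from the linked eformer change; the helper and axis names are illustrative:

```python
# Hypothetical sketch: constrain every leaf of a batch pytree to be sharded
# along the data-parallel ("dp") mesh axis. Mesh/axis names are illustrative.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=("dp",))


@jax.jit
def shard_batch_along_dp(batch):
    # Apply the constraint recursively to every array in the pytree,
    # sharding the leading (batch) dimension across the dp axis.
    spec = NamedSharding(mesh, P("dp"))
    return jax.tree_util.tree_map(
        lambda x: jax.lax.with_sharding_constraint(x, spec), batch
    )


bs = 4 * jax.device_count()  # keep the batch divisible by the dp axis size
batch = {
    "input_ids": jnp.zeros((bs, 128), jnp.int32),
    "attention_mask": jnp.ones((bs, 128), jnp.int32),
}
batch = shard_batch_along_dp(batch)
```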
