Conversation

opooladz commented Oct 4, 2025

Summary

Enhanced DistillationTrainer with advanced distillation strategies for more effective knowledge transfer from teacher to student models.

New Features

1. Attention Transfer Distillation

  • Matches attention patterns between teacher and student models
  • Uses cosine distance to align attention maps across layers
  • Supports layer-wise selection for targeted matching
  • Properly handles padding with correct normalization

2. Feature Matching Distillation

  • Matches intermediate hidden representations between models
  • Uses MSE loss to align feature spaces
  • Supports layer-wise selection
  • Properly handles padding with correct normalization

3. Automatic Dimension Matching

  • Pooling-based shape alignment for different architectures
  • Handles mismatches in:
    • Number of layers (e.g., 80 → 24)
    • Hidden dimensions (e.g., 8192 → 2048)
    • Attention heads (e.g., 64 → 32)
  • Zero trainable parameters (pooling-only approach); see the sketch after this list
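As a rough illustration of the pooling-only approach, the sketch below shows how average pooling and evenly spaced layer selection can bridge mismatched shapes without trainable parameters. The helper names and shapes are hypothetical, not the actual contents of pooling.py.

```python
import jax.numpy as jnp

def pool_hidden_to_student(teacher_hidden: jnp.ndarray, student_dim: int) -> jnp.ndarray:
    """Reduce the teacher's hidden dimension to the student's by average pooling.

    teacher_hidden: (batch, seq_len, teacher_dim), where teacher_dim is assumed
    to be an integer multiple of student_dim (e.g. 8192 -> 2048, factor 4).
    Mismatched attention-head counts (e.g. 64 -> 32) can be pooled the same way
    over the head axis.
    """
    batch, seq_len, teacher_dim = teacher_hidden.shape
    factor = teacher_dim // student_dim
    # Group adjacent channels and average them: no learnable parameters involved.
    return teacher_hidden.reshape(batch, seq_len, student_dim, factor).mean(axis=-1)

def select_teacher_layers(num_teacher_layers: int, num_student_layers: int) -> jnp.ndarray:
    """Pick evenly spaced teacher layers (e.g. 80 -> 24) to pair with student layers."""
    positions = jnp.linspace(0, num_teacher_layers - 1, num_student_layers)
    return jnp.round(positions).astype(jnp.int32)
```

Because only averaging and index selection are involved, the alignment adds no parameters and negligible compute.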

Files Changed

File | Lines | Description
--- | --- | ---
distillation_config.py | +82 | Added 6 new configuration parameters
_fn.py | +194 | Added 2 new loss functions with proper masking
distillation_trainer.py | +6 | Updated JIT compilation for new parameters
pooling.py | +167 | New utility for automatic feature shape matching
DISTILLATION_UPDATES.md | +692 | Comprehensive technical documentation

Total: ~1,180 lines added

Technical Details

Masking Implementation

Both new loss functions properly handle variable-length sequences (a sketch follows this list):

  • Attention Transfer: Zeros padded keys/queries, normalizes by valid_tokens × num_heads
  • Feature Matching: Normalizes by valid_tokens × hidden_dim for parity with unmasked branch
  • Invariance: Losses are invariant to padding amount, batch size, and sequence length
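To make those normalizations concrete, here is a minimal sketch of both losses under the stated masking rules. The function names, tensor layouts, and exact cosine-distance form are illustrative assumptions and may differ from the implementation in _fn.py.

```python
import jax.numpy as jnp

def attention_transfer_loss(student_attn, teacher_attn, attention_mask):
    """Cosine distance between attention maps, normalized by valid_tokens x num_heads.

    student_attn / teacher_attn: (batch, num_heads, seq_len, seq_len)
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    """
    # Zero out attention weights that point at padded key positions.
    key_mask = attention_mask[:, None, None, :]              # (batch, 1, 1, seq)
    s = student_attn * key_mask
    t = teacher_attn * key_mask
    # Cosine distance per (batch, head, query) attention row.
    dot = (s * t).sum(-1)
    norm = jnp.linalg.norm(s, axis=-1) * jnp.linalg.norm(t, axis=-1) + 1e-8
    cos_dist = 1.0 - dot / norm                               # (batch, heads, seq)
    # Only count rows whose query position is a real token.
    query_mask = attention_mask[:, None, :]                   # (batch, 1, seq)
    num_heads = student_attn.shape[1]
    valid = attention_mask.sum() * num_heads                  # valid_tokens x num_heads
    return (cos_dist * query_mask).sum() / valid

def feature_matching_loss(student_hidden, teacher_hidden, attention_mask):
    """MSE between hidden states, normalized by valid_tokens x hidden_dim."""
    # student_hidden / teacher_hidden: (batch, seq_len, hidden_dim)
    mask = attention_mask[:, :, None]                         # (batch, seq, 1)
    sq_err = ((student_hidden - teacher_hidden) ** 2) * mask
    hidden_dim = student_hidden.shape[-1]
    return sq_err.sum() / (attention_mask.sum() * hidden_dim)
```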

Design Choices

Feature | Implementation | Rationale
--- | --- | ---
Dimension Matching | Pooling | No learnable params, zero overhead
Architecture | Direct functions | Simpler than strategy pattern
Model Access | output_hidden_states / output_attentions | Uses existing outputs, no wrapping
Masking | Per-loss-type logic | Proper normalization for each loss
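The Model Access row assumes the models expose per-layer outputs through their existing forward flags. The snippet below is a hypothetical illustration of that access pattern (a transformers-style call signature is assumed; this is not the trainer's actual internals):

```python
# Hypothetical illustration: request per-layer outputs via flags the model
# already supports, rather than wrapping the model or registering hooks.
teacher_out = teacher_model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    output_attentions=True,
    output_hidden_states=True,
)
# Assuming transformers-style outputs, `hidden_states` and `attentions` are
# per-layer tuples, so the configured layers can be indexed directly.
selected_hidden = [teacher_out.hidden_states[i] for i in (6, 12, 18)]
selected_attn = [teacher_out.attentions[i] for i in (6, 12, 18)]
```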

Usage Example

import easydel as ed

# Basic logit distillation (unchanged)
config = ed.DistillationConfig(
    temperature=2.0,
    alpha=0.9,
)

# With attention transfer
config = ed.DistillationConfig(
    temperature=2.0,
    alpha=0.8,
    use_attention_transfer=True,
    attention_loss_weight=0.1,
    attention_match_layers=(6, 12, 18),  # Match specific layers
)

# With feature matching
config = ed.DistillationConfig(
    temperature=2.0,
    alpha=0.7,
    use_feature_matching=True,
    feature_loss_weight=0.2,
    feature_match_layers=(6, 12, 18),
)

# Combined strategies
config = ed.DistillationConfig(
    temperature=2.0,
    alpha=0.7,
    use_attention_transfer=True,
    attention_loss_weight=0.1,
    use_feature_matching=True,
    feature_loss_weight=0.2,
)

trainer = ed.DistillationTrainer(
    arguments=config,
    student_model=student,
    teacher_model=teacher,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

Backward Compatibility

Fully backward compatible

All new parameters default to False or None, so existing code continues to work without modification.

Testing Recommendations

  1. Backward compatibility: Verify existing distillation code works unchanged
  2. Attention transfer: Test with different model architectures
  3. Feature matching: Test with different hidden dimensions
  4. Dimension mismatch: Test teacher/student with very different architectures
  5. Padding invariance: Test with variable-length sequences (see the sketch after this list)
  6. Masking correctness: Verify loss denominators match expected formulas
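A minimal sketch of the padding-invariance check, reusing the hypothetical feature_matching_loss from the masking sketch above (the real test should call the project's own loss functions):

```python
import jax
import jax.numpy as jnp

batch, seq, hidden, pad = 2, 8, 16, 4

student = jax.random.normal(jax.random.PRNGKey(0), (batch, seq, hidden))
teacher = jax.random.normal(jax.random.PRNGKey(1), (batch, seq, hidden))
mask = jnp.ones((batch, seq), dtype=jnp.int32)

# Append `pad` padded positions: arbitrary activations with mask = 0.
student_padded = jnp.concatenate([student, jnp.zeros((batch, pad, hidden))], axis=1)
teacher_padded = jnp.concatenate([teacher, jnp.ones((batch, pad, hidden))], axis=1)
mask_padded = jnp.concatenate([mask, jnp.zeros((batch, pad), dtype=jnp.int32)], axis=1)

# The masked loss should not change when padding is added.
loss = feature_matching_loss(student, teacher, mask)
loss_padded = feature_matching_loss(student_padded, teacher_padded, mask_padded)
assert jnp.allclose(loss, loss_padded)
```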

See DISTILLATION_UPDATES.md for detailed testing procedures.

References

  • Hinton et al., "Distilling the Knowledge in a Neural Network" (2015)
  • Zagoruyko & Komodakis, "Paying More Attention to Attention" (2017)
  • Romero et al., "FitNets: Hints for Thin Deep Nets" (2014)

Documentation

Full technical documentation available in DISTILLATION_UPDATES.md including:

  • Implementation details
  • Code comparisons (before/after)
  • Masking behavior analysis
  • Performance characteristics
  • Future enhancement ideas

opooladz (Author) commented Oct 4, 2025

Let's see how codex did.
