Implementations of various deep learning architectures and algorithms, using only the basic ops provided by PyTorch.
| Model | WikiText-103 perplexity | Closest public model (perplexity) |
|---|---|---|
| GPT-2, 12 layers | 26.7 | gpt2-medium (26.37) |
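
For reference, perplexity is the exponential of the mean per-token negative log-likelihood. The snippet below is a minimal sketch of that formula in plain Python, assuming losses measured in nats. The helper name `perplexity` is hypothetical, and whether the numbers in the table are normalized per subword token or per word is not stated here.

```python
import math


def perplexity(token_nlls):
    """Return exp(mean NLL) for a list of per-token losses in nats."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that spreads probability evenly over 4 choices at every step
# pays log(4) nats per token, so its perplexity is (up to rounding) 4.
nlls = [math.log(4)] * 10
print(perplexity(nlls))
```

Read this way, a perplexity of 26.7 means the model is, on average, about as uncertain as a uniform choice among roughly 27 tokens.
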
- RNN vs. LSTM vs. GRU on a toy dataset of "abcdef...": TensorBoard
- RNN vs. LSTM vs. GRU on a toy dataset of "a...ab..bc..c...": TensorBoard
- Implement transformer with self-attention
- Implement sinusoidal position embeddings
- Implement relative position bias a la T5
- Implement RoPE embeddings
- Add support for cross-attention, as used in neural machine translation (NMT)
- Implement beam search decoding
- Implement RNN
- Implement LSTM
- Implement GRU
- Implement RWKV
- Load wikitext dataset
- Implement training loop
- Implement tool for generating text
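- Set up TensorBoard logging for metrics and text samples
- Implement model checkpoint saving and resuming
- Initialize weights correctly and verify that the initial loss is -log(1/50000) ≈ 10.82, i.e. the loss of a uniform prediction over a 50,000-token vocabulary
- Limit the training set to one batch and verify that the training loss goes to 0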
- Try mixed precision
- Scale to 1.5B param model
- Scale to 1024 sequence length
- Implement evaluation framework
- Collect popular LM benchmarks and published metrics
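
The initial-loss check in the list above follows from the cross-entropy of a uniform prediction: if a freshly initialized model assigns roughly equal probability to each of V tokens, its loss should be close to -log(1/V) = log(V). The sketch below works that number out in plain Python so it runs without PyTorch. The function names are hypothetical, and the vocabulary size of 50,000 is taken from the list item rather than from any tokenizer.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits, target):
    """Negative log-likelihood of the target index under softmax(logits)."""
    return -log_softmax(logits)[target]


def expected_initial_loss(vocab_size):
    """Loss of a model that predicts a uniform distribution over the vocabulary."""
    return -math.log(1.0 / vocab_size)


vocab_size = 50_000  # assumption: taken from the "-log(1/50000)" list item
uniform_logits = [0.0] * vocab_size  # all-equal logits give a uniform softmax

# With uniform logits, every target index yields the same loss.
loss = cross_entropy(uniform_logits, target=123)
print(f"expected: {expected_initial_loss(vocab_size):.4f}, measured: {loss:.4f}")
```

Both values come out to about 10.82. An initial training loss far above that usually points to an initialization that produces overconfident logits.
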