ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate

Official Implementation of "ADOPT: Modified Adam Can Converge with Any β₂ with the Optimal Rate", which is presented at NeurIPS 2024.

Update on Nov 22, 2024

Based on feedbacks from some practitioners, we have updated the implementation to improve the stability of our ADOPT algorithm. In the original version, ADOPT sometimes gets unstable especially in the early stage of training. This seems to be because the near-zero division by the second memont estimate occurs when some elements of the parameter gradient are near zero at initialization. For example, when some parameters (e.g., the last layer of a neural net) are initialized with zero, which is often-used technique in deep learing, a near-zero gradient is observed at the first parameter update. To avoid such near-zero divisions, we have decided to add a clipping operation in the momentum update.

Even when clipping is applied, the convergence guarantee in theory is maintained by properly scheduling the clipping value (see the updated arXiv paper). In our implementation, the clipping value is controlled by the argument clip_lambda, which is a callable function that determines the schedule of the clipping value depending on the number of gradient steps. By default, the clipping value is set to step**0.25, which aligns with the theory to ensure the convergence. We observe that the clipped ADOPT works much more stably than the original one, so we recommend to use it over the unclipped version. If you want to reproduce the behaivior of the original version, you should set clip_lambda = None.

Requirements

ADOPT requires PyTorch 2.5.0 or later.

Usage

You can use ADOPT just like any other PyTorch optimizers by copying adopt.py to your project.

When you replace the Adam optimizer to our ADOPT, you should just replace the optimizer as follows:

from adopt import ADOPT
# optimizer = Adam(model.parameters(), lr=1e-3)
optimizer = ADOPT(model.parameters(), lr=1e-3)

When you are using AdamW as a default optimizer, you should set decouple=True for our ADOPT:

# optimizer = AdamW(model.parameters(), lr=1e-3)
optimizer = ADOPT(model.parameters(), lr=1e-3, decouple=True)

Citation

If you use ADOPT in your research, please cite the paper.

@inproceedings{taniguchi2024adopt,
 author={Taniguchi, Shohei and Harada, Keno and Minegishi, Gouki and Oshima, Yuta and Jeong, Seong Cheol and Nagahara, Go and Iiyama, Tomoshi and Suzuki, Masahiro and Iwasawa, Yusuke and Matsuo, Yutaka},
 booktitle = {Advances in Neural Information Processing Systems},
 title = {ADOPT: Modified Adam Can Converge with Any β2 with the Optimal Rate},
 year = {2024}
}

License

Apache 2.0

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
gpt		gpt
imagenet		imagenet
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
adopt.py		adopt.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate

Update on Nov 22, 2024

Requirements

Usage

Citation

License

About

Releases

Packages

Contributors 3

Languages

License

iShohei220/adopt

Folders and files

Latest commit

History

Repository files navigation

ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate

Update on Nov 22, 2024

Requirements

Usage

Citation

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages