Scaling Exponents Across Parameterizations and Optimizers

Everett, Katie; Xiao, Lechao; Wortsman, Mitchell; Alemi, Alexander A.; Novak, Roman; Liu, Peter J.; Gur, Izzeddin; Sohl-Dickstein, Jascha; Kaelbling, Leslie Pack; Lee, Jaehoon; Pennington, Jeffrey

arXiv:2407.05872 (cs)

[Submitted on 8 Jul 2024 (v1), last revised 16 Jul 2024 (this version, v2)]

Abstract: Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parameters and data and derive new theoretical results under weaker assumptions and a broader set of optimizers. Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 26.8B parameters. We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work. Our results show that all parameterizations, not just maximal update parameterization (muP), can achieve hyperparameter transfer; moreover, our novel per-layer learning rate prescription for standard parameterization outperforms muP. Finally, we demonstrate that an overlooked aspect of parameterization, the epsilon parameter in Adam, must be scaled correctly to avoid gradient underflow and propose Adam-atan2, a new numerically stable, scale-invariant version of Adam that eliminates the epsilon hyperparameter entirely.
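The abstract names Adam-atan2 but does not spell out its update rule. The following is a minimal scalar sketch, assuming the usual Adam ratio m / (sqrt(v) + eps) is replaced by atan2(m, sqrt(v)). The function names `adam_atan2_step` and `run`, the default hyperparameters, and the absence of any extra scaling constants are all assumptions made for illustration and may differ from the paper's formulation.

```python
import math


def adam_atan2_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999):
    """One scalar Adam-style step with atan2 in place of m / (sqrt(v) + eps).

    Hypothetical sketch: names and defaults are illustrative assumptions.
    """
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Standard Adam bias correction (step counts from 1).
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    # atan2(y, x) depends only on the ratio y / x when x > 0, and
    # atan2(0, 0) is defined as 0, so no epsilon is needed.
    update = math.atan2(m_hat, math.sqrt(v_hat))
    return param - lr * update, m, v


def run(grads, scale=1.0):
    """Apply the step to a sequence of gradients, each multiplied by `scale`."""
    param, m, v = 0.0, 0.0, 0.0
    for step, g in enumerate(grads, start=1):
        param, m, v = adam_atan2_step(param, scale * g, m, v, step)
    return param
```

Under these assumptions, multiplying every gradient by a positive constant c scales m by c and sqrt(v) by c, so the atan2 argument pair keeps the same direction and the update is unchanged. Calling `run(grads, 1e-6)` and `run(grads, 1e6)` should therefore agree with `run(grads)` up to floating-point rounding, which is the scale invariance the abstract refers to. The update is also bounded, since atan2 with a non-negative second argument lies in [-pi/2, pi/2].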

Comments:	63 pages, International Conference on Machine Learning 2024
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2407.05872 [cs.LG]
	(or arXiv:2407.05872v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2407.05872

Submission history

From: Katie Everett
[v1] Mon, 8 Jul 2024 12:32:51 UTC (3,668 KB)
[v2] Tue, 16 Jul 2024 17:40:09 UTC (3,682 KB)
