Drop-Muon: Update Less, Converge Faster

Gruntkowska, Kaja; Maziane, Yassine; Qu, Zheng; Richtárik, Peter

arXiv:2510.02239 (cs)

[Submitted on 2 Oct 2025]

Title: Drop-Muon: Update Less, Converge Faster

Authors: Kaja Gruntkowska, Yassine Maziane, Zheng Qu, Peter Richtárik

Abstract: Conventional wisdom in deep learning optimization dictates updating all layers at every step, a principle followed by all recent state-of-the-art optimizers such as Muon. In this work, we challenge this assumption, showing that full-network updates can be fundamentally suboptimal, both in theory and in practice. We introduce Drop-Muon, a non-Euclidean Randomized Progressive Training method: a simple yet powerful framework that updates only a subset of layers per step according to a randomized schedule, combining the efficiency of progressive training with layer-specific non-Euclidean updates for top-tier performance. We provide rigorous convergence guarantees under both layer-wise smoothness and layer-wise $(L^0, L^1)$-smoothness, covering deterministic and stochastic gradient settings, marking the first such results for progressive training in the stochastic and non-smooth regime. Our cost analysis further reveals that full-network updates are not optimal unless a very specific relationship between layer smoothness constants holds. Through controlled CNN experiments, we empirically demonstrate that Drop-Muon consistently outperforms full-network Muon, achieving the same accuracy up to $1.4\times$ faster in wall-clock time. Together, our results suggest a shift in how large-scale models can be efficiently trained, challenging the status quo and offering a highly efficient, theoretically grounded alternative to full-network updates.
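
As a rough illustration of the idea in the abstract (update only a randomly chosen subset of layers, each with a non-Euclidean, Muon-style step), here is a minimal NumPy sketch. It is not the authors' algorithm: the function names, the toy quadratic objective, the "random cutoff" schedule, and the use of an exact SVD to orthogonalize the gradient are all assumptions made for this example.

```python
import numpy as np

def orthogonalize(grad):
    """Return the orthogonal polar factor U @ V^T of a gradient matrix."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

def sample_active_layers(num_layers, rng):
    """Hypothetical schedule: draw a random cutoff, keep layers cutoff..end."""
    cutoff = int(rng.integers(0, num_layers))
    return list(range(cutoff, num_layers))

def partial_update_step(weights, grads, active, lr):
    """Update only the layers listed in `active`; leave the others untouched."""
    new_weights = [w.copy() for w in weights]
    for i in active:
        new_weights[i] = weights[i] - lr * orthogonalize(grads[i])
    return new_weights

# Toy objective (an assumption): f(W) = sum_i 0.5 * ||W_i - T_i||_F^2,
# so the gradient with respect to layer i is W_i - T_i.
rng = np.random.default_rng(0)
targets = [rng.standard_normal((4, 3)) for _ in range(3)]
weights = [np.zeros((4, 3)) for _ in range(3)]

active = sample_active_layers(len(weights), rng)
# Gradients are computed only for the sampled layers.
grads = {i: weights[i] - targets[i] for i in active}
updated = partial_update_step(weights, grads, active, lr=0.1)
```

In this sketch the layers outside `active` are never touched and their gradients are never computed, which is where a per-step saving would come from. Every updated layer moves by a matrix whose singular values all equal the learning rate, regardless of the gradient's scale.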

Subjects:	Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Cite as:	arXiv:2510.02239 [cs.LG]
	(or arXiv:2510.02239v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2510.02239

Submission history

From: Kaja Gruntkowska
[v1] Thu, 2 Oct 2025 17:28:55 UTC (653 KB)
