ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models
By Aaron Defazio (FAIR at Meta Super-Intelligence Labs) • May 2026
TL;DR
Schedule-Free Learning replaces hyperparameter-sensitive learning-rate schedules with aggressive iterate averaging. This work introduces ScheduleFree+, which modifies the core algorithm to scale to large-batch LLM pretraining. By reintroducing inner momentum, using an L1-norm adaptive Polyak step size, and annealing $\beta$, it eliminates learning-rate tuning entirely while outperforming tuned schedules by 31% on long training runs.
Why This Matters
In modern deep learning, training runs are high-stakes, multimillion-dollar endeavors, yet the optimization process remains notoriously fragile. Practitioners must carefully orchestrate learning-rate schedules, such as Cosine or Warmup-Stable-Decay (WSD). If a training run is cut short or needs to be extended, the entire schedule must be redesigned and tuned from scratch.
This has led to a culture of "YOLO training runs", in which practitioners simply hope the tuned parameters hold at scale. Schedule-Free Learning aims to change this by replacing schedules with continuous iterate averaging. However, the original formulations fell apart when scaled to the massive batch sizes and model dimensions required for Large Language Models (LLMs). ScheduleFree+ fixes these scaling bottlenecks, offering a robust, "anytime" optimizer that matches or exceeds the best hand-tuned schedules without requiring learning-rate tuning.
The Big Idea
Instead of decaying the learning rate to force convergence at the end of training, Schedule-Free optimization maintains an online average of the model parameters. The core insight is to separate the point where gradients are evaluated ($y_t$) from the running average ($x_t$) and the raw base iterate ($z_t$).
By evaluating gradients at a carefully interpolated point $y_t = (1-\beta)z_t + \beta x_t$, the optimizer avoids the terrible convergence properties of standard Polyak averaging in deep networks, while retaining the stability and "anytime" stopping properties of averaged iterates.
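The interpolation above can be sketched in a few lines of plain Python. This is a minimal illustration of the base Schedule-Free SGD recursion, not ScheduleFree+ itself. It assumes a constant learning rate, a uniform averaging weight $c_{t+1} = 1/(t+1)$, and a toy quadratic objective; the names `schedule_free_sgd` and `quadratic_grad` are made up for this example.

```python
def schedule_free_sgd(grad, init, lr=0.1, beta=0.9, steps=200):
    """Toy Schedule-Free SGD on a list of floats (illustrative assumptions only)."""
    z = list(init)  # raw base iterate z_t
    x = list(init)  # running average x_t (the weights you would evaluate or deploy)
    for t in range(1, steps + 1):
        # Gradients are evaluated at the interpolated point y_t, not at z_t or x_t.
        y = [(1.0 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad(y)
        # Plain SGD step applied to the base iterate.
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        # Uniform online average (assumed weighting): c = 1 / (t + 1).
        c = 1.0 / (t + 1)
        x = [(1.0 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x, z


def quadratic_grad(w):
    """Gradient of the toy objective sum((w_i - 3)^2)."""
    return [2.0 * (wi - 3.0) for wi in w]


x_final, z_final = schedule_free_sgd(quadratic_grad, [0.0, 10.0])
```

Setting `beta=0.0` recovers plain gradient descent on $z_t$ with a uniform average of the iterates, which is the standard averaging scheme the interpolation is meant to improve on.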
How It Works: The Mechanics of ScheduleFree+
To make Schedule-Free learning work for LLMs, the authors introduced four critical modifications to the base algorithm:
- Inner Momentum Reintroduction: Restores AdamW's internal momentum ($\beta_1 = 0.9$), which was omitted in early versions, preventing divergence at massive batch sizes.
- Inverse-Gradient Norm Weighting: Scales step sizes inversely to the L1 norm of the gradient ($\gamma_t \propto 1 / \|g_t\|_1$) to combat gradient norm drift caused by weight norm shrinking.
- Polyak Step Size Adaptation: A parameter-free formulation that computes the step size on the fly.
- $\beta$-Annealing: Interpolates $\beta$ from $0.9$ to $0.965$ to balance rapid early-stage progress with stable late-stage convergence.
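To make the inverse-gradient-norm weighting concrete, here is a small sketch of a weight proportional to $1 / \|g_t\|_1$. The article gives only the proportionality, so the constant (taken as 1), the `eps` guard, and the function name `inverse_l1_weight` are assumptions for illustration; how the weight enters the full AdamW-style update is not shown here.

```python
def inverse_l1_weight(grad, eps=1e-12):
    """Weight proportional to 1 / ||g||_1; proportionality constant assumed to be 1."""
    l1_norm = sum(abs(g) for g in grad)
    # eps is an assumed safeguard against division by zero for an all-zero gradient.
    return 1.0 / (l1_norm + eps)


# A small gradient yields a large weight, a large gradient yields a small one,
# which counteracts drift in the gradient norm over training.
weight_small_grad = inverse_l1_weight([0.1, -0.1])
weight_large_grad = inverse_l1_weight([10.0, -10.0])
```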
Interactive 1: Interpolation & $\beta$-Annealing
The evaluation point $y_t$ sits between the raw iterate $z_t$ and the running average $x_t$, at a position set by $\beta$.
High $\beta$ forces the evaluation point $y_t$ closer to the stable average $x_t$. Low $\beta$ tracks the raw iterate $z_t$.
Annealing $\beta$ over time captures fast initial drops in loss while securing lower late-stage loss.
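The effect of $\beta$ on the evaluation point can be checked numerically. The sketch below assumes a linear ramp from $0.9$ to $0.965$ over a known number of steps; the article states only the two endpoints, so the ramp shape, the `total_steps` argument, and both function names are hypothetical.

```python
def annealed_beta(step, total_steps, beta_start=0.9, beta_end=0.965):
    """Linearly interpolate beta over training (the linear shape is an assumption)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return beta_start + frac * (beta_end - beta_start)


def evaluation_point(z, x, beta):
    """y_t = (1 - beta) * z_t + beta * x_t for scalar z_t and x_t."""
    return (1.0 - beta) * z + beta * x


# With z_t = 1 and x_t = 0, y_t equals (1 - beta): it moves toward x_t as beta grows.
z, x = 1.0, 0.0
for step in (0, 500, 1000):
    beta = annealed_beta(step, total_steps=1000)
    print(step, round(beta, 4), round(evaluation_point(z, x, beta), 4))
```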
Interactive 2: Polyak Step-Size vs. Fixed Schedules
The dynamic Polyak step size adapts automatically to gradient norms and training progress, in contrast to standard schedules that are fixed in advance.
The Polyak Formula:
The Polyak step size automatically scales the learning rate in proportion to the square root of the batch size under AdamW, eliminating manual tuning.
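For reference, the classical Polyak rule sets $\gamma_t = (f(w_t) - f_*) / \|g_t\|_2^2$. The sketch below implements that textbook form on a toy quadratic where $f_* = 0$ is known. It is not the modified L1-norm variant used by ScheduleFree+, which this article does not spell out; the function names and the `eps` guard are assumptions.

```python
def polyak_step_size(loss, loss_opt, grad, eps=1e-12):
    """Classical Polyak step: (f(w) - f_*) / ||g||_2^2, with an assumed eps guard."""
    sq_norm = sum(g * g for g in grad)
    return (loss - loss_opt) / (sq_norm + eps)


def loss_fn(w):
    return sum((wi - 3.0) ** 2 for wi in w)


def grad_fn(w):
    return [2.0 * (wi - 3.0) for wi in w]


w = [0.0, 10.0]
for _ in range(10):
    g = grad_fn(w)
    # No learning rate is supplied: the step size comes from the loss gap and gradient norm.
    gamma = polyak_step_size(loss_fn(w), 0.0, g)
    w = [wi - gamma * gi for wi, gi in zip(w, g)]
```

On this quadratic the rule yields $\gamma_t \approx 1/4$ at every step, so the distance to the minimizer halves each iteration without any tuning.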
Interactive 3: The "No YOLO" Loss Predictor
The paper reports that Schedule-Free loss curves follow a highly predictable $1/\sqrt{t}$ curve, so the final loss can be projected from a run that is stopped early.
To do so, stop training early and fit the curve $f(t) = \frac{a}{\sqrt{t+b}} + c$ to project the remaining trajectory.
The fitted parameter $c$ estimates the theoretical optimal function value $f_*$ reached as $t \to \infty$.
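One way to carry out such a fit without external libraries is sketched below. For a fixed $b$ the model is linear in $a$ and $c$, so the code grid-searches $b$ and solves the remaining least-squares problem in closed form. The fitting procedure, the grid values, and the synthetic loss curve are all assumptions for illustration and are not taken from the paper.

```python
import math


def fit_inverse_sqrt(steps, losses, b_grid):
    """Fit f(t) = a / sqrt(t + b) + c: grid search over b, least squares in (a, c)."""
    best = None
    n = len(steps)
    mean_y = sum(losses) / n
    for b in b_grid:
        u = [1.0 / math.sqrt(t + b) for t in steps]
        mean_u = sum(u) / n
        var_u = sum((ui - mean_u) ** 2 for ui in u)
        if var_u == 0.0:
            continue  # cannot regress on a constant feature
        cov = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, losses))
        a = cov / var_u
        c = mean_y - a * mean_u
        sse = sum((a * ui + c - yi) ** 2 for ui, yi in zip(u, losses))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


# Synthetic, noise-free "early-stopped" run generated from known parameters.
true_a, true_b, true_c = 2.0, 5.0, 1.5
steps = list(range(1, 201))
losses = [true_a / math.sqrt(t + true_b) + true_c for t in steps]

a, b, c = fit_inverse_sqrt(steps, losses, b_grid=[0.0, 1.0, 2.0, 5.0, 10.0, 20.0])
projected_loss = a / math.sqrt(10_000 + b) + c  # projection far beyond the observed steps
```

With real, noisy loss curves a finer grid over $b$ or a nonlinear least-squares routine would likely be needed, and the projection is only as good as the assumed functional form.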
Results & Scaling Ladders
The authors evaluated ScheduleFree+ across a range of model sizes (120M to 2B parameters) and token budgets (20 to 1000 tokens per parameter). The most dramatic improvements appear on long-duration training runs (1000 tokens per parameter), which are increasingly representative of modern frontier LLM pretraining.
Results are compared against a fully grid-searched Linear Decay baseline schedule.
Linear Decay requires 45% more tokens to match the final loss of ScheduleFree+.
The Polyak step size sets the learning rate dynamically and optimally.
Limitations & Open Questions
While ScheduleFree+ marks a significant step forward, several open questions and limitations remain:
- Short-horizon performance: At very short training durations (e.g., 20 tokens per parameter), weight and gradient norms do not reach steady state fast enough, occasionally leading to a small performance gap compared to linear decay schedules.
- API integration: Because the optimal evaluation point $y_t$ differs from the running average $x_t$, standard optimizer APIs (which assume gradients are evaluated at the returned weights) require minor modifications.
- Memory overhead: Maintaining the running average $x_t$ along with AdamW moments increases the overall parameter storage footprint, though this can be mitigated with shard-offloading in distributed setups.
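The API point above can be illustrated with a toy container that exposes two different sets of weights: $y_t$ for computing training gradients and $x_t$ for evaluation or checkpointing. The class `ToyScheduleFreeParams` and its method names are hypothetical and do not correspond to any real optimizer API; the update uses plain SGD with uniform averaging as a stand-in for the full algorithm. It also makes the extra storage visible, since both `z` and `x` are kept.

```python
class ToyScheduleFreeParams:
    """Hypothetical container holding z_t and x_t (two copies of the parameters)."""

    def __init__(self, init, beta=0.9):
        self.z = list(init)  # base iterate
        self.x = list(init)  # running average (extra memory versus a plain optimizer)
        self.beta = beta
        self.t = 0

    def train_weights(self):
        # y_t: the point at which the training loop must evaluate gradients.
        return [(1.0 - self.beta) * zi + self.beta * xi
                for zi, xi in zip(self.z, self.x)]

    def eval_weights(self):
        # x_t: the averaged weights to use for validation and checkpoints.
        return list(self.x)

    def step(self, grad_at_y, lr):
        self.t += 1
        self.z = [zi - lr * gi for zi, gi in zip(self.z, grad_at_y)]
        c = 1.0 / (self.t + 1)  # uniform averaging weight (assumed)
        self.x = [(1.0 - c) * xi + c * zi for xi, zi in zip(self.x, self.z)]


params = ToyScheduleFreeParams([0.0])
grad = [2.0 * (w - 3.0) for w in params.train_weights()]  # gradient taken at y_t
params.step(grad, lr=0.1)
```

After a single step the training and evaluation weights already differ, which is why a training loop that reads gradients from the "returned" weights needs to know which of the two it is getting.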
Glossary
Polyak Step Size
A classic step-size rule for optimization that scales updates based on the current loss value relative to the optimal loss ($f_*$) and the norm of the gradient. ScheduleFree+ uses a modified L1-norm version to stabilize stochastic deep learning.
Gradient Norm Drift
The phenomenon where the gradient norms systematically grow or shrink during training, often driven by changing weight norms in networks utilizing normalization layers (like LayerNorm). This alters the effective learning rate over time.
Anytime Training
An optimization paradigm where the model is usable and near-optimal at any point during the training run. Unlike schedule-based training, which requires completing the decay phase to achieve peak performance, anytime training can be stopped at any step.
Citation
@article{defazio2026schedulefreeplus,
title={ScheduleFree+: Scaling Learning-Rate-Free \& Schedule-Free Learning to Large Language Models},
author={Defazio, Aaron},
journal={arXiv preprint arXiv:2605.19095},
year={2026}
}