Optimization Breakthrough

ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

By Aaron Defazio (FAIR at Meta Super-Intelligence Labs) • May 2026

TL;DR

Schedule-Free Learning replaces hyperparameter-sensitive learning rate schedules with aggressive iterate averaging. This work introduces ScheduleFree+, which modifies the core algorithm to scale to large-batch LLM pretraining. By reintroducing inner momentum, using an L1-norm adaptive Polyak step size, and employing $\beta$-annealing, it eliminates learning rate tuning while reaching the same loss as a grid-searched linear decay schedule in 31% less training time on long training runs.

No Learning Rate Tuning • Polyak Step Size • Anytime Training • 31% Faster Convergence

Why This Matters

In modern deep learning, training runs are high-stakes, multimillion-dollar endeavors. Yet the optimization process remains notoriously fragile. Practitioners must carefully orchestrate learning-rate schedules, such as Cosine or Warmup-Stable-Decay (WSD). If a training run is cut short, or if you wish to extend it, the entire schedule must be redesigned and tuned from scratch.

This has led to a culture of "YOLO training runs"—hoping the tuned parameters hold at scale. Schedule-Free Learning aims to change this by replacing schedules with continuous iterate averaging. However, original formulations fell apart when scaled to the massive batch sizes and model dimensions required for Large Language Models (LLMs). ScheduleFree+ fixes these scaling bottlenecks, offering a robust, "anytime" optimizer that matches or exceeds the best hand-tuned schedules without requiring learning rate tuning.

The Big Idea

Instead of decaying the learning rate to force convergence at the end of training, Schedule-Free optimization maintains an online average of the model parameters. The core insight is to separate the point where gradients are evaluated ($y_t$) from the running average ($x_t$) and the raw base iterate ($z_t$).

By evaluating gradients at a carefully interpolated point $y_t = (1-\beta)z_t + \beta x_t$, the optimizer avoids the poor practical performance of standard Polyak averaging in deep networks, while retaining the stability and "anytime" stopping properties of averaged iterates.
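
The three sequences can be sketched in a few lines of plain Python. This is a minimal illustration of the base Schedule-Free SGD recursion only, under stated assumptions: a fixed step size `lr`, uniform averaging weights $c_{t+1} = 1/(t+1)$, and a hypothetical quadratic test objective. It leaves out the inner momentum, Polyak step size, and $\beta$-annealing that ScheduleFree+ adds.

```python
def schedule_free_sgd(grad, z0, lr=1.0, beta=0.9, steps=200):
    """Base Schedule-Free SGD on a flat list of floats (illustrative sketch)."""
    z = list(z0)  # raw base iterate z_t
    x = list(z0)  # running average x_t (the point you would evaluate or deploy)
    for t in range(1, steps + 1):
        # Gradients are evaluated at y_t = (1 - beta) * z_t + beta * x_t.
        y = [(1 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad(y)
        # Plain SGD step on the base iterate.
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        # Uniform online average (assumed): x_{t+1} = (1 - c) x_t + c z_{t+1}.
        c = 1.0 / (t + 1)
        x = [(1 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x, z


# Hypothetical test problem: f(w) = 0.5 * sum((w - target)^2).
target = [3.0, -1.0]

def quad_grad(w):
    return [wi - ti for wi, ti in zip(w, target)]

x_final, z_final = schedule_free_sgd(quad_grad, z0=[0.0, 0.0])
```

No learning rate decay appears anywhere in the loop; the averaged iterate `x_final` approaches `target` through averaging alone.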

How It Works: The Mechanics of ScheduleFree+

To make Schedule-Free learning work for LLMs, the paper introduces four critical modifications to the base algorithm:

Inner Momentum Reintroduction: Restores AdamW's internal momentum ($\beta_1 = 0.9$) which was omitted in early versions, preventing divergence at massive batch sizes.
Inverse-Gradient Norm Weighting: Scales step sizes inversely to the L1 norm of the gradient ($\gamma_t \propto 1 / \|g_t\|_1$) to combat gradient norm drift caused by weight norm shrinking.
Polyak Step Size Adaptation: A parameter-free formulation that dynamically calculates the step size on the fly.
$\beta$-Annealing: Interpolates $\beta$ from $0.9$ to $0.965$ to balance rapid early-stage progress with stable late-stage convergence.
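
As an illustration of the last item, the sketch below interpolates $\beta$ between the two stated endpoints. The linear shape and the `anneal_steps` horizon are assumptions made for this example; the text above only gives the endpoints $0.9$ and $0.965$, and the paper may use a different interpolation.

```python
def annealed_beta(step, anneal_steps, beta_start=0.9, beta_end=0.965):
    """Linearly interpolate beta from beta_start to beta_end (assumed shape)."""
    # Clamp progress to [0, 1] so beta stays at beta_end after the horizon.
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return beta_start + frac * (beta_end - beta_start)


# Hypothetical 1000-step annealing horizon.
betas = [annealed_beta(s, 1000) for s in (0, 500, 1000, 2000)]
```

Early steps use the lower $\beta$, so gradients are evaluated closer to the raw iterate $z_t$; later steps shift the evaluation point toward the average $x_t$.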

Interactive 1: Interpolation & $\beta$-Annealing

Visualize how the evaluation point $y_t$ is positioned between the raw iterate $z_t$ and the running average $x_t$.

Controls: Interpolation Parameter ($\beta$, shown at 0.90) and Beta Annealing Mode.

High $\beta$ forces the evaluation point $y_t$ closer to the stable average $x_t$. Low $\beta$ tracks the raw iterate $z_t$.

Annealing $\beta$ over time captures fast initial drops in loss while securing lower late-stage loss.

[Diagram: the evaluation point $y_t = (1-\beta)z_t + \beta x_t$ lies between $z_t$ (raw iterate) and $x_t$ (running average).]

Interactive 2: Polyak Step-Size vs. Fixed Schedules

Simulate how the dynamic Polyak step size automatically adapts to gradient norms and training progress compared to standard schedules.

Controls: Gradient Noise Level (shown at 0.3) and Batch Size Scaling.

The Polyak Formula:

$$\gamma_t = \frac{f(y_t) - f_* + \beta \langle \nabla f(y_t), z_t - x_t \rangle}{\sqrt{\frac{\pi}{2}} \|\nabla f(y_t)\|_1}$$

This step size automatically scales the learning rate in proportion to the square root of the batch size under AdamW, eliminating manual tuning.
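
The formula can be transcribed directly. The sketch below is a plain-Python evaluation of the expression for $\gamma_t$ on flat lists of floats; the function name, argument names, and the sample numbers are hypothetical. It assumes $f_*$ is known or estimated and that the gradient is nonzero.

```python
import math


def polyak_step_size(f_y, f_star, grad_y, z, x, beta):
    """Evaluate gamma_t from the displayed formula (illustrative sketch)."""
    # Inner product <grad f(y_t), z_t - x_t>.
    inner = sum(g * (zi - xi) for g, zi, xi in zip(grad_y, z, x))
    # L1 norm of the gradient; assumed nonzero here.
    l1_norm = sum(abs(g) for g in grad_y)
    numerator = f_y - f_star + beta * inner
    return numerator / (math.sqrt(math.pi / 2.0) * l1_norm)


# Hypothetical values, chosen only to exercise the function.
gamma = polyak_step_size(
    f_y=2.0, f_star=1.0, grad_y=[1.0, -1.0], z=[0.5, 0.0], x=[0.0, 0.0], beta=0.9
)
```

Because the gradient norm sits in the denominator, larger gradients produce smaller steps, which is the inverse-gradient-norm behavior described earlier.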

[Plot: step size ($\gamma_t$) over training steps for Polyak (ScheduleFree+), Linear Decay, and WSD Schedule.]

Interactive 3: The "No YOLO" Loss Predictor

Recreate the paper's finding that Schedule-Free loss curves follow a highly predictable $1/\sqrt{t}$ curve. Train a model, stop early, and project the final loss!

Control: Training Cutoff (Prediction Point), shown at 15%.

Stop training early and fit the curve $f(t) = \frac{a}{\sqrt{t+b}} + c$ to project the remaining trajectory.

Fitted Parameters: $a$, $b$, and $c$ (values appear once a fit has been run).

The fitted parameter $c$ is the limit of the fitted curve as $t \to \infty$, so it serves as an estimate of the theoretical optimal function value $f_*$.
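
One simple way to fit $f(t) = \frac{a}{\sqrt{t+b}} + c$ is sketched below. For a fixed $b$ the model is linear in $a$ and $c$, so the sketch grid-searches $b$ and solves for $a$ and $c$ by ordinary least squares. The fitting procedure, the grid, and the noise-free synthetic loss curve are all assumptions for illustration; the source does not say how the fit is performed, and real loss curves are noisy.

```python
import math


def fit_inverse_sqrt(steps, losses, b_grid):
    """Fit a / sqrt(t + b) + c by grid search over b (illustrative sketch)."""
    n = len(steps)
    mean_y = sum(losses) / n
    best = None
    for b in b_grid:
        u = [1.0 / math.sqrt(t + b) for t in steps]
        mean_u = sum(u) / n
        var_u = sum((ui - mean_u) ** 2 for ui in u)
        if var_u == 0.0:
            continue
        # Closed-form least squares for the linear part (a, c).
        cov = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, losses))
        a = cov / var_u
        c = mean_y - a * mean_u
        sse = sum((a * ui + c - yi) ** 2 for ui, yi in zip(u, losses))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


# Synthetic, noise-free curve with hypothetical parameters.
true_a, true_b, true_c = 2.0, 10.0, 1.5
all_steps = list(range(1, 1001))
all_losses = [true_a / math.sqrt(t + true_b) + true_c for t in all_steps]

# Fit on the first 15% of the run, then project the final loss.
cutoff = len(all_steps) * 15 // 100
a, b, c = fit_inverse_sqrt(all_steps[:cutoff], all_losses[:cutoff], range(0, 51))
predicted_final = a / math.sqrt(all_steps[-1] + b) + c
```

With noisy data, a finer grid over $b$ or a general nonlinear least-squares solver would be a more natural choice.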

[Plot: model validation loss for the 500M-parameter run, actual loss versus predicted curve.]

Results & Scaling Ladders

The paper evaluates ScheduleFree+ across a range of model sizes (120M to 2B parameters) and token budgets (20 to 1000 tokens per parameter). The most dramatic improvements appear on long-duration training runs (1000 tokens per parameter), which are increasingly representative of modern frontier LLM pretraining.

Training Time Reduction: 31% less training time to reach the same loss, compared to a fully grid-searched Linear Decay baseline schedule.
Equivalent Training Run: Linear Decay needs a 45% longer run (45% more tokens) to match the ScheduleFree+ final loss.
Tuning Overhead: Zero grid search. The Polyak step size sets the learning rate dynamically during training.
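
The 31% and 45% figures are two views of the same gap, assuming both describe the same pair of runs. If ScheduleFree+ needs only $1 - 0.31 = 0.69$ of the baseline's training time to reach a given loss, then the baseline needs

$$\frac{1}{1 - 0.31} = \frac{1}{0.69} \approx 1.45$$

times the ScheduleFree+ budget, which is a run about 45% longer.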

Limitations & Open Questions

While ScheduleFree+ marks a significant step forward, several open questions and limitations remain:

Short-horizon performance: At very short training durations (e.g., 20 tokens per parameter), weight and gradient norms do not reach steady state fast enough, occasionally leading to a small performance gap compared to linear decay schedules.
API integration: Because the optimal evaluation point $y_t$ differs from the running average $x_t$, standard optimizer APIs (which assume gradients are evaluated at the returned weights) require minor modifications.
Memory overhead: Maintaining the running average $x_t$ along with AdamW moments increases the overall parameter storage footprint, though this can be mitigated with shard-offloading in distributed setups.
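
The API point can be made concrete with a toy optimizer that exposes different weights depending on mode. The class below is hypothetical and is not the interface of any real library; it assumes plain SGD on the base iterate with uniform averaging, and it shows only why a training mode (gradients at $y_t$) and an evaluation mode (weights at $x_t$) need to be distinguished.

```python
class ToyScheduleFreeSGD:
    """Hypothetical illustration of mode-dependent weights (not a real API)."""

    def __init__(self, params, lr=1.0, beta=0.9):
        self.z = list(params)  # raw base iterate z_t
        self.x = list(params)  # running average x_t
        self.lr, self.beta, self.t = lr, beta, 0
        self.training = True

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

    def weights(self):
        """Return y_t in training mode and x_t in evaluation mode."""
        if not self.training:
            return list(self.x)
        b = self.beta
        return [(1 - b) * zi + b * xi for zi, xi in zip(self.z, self.x)]

    def step(self, grad_at_y):
        """Update z and x from a gradient computed at the training-mode weights."""
        self.t += 1
        self.z = [zi - self.lr * gi for zi, gi in zip(self.z, grad_at_y)]
        c = 1.0 / (self.t + 1)  # uniform averaging weight (assumed)
        self.x = [(1 - c) * xi + c * zi for xi, zi in zip(self.x, self.z)]


# Usage on a hypothetical objective f(w) = 0.5 * (w - 3)^2.
opt = ToyScheduleFreeSGD([0.0])
opt.train()
for _ in range(200):
    w = opt.weights()          # y_t: where the gradient is evaluated
    opt.step([w[0] - 3.0])
opt.eval()
final = opt.weights()          # x_t: the weights to evaluate or deploy
```

Forgetting to switch modes before evaluation would score the model at $y_t$ rather than at the averaged weights $x_t$.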

Glossary

Polyak Step Size

A classic step-size rule for optimization that scales updates based on the current loss value relative to the optimal loss ($f_*$) and the norm of the gradient. ScheduleFree+ uses a modified L1-norm version to stabilize stochastic deep learning.

Gradient Norm Drift

The phenomenon where the gradient norms systematically grow or shrink during training, often driven by changing weight norms in networks utilizing normalization layers (like LayerNorm). This alters the effective learning rate over time.

Anytime Training

An optimization paradigm where the model is usable and near-optimal at any point during the training run. Unlike schedule-based training, which requires completing the decay phase to achieve peak performance, anytime training can be stopped at any step.

Citation

@article{defazio2026schedulefreeplus,
  title={ScheduleFree+: Scaling Learning-Rate-Free \& Schedule-Free Learning to Large Language Models},
  author={Defazio, Aaron},
  journal={arXiv preprint arXiv:2605.19095},
  year={2026}
}