Hanqing Zhu^†, Zhenyu Zhang^, Hanxian Huang†, DiJia Su†, Zechun Liu†, Jiawei Zhao†, Igor Fedorov†, Hamed Pirsiavash†, Zhizhou Sha^, Jinwon Lee†, David Z. Pan^, Zhangyang Wang^, Yuandong Tian\*†, Kai Sheng Tai\*†

*: Equal advisory contribution, †: Meta AI, ^: The University of Texas at Austin

Link: https://arxiv.org/abs/2511.08567

First online on Nov 10, 2025 | Preliminary version accepted by NeurIPS 2025 ER Workshop

<aside>

TL;DR: We present the first parameter-level map of RLVR’s training dynamics.

A surprising regularity in how parameters evolve: RLVR boosts performance without pushing on principal weights. This reflects a model-conditioned optimization bias: for a fixed pretrained model, updates consistently land in the same off-principal, low-curvature region across datasets and RL variants, preserving the weight spectra and causing minimal subspace rotation throughout training.

We introduce a Three-Gate Theory (KL anchor → geometry → precision) that explains and characterizes this unique optimization dynamic. It also accounts for several recent observations directly from parameter space: why RL updates appear sparse, why RL forgets less, and why online quantization only needs to calibrate once. Crucially, it shows that RLVR operates in a distinct optimization regime from SFT (which targets principal weights), so SFT-era PEFT tricks (sparse fine-tuning, advanced LoRA variants, and PiSSA, mentioned in the recent Thinking Machines blog) do not work well for RL!

We hope this work charts a path toward a white-box understanding of RLVR and the design of RLVR-native learning algorithms, rather than repurposed SFT-era heuristics. 🙂

</aside>

lrm-teaser-v8.jpg

Figure 1 — SFT vs. RLVR: the geometry at a glance. (a) SFT follows an externally guided route, pushing through high-curvature/principal directions (“over the mountain”). (b) RLVR behaves as if steered by an implicit compass, a model-conditioned bias, taking a low-curvature, off-principal detour (“around the mountain”). (c) Evidence. Left: heatmaps compare an update mask (where parameters change) with a principal mask (top-k principal-direction locations); RLVR updates avoid principal weights, while SFT targets them. Right: principal-angle curves show that RLVR rotates its subspaces far less than SFT.
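The subspace-rotation comparison in panel (c) can be sketched numerically. Below is a minimal, hypothetical illustration (not the paper's exact measurement pipeline): the principal angles between the top-k left singular subspaces of a weight matrix before and after training, where small angles mean the principal subspace barely moved.

```python
import numpy as np

def principal_angles(w_before, w_after, k=8):
    """Principal angles (degrees) between the top-k left singular
    subspaces of a weight matrix before and after training.
    Small angles => little subspace rotation."""
    u0, _, _ = np.linalg.svd(w_before, full_matrices=False)
    u1, _, _ = np.linalg.svd(w_after, full_matrices=False)
    # Singular values of U0k^T @ U1k are the cosines of the principal angles.
    cos = np.linalg.svd(u0[:, :k].T @ u1[:, :k], compute_uv=False)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Toy check: a tiny perturbation barely rotates the principal subspace.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
angles = principal_angles(w, w + 1e-3 * rng.standard_normal((64, 64)))
print(angles.max())
```

A low-curvature, off-principal update (the RLVR regime described above) behaves like the small perturbation here; an update aimed at principal directions (the SFT regime) would open up these angles.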

🧐A Hidden Optimization Bias in RLVR

We observe a unique optimization behavior in RLVR!

Aha! RLVR routes updates to the same preferred regions across runs/datasets/recipes for a fixed pretrained model → a model-conditioned optimization bias.

ds-5model-consensus.gif

Figure 2 — RL localizes updates to the same regions for the same model. Each panel shows a 0–1 update mask (1 = changed, 0 = unchanged). Despite different data and algorithms, the same stripe pattern recurs across runs.
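The cross-run consensus in Figure 2 can be quantified with a simple overlap score. This is an illustrative sketch, not the paper's exact metric: build a 0–1 update mask per run, then compare masks with intersection-over-union (IoU), where a high IoU across datasets and recipes is the signature of the model-conditioned bias.

```python
import numpy as np

def update_mask(w_init, w_final, tol=1e-6):
    """0-1 mask of parameters that moved during training."""
    return (np.abs(w_final - w_init) > tol).astype(np.uint8)

def mask_iou(m1, m2):
    """Intersection-over-union between two update masks."""
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / max(union, 1)

# Toy example: two 'runs' that update the same stripe of a weight matrix.
w0 = np.zeros((4, 8))
run_a, run_b = w0.copy(), w0.copy()
run_a[:, :3] += 0.1
run_b[:, :3] += 0.2  # different data/recipe, same preferred region
iou = mask_iou(update_mask(w0, run_a), update_mask(w0, run_b))
print(iou)  # → 1.0
```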

- RL localizes updates to the same regions for the same model.
- The optimization bias persists throughout training steps.
- The bias generalizes across model families.

🤗 A Three-Gate Theory That Explains the Bias

What drives RLVR's distinctive training dynamics?

We propose a Three-Gate Theory: every RL step passes through three gates that collectively steer updates away from principal directions and into low-curvature, spectrum-preserving regions.

⬇️ Gate I — KL Anchor Constrains Your Updates
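To make the "KL anchor" concrete, here is a minimal sketch of the per-token KL to a frozen reference policy, as commonly used in KL-regularized RL fine-tuning (an assumed standard setup, not necessarily the paper's exact objective). Keeping this quantity small bounds how far each update can move the policy in distribution space.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def kl_to_ref(policy_logits, ref_logits):
    """Per-token KL(pi || pi_ref) computed from raw logits."""
    lp, lr = log_softmax(policy_logits), log_softmax(ref_logits)
    return (np.exp(lp) * (lp - lr)).sum(axis=-1)

ref = np.zeros((1, 5))       # uniform reference distribution over 5 tokens
anchored = ref + 1e-2        # constant logit shift: distribution unchanged
drifted = ref.copy()
drifted[0, 0] = 3.0          # a large move away from the reference
kl_anchored = kl_to_ref(anchored, ref)
kl_drifted = kl_to_ref(drifted, ref)
print(kl_anchored, kl_drifted)
```

An anchored step keeps the KL near zero, while a step that pushes hard on the output distribution pays a large KL cost; Gate I is this constraint, and Gates II and III (below) determine where in parameter space a KL-bounded step can actually go.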

⬇️ Gate II — Model Geometry Determines Where a KL-Bounded Step Goes