**Hanqing Zhu^†**, Zhenyu Zhang^, Hanxian Huang†, DiJia Su†, Zechun Liu†, Jiawei Zhao†, Igor Fedorov†, Hamed Pirsiavash†, Zhizhou Sha^, Jinwon Lee†, David Z. Pan^, Zhangyang Wang^, Yuandong Tian*†, Kai Sheng Tai*†
*: Equal advisory contribution, †: Meta AI, ^: The University of Texas at Austin
Link: https://arxiv.org/abs/2511.08567
First online on Nov 10, 2025 | Preliminary version accepted by NeurIPS 2025 ER Workshop
<aside>
TL;DR: We present the first parameter-level map of RLVR’s training dynamics.
We find a surprising regularity in how parameters evolve: RLVR boosts performance without pushing on principal weights. This reflects a model-conditioned optimization bias: for a fixed pretrained model, updates consistently land in the same off-principal, low-curvature region across datasets and RL variants, preserving the weight spectra and causing minimal subspace rotation throughout training.
We introduce a Three-Gate Theory (KL anchor → geometry → precision) that explains and characterizes this distinctive optimization dynamic. It also accounts for several recent observations directly from parameter space: RL updates appear sparse; RL forgets less; online quantization calibrates once. Crucially, it shows that RLVR operates in an optimization regime distinct from SFT (which targets principal weights), so SFT-era PEFT tricks (sparse fine-tuning, advanced LoRA variants, the PiSSA approach mentioned in the recent Thinking Machines blog) do not work well for RL!
We hope this work charts a path toward a white-box understanding of RLVR and the design of RLVR-native learning algorithms, rather than repurposed SFT-era heuristics. 🙂
</aside>

Figure 1 — SFT vs. RLVR: the geometry at a glance. (a) SFT follows an externally guided route, pushing through high-curvature, principal directions ("over the mountain"). (b) RLVR behaves as if steered by an implicit compass, a model-conditioned bias, taking a low-curvature, off-principal detour ("around the mountain"). (c) Evidence. Left: heatmaps compare an update mask (where parameters change) with a principal mask (locations of the top-k principal directions). RLVR updates avoid principal weights, while SFT targets them. Right: principal-angle curves show that RLVR rotates the weight subspaces far less than SFT does.
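To make the measurements in Figure 1 concrete, here is a minimal NumPy sketch of the three quantities involved. The function names, the thresholding scheme, and the choice of comparing top-k *left* singular subspaces are our illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def principal_mask(W, k, frac=0.05):
    # Illustrative "principal weights": the entries with largest magnitude
    # in the rank-k principal component of W (top-k singular directions).
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    Wk = (U[:, :k] * S[:k]) @ Vt[:k]            # rank-k reconstruction
    thresh = np.quantile(np.abs(Wk), 1 - frac)  # keep the top `frac` entries
    return np.abs(Wk) >= thresh

def update_mask(W_before, W_after, tol=1e-6):
    # Binary mask of entries that changed during training (0/1 in the heatmaps).
    return np.abs(W_after - W_before) > tol

def principal_angles(W_before, W_after, k):
    # Principal angles (radians) between the top-k left singular subspaces,
    # via singular values of the cross-Gram matrix of orthonormal bases.
    Ua = np.linalg.svd(W_before, full_matrices=False)[0][:, :k]
    Ub = np.linalg.svd(W_after, full_matrices=False)[0][:, :k]
    s = np.linalg.svd(Ua.T @ Ub, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))
```

Under this sketch, "RLVR avoids principal weights" corresponds to a small overlap between `update_mask` and `principal_mask`, and "minimal subspace rotation" corresponds to `principal_angles` staying near zero throughout training.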
We observe a unique optimization behavior in RLVR!
Aha! RLVR routes updates to the same preferred regions across runs/datasets/recipes for a fixed pretrained model → a model-conditioned optimization bias.

Figure 2 — RL localizes updates to the same regions for the same model. Each panel shows a 0–1 update mask (1 = changed, 0 = unchanged). Despite different datasets and algorithms, the same stripe pattern recurs across runs.
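The cross-run agreement in Figure 2 can be quantified with a simple overlap score between two binary update masks. Intersection-over-union is one natural choice here; it is our illustrative metric, not necessarily the one used in the paper.

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    # Intersection-over-union of two binary update masks: 1.0 means the two
    # runs changed exactly the same entries, 0.0 means no overlap at all.
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / max(union, 1)  # guard against two all-zero masks
```

A model-conditioned bias would show up as a high `mask_iou` between runs trained on different data or with different RL recipes, for the same pretrained model.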
What drives RLVR's distinctive training dynamics?
We propose a Three-Gate Theory: every RL step passes through three gates that collectively steer updates away from principal directions and into low-curvature, spectrum-preserving regions.