Group Relative Policy Optimization (GRPO) has proven highly effective in enhancing the alignment capabilities of Large Language Models (LLMs). However, current adaptations of GRPO for flow matching-based image generation neglect a foundational conflict between its core principles and the distinct dynamics of the visual synthesis process. This mismatch leads to two key limitations: (i) Uniformly applying a sparse terminal reward across all timesteps impairs temporal credit assignment, ignoring the differing criticality of generation phases, from early structure formation to late-stage refinement. (ii) Exclusive reliance on relative, intra-group rewards causes the optimization signal to fade as training converges, leading to optimization stagnation once reward diversity is entirely depleted. To address these limitations, we propose Value-Anchored Group Policy Optimization (VGPO), a framework that redefines value estimation across both temporal and group dimensions. Specifically, VGPO transforms the sparse terminal reward into dense, process-aware value estimates, enabling precise credit assignment by modeling the expected cumulative reward at each generative stage. Furthermore, VGPO replaces standard group normalization with a novel advantage-estimation process enhanced by absolute values, maintaining a stable optimization signal even as reward diversity declines. Extensive experiments on three benchmarks demonstrate that VGPO achieves state-of-the-art image quality while simultaneously improving task-specific accuracy, effectively mitigating reward hacking.
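To make the second limitation concrete, the short sketch below (illustrative only; the group size and reward values are made up and do not come from the paper) shows how a purely group-relative, normalized advantage collapses to zero once every sample in a group receives the same reward:

    import numpy as np

    def group_relative_advantage(rewards, eps=1e-8):
        # Standard group-relative advantage: normalize rewards within a sampled group.
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    early = np.array([0.2, 0.5, 0.9, 0.4])   # diverse rewards early in training
    late = np.array([0.9, 0.9, 0.9, 0.9])    # identical rewards after convergence

    print(group_relative_advantage(early))   # informative, non-zero advantages
    print(group_relative_advantage(late))    # all zeros: the update signal vanishes

Once the reward standard deviation within a group reaches zero, the relative term carries no gradient information, which is the stagnation VGPO is designed to avoid.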
Motivation. (Left) Sparse vs. Instant Reward Signals During Generation. The sparse terminal reward remains constant across timesteps, failing to provide distinct values for intermediate steps. (Right) Diminishing Reward Std as the Policy Converges. Because the optimization relies on reward diversity, the reward std declines as training advances, potentially leading to optimization stagnation.
VGPO Pipeline. First, to resolve faulty credit assignment, the Temporal Cumulative Reward Mechanism (TCRM) transforms sparse terminal rewards into dense, forward-looking process values, enabling more granular, temporally aware credit assignment. Second, to counteract policy collapse, Adaptive Dual Advantage Estimation (ADAE) replaces standard normalization with a novel absolute-value-enhanced process for advantage computation, ensuring a persistent optimization signal that remains stable even when reward diversity diminishes.
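The following is a minimal sketch of how the two components could fit together, under stated assumptions: the placeholder value estimator, the blending weight alpha, and the specific combination rule are hypothetical stand-ins for the paper's TCRM and ADAE formulations, not the actual implementation.

    import numpy as np

    def dense_process_values(terminal_reward, num_steps, value_fn=None):
        # TCRM-style idea: replace a single terminal reward with per-step estimates
        # of the expected cumulative reward. The interpolation below is a placeholder
        # estimator; later denoising steps get values closer to the observed outcome.
        if value_fn is None:
            return np.linspace(0.5 * terminal_reward, terminal_reward, num_steps)
        return np.array([value_fn(t) for t in range(num_steps)])

    def dual_advantage(group_values, alpha=0.5, eps=1e-8):
        # ADAE-style idea: blend a group-relative term with an absolute-value term
        # so the signal does not vanish when all samples in the group score alike.
        relative = (group_values - group_values.mean()) / (group_values.std() + eps)
        absolute = group_values  # anchor on the process value itself
        return alpha * relative + (1.0 - alpha) * absolute

    # Toy usage: a group of 4 samples, 10 denoising steps each.
    terminal_rewards = [0.91, 0.90, 0.92, 0.90]
    per_step_values = np.stack([dense_process_values(r, 10) for r in terminal_rewards])
    adv = np.stack([dual_advantage(per_step_values[:, t]) for t in range(10)]).T
    print(adv.shape)  # (4 samples, 10 steps): non-zero even with near-identical rewards

The key design point this sketch illustrates is the combination of per-step (rather than terminal-only) values with an advantage that retains an absolute component when intra-group reward differences shrink.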
Comparison Results on Compositional Image Generation, Visual Text Rendering, and Human Preference Alignment benchmarks, evaluated by task performance, image quality, and preference score. ImgRwd: ImageReward; UniRwd: UnifiedReward.
Qualitative Comparison. VGPO achieves superior performance in task accuracy, image quality, and fine-grained detail.
Ablation Analysis. The impact of TCRM is evaluated on the (a) OCR and (b) PickScore benchmarks, while (c) assesses ADAE's contribution to image quality at equivalent OCR accuracy. Quality is the average of the five image-quality metrics.
@article{VGPO_Shao,
title={Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment},
author={Yawen Shao and Jie Xiao and Kai Zhu and Yu Liu and Wei Zhai and Yang Cao and Zheng-Jun Zha},
journal={arXiv preprint arXiv:2512.12387},
year={2025}
}