Multi-layer GRPO

Fei Ding

Authors: Fei Ding

The success of DeepSeek-R1 has demonstrated the effectiveness of the GRPO algorithm. However, because GRPO uses no process rewards, its exploration is often inefficient: a single error in one detail can make the final answer entirely incorrect, so the whole response receives zero reward. To address this, we propose MGRPO (Multi-layer GRPO). The first layer runs GRPO exactly as in the original algorithm and generates an initial response. That response is then fed into a second-layer GRPO process, which primarily trains the model to correct errors. Experimental results indicate that MGRPO outperforms standard GRPO.
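The two-layer data flow described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract gives no prompt template, group size, or loss, so `build_correction_prompt`, `mgrpo_step`, the `sample` and `reward` callables, and `group_size` are all hypothetical names chosen for illustration. The only part taken from standard GRPO is the group-relative advantage, which normalizes each reward against the mean and standard deviation of its own sampling group.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def build_correction_prompt(question, first_response):
    # Hypothetical template; the abstract does not specify the actual wording.
    return (
        f"Question: {question}\n"
        f"Previous answer: {first_response}\n"
        "Check the answer above and correct any errors."
    )


def mgrpo_step(question, sample, reward, group_size=4):
    """Collect one batch of two-layer rollouts (assumed structure).

    sample(prompt) -> str and reward(response) -> float are supplied by the
    caller; the policy-gradient update itself is omitted.
    """
    # Layer 1: ordinary GRPO sampling on the original question.
    first = [sample(question) for _ in range(group_size)]
    adv1 = group_relative_advantages([reward(r) for r in first])

    # Layer 2: each first-layer response becomes part of a new prompt,
    # and a fresh group of corrections is sampled and scored.
    layer2 = []
    for response in first:
        prompt = build_correction_prompt(question, response)
        corrected = [sample(prompt) for _ in range(group_size)]
        adv2 = group_relative_advantages([reward(c) for c in corrected])
        layer2.append((prompt, corrected, adv2))
    return (first, adv1), layer2
```

In this sketch, a group whose responses all receive zero reward yields all-zero advantages and therefore no learning signal, which is the inefficiency the abstract attributes to outcome-only rewards. The second layer gives those failed responses another chance to produce a nonzero signal through correction.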

Comments: 7 Pages.

Submission history

[v1] 2025-03-16 01:20:00

Artificial Intelligence
