Latest Articles

Dive deep into the world of Artificial Intelligence with our curated collection of articles, covering the latest breakthroughs and insights from leading researchers and engineers.

Filtering by tag: GRPO (7 articles)

March 18, 2026 | Technology

Flexible Entropy Control in RLVR: Fixing Policy Entropy Collapse with Dynamic Clipping

A practical guide to policy entropy collapse in RLVR and GRPO, covering why PPO clipping drives entropy decay and how dynamic clipping schedules restore exploration.

Qing Ke Ai

LLM Reinforcement Learning, RLVR, GRPO, +3 more

February 27, 2026 | Technology

LLM Reinforcement Learning (RL): REINFORCE, PPO, GRPO, and Production Engineering

A practical LLM Reinforcement Learning guide covering REINFORCE to PPO/GRPO derivations, plus production engineering patterns like async rollouts, importance sampling, and token-stream stability.

Chi Guo Dong Bu Tu Guo Dong Pi

LLM Reinforcement Learning, Reinforcement Learning for LLMs, GRPO, +3 more

September 16, 2025 | Technology

GRPO-RoC Explained: Better Training for Tool-Augmented AI (Complete Guide)

Learn how GRPO-RoC fixes outcome-based reward issues. This training method improves AI reasoning by 40% through data curation. Includes code examples and benchmarks.

Qing Ke Ai

GRPO-RoC, tool-augmented models, AI training, +4 more

September 3, 2025 | Technology

Replicate DeepSeek R1 with RL: A Guide

Learn to replicate the DeepSeek R1 training process. This guide covers building a reinforcement learning pipeline from scratch using GRPO for advanced LLM reasoning.

Ning Si Ai

DeepSeek R1, Reinforcement Learning, Group Relative Policy Optimization, +1 more

July 13, 2025 | Technology

Reinforcement Learning for LLM Reasoning: Trends & Insights

The field of artificial intelligence has seen rapid advancements in reinforcement learning for reasoning, particularly within large language models (LLMs). This article reviews influential research s...

Alex

reinforcement learning for reasoning, RL-based reasoning in large language models, GRPO, +1 more

July 10, 2025 | Technology

Qwen3 Training Pipeline: 35T Tokens + GRPO RL (10x Faster Than PPO - 2025)

Qwen3 Training Pipeline: Pre-training, Reinforcement Learning, and Model Distillation. Qwen3 Pre-training: Building a Robust Foundation. Qwen3 training begins with a comprehensive three-stage ...

Alex

Qwen3 training, Qwen3 pre-training, Qwen3 reinforcement learning, +3 more

July 9, 2025 | Technology

Train 671B DeepSeek V3: RLHF Guide (10x Faster with GRPO - 2025)

Master training 671B-parameter LLMs with RL. Solve 5 critical challenges, including Megatron vs. FSDP, memory offloading, weight conversion, and 1000+ GPU scaling. Follows a real DeepSeek V3 workflow with GRPO that reports a 10x speedup.

Alex

671B parameter LLM, Reinforcement Learning, RLHF, +4 more