LLM Reinforcement Learning (RL): REINFORCE, PPO, GRPO, and Production Engineering
A practical guide to reinforcement learning for LLMs: derivations from REINFORCE through PPO and GRPO, plus production engineering patterns such as async rollouts, importance sampling, and token-stream stability.
Chi Guo Dong Bu Tu Guo Dong Pi