Latest Articles

Dive deep into the world of Artificial Intelligence with our curated collection of articles, covering the latest breakthroughs and insights from leading researchers and engineers.

Filtering by tag: Reinforcement Learning (11 articles)

December 9, 2025 | Technology

12 Practical Lessons for RL Training: Hard-Won Insights from Production

Discover 12 battle-tested lessons from months of production RL training. Learn why stability trumps everything, how agentic RL differs from reasoning RL, and practical strategies to avoid reward hacking in LLM training pipelines.

Chi Guo Dong Bu Tu Guo Dong Pi

Reinforcement Learning, RL training, agentic RL, +5 more

December 3, 2025 | Large Language Models

DeepSeek-V3.2 vs V3.2-Speciale: Advanced AI Reasoning Models Compared (2025)

DeepSeek-V3.2 rivals Gemini 3.0-Pro with 3 breakthrough innovations: DSA sparse attention, scalable RL framework, and 85K+ agent training tasks. Compare V3.2 vs Speciale for your use case.

Chi Guo Dong Bu Tu Guo Dong Pi

DeepSeek-V3.2, AI Agent, reasoning, +2 more

November 15, 2025 | Large Language Models

Kimi K2: A Trillion-Parameter Open-Source LLM

Explore Kimi K2, the 1.04T parameter open-source MoE model. Our deep dive covers its MuonClip optimizer, agentic AI training, and benchmark performance.

Ji Zhi Liu

Kimi K2, MoE, LLM architecture, +4 more

November 8, 2025 | AI Research

Jason Wei's 3 Laws of AI: A Future Framework for 2025

Explore Jason Wei's three laws of AI: the Verifier's Law, Commoditization of Intelligence, and the Jagged Edge. A framework for understanding AI's future progress and automation timeline.

Founder Park

AI Agent, AI Research, Chain-of-Thought, +2 more

November 6, 2025 | Technology

OpenRLHF vs veRL: Ray Framework Deep Dive for Distributed RLHF (2025)

Master distributed RLHF frameworks: Compare OpenRLHF and veRL architectures. Learn Ray Actors, GPU colocation, PPO implementation, and hybrid engine design for scalable reinforcement learning systems.

Qing Ke Ai

OpenRLHF, veRL, RLHF, +5 more

October 17, 2025 | Technology

Why AI Agents Fail: Latency, Planning & Reflection (2025 Guide)

AI agent challenges explained: solve latency issues, fix brittle planning, and avoid reflection loops. Advanced engineering patterns and RL techniques for production-ready agentic AI systems.

Qing Ke Ai

AI agents, agentic AI, multi-agent, +6 more

September 16, 2025 | Technology

GRPO-RoC Explained: Better Training for Tool-Augmented AI (Complete Guide)

Learn how GRPO-RoC fixes outcome-based reward issues. This training method improves AI reasoning by 40% through data curation. With code examples & benchmarks.

Qing Ke Ai

GRPO-RoC, tool-augmented models, AI training, +4 more

September 4, 2025 | Technology

DeepSeek-Coder-V2's Reward Model Explained

Explore the 5 core reward functions powering DeepSeek-Coder-V2. Learn how its modular reward model for accuracy, reasoning, and format shapes AI behavior.

Ning Si Ai

DeepSeek-Coder-V2, reward model, reward function, +1 more

September 3, 2025 | Technology

Replicate DeepSeek R1 with RL: A Guide

Learn to replicate the DeepSeek R1 training process. This guide covers building a reinforcement learning pipeline from scratch using GRPO for advanced LLM reasoning.

Ning Si Ai

DeepSeek R1, Reinforcement Learning, Group Relative Policy Optimization, +1 more

July 24, 2025 | Technology

Two Major Challenges in Reinforcement Learning Finally Solved by ICLR Papers

Traditional reinforcement learning models struggle with real-time applications due to "AI lag." Two ICLR 2025 papers from Mila introduce groundbreaking solutions to tackle inaction and delay regret, enabling large AI models to operate in high-frequency, dynamic environments without compromising speed or intelligence.

Alex

Technology, AI Innovation, +1 more

July 9, 2025 | Technology

Train 671B DeepSeek V3: RLHF Guide (10x Faster with GRPO - 2025)

Master training 671B parameter LLMs with RL. Solve 5 critical challenges: Megatron vs FSDP, memory offloading, weight conversion, 1000+ GPU scaling. Real DeepSeek V3 workflow with GRPO achieving 10x speedup.

Alex

671B parameter LLM, Reinforcement Learning, RLHF, +4 more