GURU: A Reinforcement Learning Framework that Bridges LLM Reasoning Across Six Domains


Limitations of Reinforcement Learning in Narrow Reasoning Domains

Reinforcement Learning (RL) has demonstrated strong potential to enhance the reasoning capabilities of LLMs, particularly in leading systems such as OpenAI-O3 and DeepSeek-R1. However, most RL research has focused narrowly on math and code, limiting its general applicability. This narrow scope poses two issues: our understanding of how RL improves reasoning may not generalize beyond these domains, and the resulting models often lack versatility. Expanding RL to broader reasoning tasks is challenging because it requires reliable reward signals and curated datasets, which are easy to define for math and code but difficult to construct for open-ended reasoning domains. 

Narrow Domain Focus and Generalization Challenges

Reinforcement Learning (RL) has become a popular method for enhancing the reasoning skills of LLMs, especially after successes with models like OpenAI-O3 and DeepSeek-R1. Many open-source efforts have followed, focusing primarily on mathematical and coding domains. While these models perform well in their niches, their reasoning doesn't always generalize to broader tasks. At the same time, research has explored how RL influences reasoning. Some studies suggest RL doesn't teach new skills but instead boosts the model's ability to access reasoning patterns it already has. However, newer work indicates that extended RL training may unlock entirely new reasoning strategies.

Introduction of GURU Dataset: A Multi-Domain RL Benchmark

Researchers from UC San Diego, MBZUAI, Carnegie Mellon, and Purdue introduce GURU, a 92K-example RL dataset covering six reasoning domains: Math, Code, Science, Logic, Simulation, and Tabular. Each domain is carefully built with tailored reward functions and rigorous filtering. Training models on GURU reveals that RL outcomes depend heavily on domain familiarity: common domains benefit from cross-domain RL, while unfamiliar ones require in-domain training to improve significantly. Their models, GURU-7B and GURU-32B, outperform prior open models by up to 7.9% across 17 tasks. These findings highlight RL's domain-specific effects and the value of broad, multi-domain reasoning benchmarks. 
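To make "tailored reward functions" concrete: for verifiable domains such as math, the reward is typically binary, granted only when the model's final answer matches the reference. The paper's exact reward code is not reproduced here; the snippet below is a minimal illustrative sketch (the function name `math_reward` and the `\boxed{...}` extraction heuristic are assumptions, not the authors' implementation).

```python
import re

def math_reward(model_output: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the extracted final answer
    matches the gold answer exactly, else 0.0."""
    # Prefer the last \boxed{...} expression; fall back to the last line.
    matches = re.findall(r"\\boxed\{([^}]*)\}", model_output)
    candidate = matches[-1] if matches else model_output.strip().splitlines()[-1]
    return 1.0 if candidate.strip() == gold_answer.strip() else 0.0
```

Real systems add answer normalization (e.g., treating `1/2` and `0.5` as equal), while code domains instead execute the model's program against unit tests, which is why each domain needs its own reward design.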

Cross-Domain vs. In-Domain Reinforcement Learning Effects

To better understand how RL supports reasoning across domains, the researchers trained models on both individual and mixed-domain data from the GURU dataset. They found that domains such as Math, Code, and Science benefited more from cross-domain RL, likely due to their stronger presence in pre-training. Mixed-domain training performed as well as or better than single-domain training, showing that combining diverse tasks can enhance general reasoning. However, training only on harder examples improved performance in that domain but reduced accuracy on simpler questions in other domains. These findings suggest that data diversity and balanced difficulty are key to effective, transferable reasoning skills. 

GURU Model Architecture and Evaluation Strategy

The study trained 7B- and 32B-parameter models on the GURU dataset to explore how combining multiple domains during RL improves reasoning abilities. Using the verl framework and the GRPO algorithm, models were evaluated on a wide range of tasks, including math, code, logic, science, simulation, and tabular reasoning, using consistent metrics. Results showed that GURU models outperformed domain-specific baselines and performed well on unseen tasks. Notably, analysis of Pass@k revealed that performance depends on task type, model size, and decoding settings. Larger models benefited more from RL, and tweaking sampling parameters, such as temperature and top-p, helped improve model diversity and reasoning coverage.
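For readers unfamiliar with the Pass@k metric mentioned above: it estimates the probability that at least one of k sampled solutions is correct, given n generated samples of which c passed. The standard unbiased estimator (introduced with HumanEval) can be computed as follows; this is a general-purpose sketch, not code from the GURU release.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator.
    n: total samples generated per problem
    c: number of samples that were correct
    k: evaluation budget (k <= n)"""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain at least one correct sample.
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 3 correct, budget k=1 -> estimate 0.3
print(pass_at_k(10, 3, 1))
```

Pass@k at large k is often read as a proxy for a model's latent reasoning coverage, which is why the paper's analysis of it across task types and model sizes is informative.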

Summary: General-Purpose Reasoning with GURU

In conclusion, GURU is a curated RL dataset containing 92,000 high-quality, verifiable examples across six reasoning domains: Math, Code, Science, Logic, Simulation, and Tabular. Unlike prior RL research, which has focused mainly on math and code, GURU enables broader reasoning studies by providing domain-specific reward signals. The researchers train two models, GURU-7B and GURU-32B, which achieve state-of-the-art results on 17 benchmark tasks, particularly excelling in domains underrepresented during pretraining. Their findings show RL can both refine existing knowledge and foster new reasoning abilities. All data, models, and code are publicly released to support further general-purpose reasoning research. 


Check out the Paper, Project Page and GitHub Page. All credit for this research goes to the researchers of this project.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.


