
Reasoning at Intuit Biztec: Give Your Models the Ability to Think

Peyara Nando, Dev

Understanding the Reasoning Revolution

At Intuit Biztec, we've dedicated significant resources to analyzing DeepSeek's groundbreaking research on reasoning capabilities in large language models. The technical papers released by DeepSeek have provided crucial insights into how advanced reasoning can be cultivated in AI systems—knowledge that has proven invaluable as we continue to provide reasoning enhancement services for older model architectures.

The Reinforcement Learning Breakthrough

DeepSeek's most significant innovation was their novel approach to reinforcement learning (RL). While most companies had treated reasoning as an ability that must first be seeded through supervised fine-tuning (SFT), DeepSeek demonstrated that RL alone could develop sophisticated reasoning patterns in their models.

Their DeepSeek-R1-Zero model—trained via large-scale RL without any SFT as a preliminary step—achieved impressive performance across mathematical and coding benchmarks, with scores rivaling those of models from industry giants like OpenAI.

The Evolution of a Reasoning Mind

Perhaps the most fascinating aspect of DeepSeek's approach is what they termed the "self-evolution process." Our team was particularly struck by how their models organically developed sophisticated reasoning behaviors through training.

As documented in their paper, the thinking time of DeepSeek-R1-Zero showed consistent improvement throughout the training process—not through external adjustments but as an intrinsic development within the model. The AI naturally learned to solve increasingly complex reasoning tasks by leveraging extended computation time, ranging from hundreds to thousands of reasoning tokens.

One captivating element was their documentation of an "aha moment" where the model learns to reevaluate its approach to problems. This moment—where the model essentially stops, reconsiders, and chooses a new solution path—mirrors human cognitive processes in a way that previous approaches had failed to capture.

The Four-Stage Training Pipeline

For their more refined DeepSeek-R1 model, researchers implemented a comprehensive four-stage training approach that we at Intuit Biztec have studied extensively (a code sketch of the full flow follows the list):

1. Cold Start with Chain-of-Thought Examples: Unlike the SFT-free R1-Zero approach, this phase incorporated thousands of high-quality reasoning examples to provide initial guidance.

2. Reasoning-oriented Reinforcement Learning: Building on this foundation, they applied large-scale RL focused on enhancing the model's reasoning capabilities across domains like mathematics, coding, and science.

3. Rejection Sampling and Supervised Fine-Tuning: Upon convergence of the RL process, they generated new training data through rejection sampling, keeping only the most accurate responses.

4. Full-Spectrum Reinforcement Learning: Finally, they implemented a secondary RL stage aimed at improving helpfulness and harmlessness while preserving reasoning capabilities.
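
To make the flow of these stages concrete, here is a minimal Python sketch of the pipeline. The helper names (`sft`, `rl_train`, `rejection_sample`) are hypothetical stand-ins for what would each be a substantial training job, not DeepSeek's actual code:

```python
# A schematic sketch of the four-stage pipeline described above,
# using hypothetical stub helpers rather than real training code.

def sft(model, examples):
    """Supervised fine-tuning on (prompt, completion) pairs (stub)."""
    return model  # stand-in: a real version would update the weights

def rl_train(model, prompts, reward_fn):
    """Large-scale RL (e.g., GRPO) against a reward function (stub)."""
    return model

def rejection_sample(model, prompts, reward_fn, k=16):
    """Sample k candidates per prompt and keep only the best-scoring one."""
    kept = []
    for prompt in prompts:
        candidates = [model(prompt) for _ in range(k)]
        best = max(candidates, key=lambda c: reward_fn(prompt, c))
        if reward_fn(prompt, best) > 0:  # discard prompts with no good answer
            kept.append((prompt, best))
    return kept

def train_r1_style(base, cot_seed, reasoning_prompts, general_prompts,
                   reasoning_reward, preference_reward):
    model = sft(base, cot_seed)                                   # Stage 1: cold start
    model = rl_train(model, reasoning_prompts, reasoning_reward)  # Stage 2: reasoning RL
    filtered = rejection_sample(model, reasoning_prompts, reasoning_reward)
    model = sft(model, filtered)                                  # Stage 3: SFT on kept samples
    model = rl_train(model, general_prompts, preference_reward)   # Stage 4: alignment RL
    return model
```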

Technical Innovations Behind the Scenes

DeepSeek's implementation of Group Relative Policy Optimization (GRPO) was particularly noteworthy. This algorithm forgoes the traditional critic model—typically the same size as the policy model—and instead estimates baselines from group scores. This optimization significantly reduced computational requirements while maintaining performance.
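
The core of GRPO is visible in how it turns a group of sampled rewards into advantages without any value network. The sketch below shows only this normalization step; it is a simplification, as the full objective also includes the clipped policy-gradient ratio and a KL penalty:

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Compute group-relative advantages: each of the G responses sampled
    for the same prompt is scored against the mean and standard deviation
    of its own group, replacing a learned critic as the baseline."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards against zero variance

# Four responses sampled for one prompt, scored by a binary accuracy reward:
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # approx. [1, -1, 1, -1]
```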

Their reward system also departed from conventional approaches. Rather than implementing complex neural reward models that could be susceptible to reward hacking, DeepSeek opted for rule-based rewards focused on accuracy and formatting. Our testing has confirmed that this simpler approach often yields more consistent improvements in reasoning capability.
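
As an illustration, a reward of this kind needs little more than a format check and an answer check. The think tags match the output template described in DeepSeek's papers, while the weights and exact-match comparison below are our own simplifying assumptions:

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """A minimal rule-based reward with two components: formatting
    (reasoning wrapped in <think>...</think> before a final answer)
    and accuracy (the extracted answer matches the reference)."""
    reward = 0.0
    # Format component: the response contains a complete think block
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.1
    # Accuracy component: compare whatever follows the think block
    match = re.search(r"</think>\s*(.*)\s*$", response, flags=re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0
    return reward

print(rule_based_reward("<think>2 + 2 is 4</think> 4", "4"))  # 1.1
```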

Distillation: Democratizing Advanced Reasoning

One of the most promising aspects of DeepSeek's research for our work at Intuit Biztec was their demonstration that reasoning capabilities could be effectively distilled from larger models to smaller ones. Their experiments showed that directly distilling DeepSeek-R1's reasoning abilities into smaller models like Qwen-7B and Llama-8B produced better results than applying RL directly to these smaller models.

Indeed, their distilled 14B model outperformed many larger models, including some with 32B parameters, confirming that well-designed distillation can be more efficient than scaling model size.
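
In practice, this style of distillation amounts to supervised fine-tuning on curated teacher outputs. The sketch below assumes hypothetical `teacher_generate` and `is_correct` callables rather than any specific library API:

```python
def build_distillation_set(teacher_generate, prompts, is_correct):
    """Collect reasoning traces from a strong teacher model and keep
    only verified-correct ones as SFT data for a smaller student.
    `teacher_generate` and `is_correct` are hypothetical callables."""
    dataset = []
    for prompt in prompts:
        trace = teacher_generate(prompt)   # full chain of thought plus answer
        if is_correct(prompt, trace):      # drop traces with wrong final answers
            dataset.append({"prompt": prompt, "completion": trace})
    return dataset
```

The student (for example, a 7B or 8B checkpoint) is then fine-tuned on this dataset with a standard next-token objective; notably, DeepSeek applied no RL stage to the distilled students at all.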

Applying These Insights at Intuit Biztec

Drawing on DeepSeek's pioneering work, our team at Intuit Biztec has developed specialized services to enhance reasoning capabilities in existing AI systems. By adapting their multi-stage training approach and distillation techniques, we've successfully retrofitted older models with advanced reasoning abilities previously thought to require complete retraining or significantly larger parameter counts.

Our benchmarking shows that these enhanced models perform particularly well on mathematical reasoning, step-by-step problem solving, and code generation tasks—precisely the areas where DeepSeek's models demonstrated their strongest improvements.

The Future of AI Reasoning

As we continue to build upon DeepSeek's foundational research, we anticipate further advancements in how reasoning capabilities can be cultivated and transferred between models. The concept of emergent reasoning—where sophisticated problem-solving behaviors arise naturally from well-designed reinforcement signals—represents a significant shift in how we understand and develop AI systems.

By combining DeepSeek's innovations with our own expertise in model optimization, Intuit Biztec remains committed to pushing the boundaries of what's possible in artificial intelligence reasoning—making these capabilities accessible to a wider range of models and applications than ever before.