Project Overview
Developing a novel approach to improving LLMs’ mathematical reasoning through self-supervision, building on MATH-SHEPHERD’s success with Process Reward Models (PRMs). The method aims to internalize verification capability within the model itself while maintaining training stability through curriculum learning.
In essence, the method substitutes compute for large amounts of training data, shifting the focus from raw data quantity to the model’s own understanding and reasoning ability. As a result, it has the potential to generalize to novel problems, enabling LLMs to solve challenges they have never seen before, a closer approximation of genuine intelligence.
Key innovations:
- An internal Process Reward Model trained with reinforcement learning
- Stable training through a hierarchical curriculum framework
- Novel completion-based verification mechanism (sketched after this list)
- PPO-based optimization for mathematical reasoning
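As a minimal sketch of the completion-based verification idea: a reasoning step can be scored the way MATH-SHEPHERD’s soft-estimation scheme does, by sampling several completions from a partial solution and taking the fraction that reach the known final answer as the step reward. The helpers `sample_completions` and `extract_answer` below are hypothetical stand-ins for the model’s decoding and answer-parsing routines, and `k` is an assumed sample budget.

```python
from typing import Callable

def completion_based_step_reward(
    partial_solution: str,
    gold_answer: str,
    sample_completions: Callable[[str, int], list[str]],  # placeholder decoder
    extract_answer: Callable[[str], str],                 # placeholder parser
    k: int = 8,                                           # assumed sample budget
) -> float:
    """Score a partial reasoning trace by its completion success rate.

    Soft estimation in the spirit of MATH-SHEPHERD: sample k completions
    from the current partial solution and use the fraction that reach the
    known final answer as the step-level reward in [0, 1].
    """
    completions = sample_completions(partial_solution, k)
    hits = sum(extract_answer(c) == gold_answer for c in completions)
    return hits / k
```

This per-step score is what the PPO loop would consume as its process-level reward signal.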
Technical Approach
- Hierarchical learning framework for progressive capability building
- Multiple-completion mechanism for robust verification
- Reward function based on completion success rates
- Automated mastery verification system (see the sketch after this list)
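One plausible reading of the automated mastery verification system, offered as a sketch rather than the proposal’s actual design: track a rolling success rate on the current difficulty tier and promote the curriculum only when that rate clears a mastery threshold. The tier names, the 0.8 threshold, and the 200-problem window are illustrative assumptions.

```python
from collections import deque

class CurriculumGate:
    """Promote training to the next difficulty tier once a rolling
    success rate clears a mastery threshold (illustrative parameters)."""

    def __init__(self,
                 tiers=("prealgebra", "algebra", "intermediate_algebra"),
                 threshold=0.8,   # assumed mastery bar
                 window=200):     # assumed rolling-window size
        self.tiers = tiers
        self.level = 0
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # rolling pass/fail record

    @property
    def current_tier(self) -> str:
        return self.tiers[self.level]

    def record(self, solved: bool) -> None:
        """Log one problem outcome and promote on sustained mastery."""
        self.recent.append(solved)
        if (len(self.recent) == self.recent.maxlen
                and sum(self.recent) / len(self.recent) >= self.threshold
                and self.level < len(self.tiers) - 1):
            self.level += 1
            self.recent.clear()  # fresh statistics for the new tier
```

During training, each solved or failed problem would be logged via `record`, and the data loader would draw new problems from `current_tier`.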
Prototype Goals
- Implement base verification mechanism
- Test curriculum learning structure
- Focus on algebra-level problems as a proof of concept
Resource Requirements
- Single GPU (A100/V100)
- 4-6 weeks development time
- ~100GB storage
- Scoped to an initial prototype phase
Background
PhD in Applied Mathematics with a focus on stochastic processes and optimization. Experience in complex system modeling and market dynamics.
Seeking
- GPU access for initial prototyping
- Technical feedback on implementation approach
- Collaboration opportunities
Detailed technical proposal and implementation plan available upon request.