[Research & Compute Request] Self-Supervised Mathematical Reasoning in LLMs

Project Overview

Developing a novel approach to improve LLMs’ mathematical reasoning through self-supervision, building on MATH-SHEPHERD’s success with Process Reward Models (PRMs). The method aims to internalize verification capabilities within the model itself while maintaining training stability through curriculum learning.

This method essentially trades large amounts of training data for compute time, shifting the focus from raw data quantity to the model’s own understanding and reasoning ability. As a result, the approach has the potential to generalize to novel problems, enabling LLMs to tackle challenges absent from their training data rather than relying on memorized solutions.

Key innovations:

  • Internal Process Reward Model using reinforcement learning
  • Stable training through a hierarchical curriculum framework
  • Novel completion-based verification mechanism (sketched below)
  • PPO-based optimization for mathematical reasoning
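
The completion-based verification follows MATH-SHEPHERD’s soft-label idea: score a reasoning step by the fraction of sampled continuations from that step which still reach the known correct answer. Below is a minimal sketch; `sample_completion` and `extract_answer` are placeholder hooks for whatever decoding and answer parsing the prototype ends up using, and `n_completions=8` is an illustrative setting.

```python
from typing import Callable, List

def step_verification_score(
    sample_completion: Callable[[str], str],  # prefix -> full completion (placeholder hook)
    extract_answer: Callable[[str], str],     # completion -> final answer string (placeholder hook)
    problem: str,
    steps_so_far: List[str],
    gold_answer: str,
    n_completions: int = 8,                   # illustrative setting
) -> float:
    # Build the partial solution up to (and including) the step being scored.
    prefix = problem + "\n" + "\n".join(steps_so_far)
    # Score the step by how often completions sampled from this prefix
    # still reach the known correct answer.
    hits = sum(
        extract_answer(sample_completion(prefix)) == gold_answer
        for _ in range(n_completions)
    )
    return hits / n_completions
```

A score near 1 marks the step as likely sound, while a score near 0 flags a probable error; these soft labels are the training signal the internal PRM would learn to reproduce.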

Technical Approach

  • Hierarchical learning framework for progressive capability building
  • Multiple-completion mechanism for robust verification
  • Reward function based on completion success rates
  • Automated mastery verification system (see the curriculum sketch after this list)
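
As a rough illustration of the hierarchical curriculum and the automated mastery check, here is one way the loop could be structured. `train_on` and `success_rate` are assumed hooks, and the threshold and pass budget are illustrative values, not proposed hyperparameters.

```python
from typing import Callable

def run_curriculum(
    train_on: Callable[[int], None],       # one training pass at a given level (assumed hook)
    success_rate: Callable[[int], float],  # held-out accuracy at that level (assumed hook)
    n_levels: int,
    mastery_threshold: float = 0.9,        # illustrative value
    max_passes_per_level: int = 50,        # illustrative budget
) -> None:
    for level in range(n_levels):
        for _ in range(max_passes_per_level):
            train_on(level)
            # Automated mastery verification: only advance once the
            # model reliably solves held-out problems at this level.
            if success_rate(level) >= mastery_threshold:
                break
```

Gating advancement on held-out accuracy rather than training loss ties progression to measured problem-solving ability, which is the stability property the curriculum is meant to provide.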

Prototype Goals

  • Implement the base verification mechanism (reward-shaping sketch after this list)
  • Test the curriculum learning structure
  • Focus on algebra-level problems as a proof of concept
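
To make the verification goal concrete, here is one hedged sketch of how the per-step verification scores could feed PPO as dense rewards; the outcome bonus and the shaping scheme are assumptions, not settled design choices.

```python
from typing import List

def shaped_rewards(
    step_scores: List[float],   # per-step verification scores from the sketch above
    final_correct: bool,        # did the full solution reach the gold answer?
    outcome_bonus: float = 1.0, # illustrative weight (an assumption)
) -> List[float]:
    # Dense reward: each reasoning step earns its verification score,
    # and the final step earns an extra bonus when the answer is right.
    rewards = list(step_scores)
    if rewards and final_correct:
        rewards[-1] += outcome_bonus
    return rewards
```

For the algebra proof of concept, step segmentation could be as simple as newline-delimited solution lines, with the shaped rewards dropped in wherever the PPO implementation expects per-step reward signals.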

Resource Requirements

  • Single GPU (A100/V100)
  • 4-6 weeks development time
  • ~100GB storage
  • Initial prototype phase

Background

PhD in Applied Mathematics with a focus on stochastic processes and optimization. Experience in complex-system modeling and market dynamics.

Seeking

  • GPU access for initial prototyping
  • Technical feedback on implementation approach
  • Collaboration opportunities

Detailed technical proposal and implementation plan available upon request.
