I am using the distill-1.5b model, and since I only have 4 L20 GPUs, I modified some parameters and am training the GRPO model on the NuminaMath-TIR dataset. However, I noticed that the loss stays at 0, and I am not sure what is wrong with my configuration. I have made sure the software versions match those in setup.py, and I also updated TRL and transformers to the latest main branch. The specific logs and training configuration are below. Is this normal, and if not, how do I fix it?
train config:
```
# Model arguments
model_name_or_path: /home/base-model/deepseek-r1-distill-qwen-1.5b
model_revision: main
torch_dtype: bfloat16
# Num processes is less by 1 as vLLM is using 1 GPU
num_processes: 3
# GRPO trainer config
gradient_accumulation_steps: 2
num_generations: 3
```
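For context, here is a minimal sketch of how I understand the reported GRPO loss to come about (my own reading, not the actual TRL code; the tensor values, shapes, and `beta = 0.04` are placeholder assumptions). Since the policy/old-policy ratio is 1 in value at the first update and the advantages are group-normalized to roughly zero mean, the displayed loss is essentially just the small KL term, even though the gradient is nonzero:

```python
import torch

# Rough sketch of the GRPO objective as I understand it (not the real TRL code;
# values, shapes, and beta are placeholder assumptions for illustration).
torch.manual_seed(0)

policy_logps = torch.randn(6, 16, requires_grad=True)           # per-token log-probs, (batch, completion_len)
ref_logps = policy_logps.detach() + 0.001 * torch.randn(6, 16)   # reference model, near-identical early on

# Rewards for one group of completions, normalized to mean ~0 / std ~1 per group
rewards = torch.rand(6)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

# k3-style estimator of KL(policy || ref), per token
per_token_kl = torch.exp(ref_logps - policy_logps) - (ref_logps - policy_logps) - 1

# exp(logp - logp.detach()) equals 1 in value but still carries the gradient,
# so the *value* of the policy term is just the mean advantage (~0 after group
# normalization), and the reported loss is roughly beta * KL, which starts tiny.
ratio = torch.exp(policy_logps - policy_logps.detach())
beta = 0.04  # assumed default
per_token_loss = -(ratio * advantages.unsqueeze(1) - beta * per_token_kl)

loss = per_token_loss.mean(dim=1).mean()
loss.backward()
print(loss.item(), policy_logps.grad.norm().item())  # loss ~0, grad norm clearly nonzero
```

If that reading is right, a near-zero loss alongside a nonzero grad_norm and a slowly growing kl would be expected early in training, but I would appreciate confirmation that this is what I am seeing rather than a misconfiguration.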
train log:
```
[INFO|trainer.py:2348] 2025-02-08 12:02:29,782 >> ***** Running training *****
[INFO|trainer.py:2349] 2025-02-08 12:02:29,782 >> Num examples = 72,441
[INFO|trainer.py:2350] 2025-02-08 12:02:29,782 >> Num Epochs = 1
[INFO|trainer.py:2351] 2025-02-08 12:02:29,782 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2354] 2025-02-08 12:02:29,782 >> Total train batch size (w. parallel, distributed & accumulation) = 6
[INFO|trainer.py:2355] 2025-02-08 12:02:29,782 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2356] 2025-02-08 12:02:29,782 >> Total optimization steps = 36,220
[INFO|trainer.py:2357] 2025-02-08 12:02:29,783 >> Number of trainable parameters = 1,777,088,000
{'loss': 0.0, 'grad_norm': 0.72000175680703, 'learning_rate': 2.760905577029266e-08, 'rewards/accuracy_reward': 0.26666667461395266, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.6777778208255768, 'rewards/cosine_scaled_reward': -0.022902203630656003, 'reward': 0.921542277932167, 'reward_std': 0.871876309812069, 'completion_length': 876.4000122070313, 'kl': 0.00035610198974609373, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.8210723493263515, 'learning_rate': 5.521811154058532e-08, 'rewards/accuracy_reward': 0.10000000298023223, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.6333333641290665, 'rewards/cosine_scaled_reward': -0.23128306418657302, 'reward': 0.5020502872765065, 'reward_std': 0.43509662076830863, 'completion_length': 884.033349609375, 'kl': 0.0006114959716796875, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.6075981772711617, 'learning_rate': 8.282716731087798e-08, 'rewards/accuracy_reward': 0.1666666716337204, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.5555555850267411, 'rewards/cosine_scaled_reward': -0.16871370139997452, 'reward': 0.5535085469484329, 'reward_std': 0.6925141368061304, 'completion_length': 886.1666809082031, 'kl': 0.0005586624145507812, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.7033610775329348, 'learning_rate': 1.1043622308117064e-07, 'rewards/accuracy_reward': 0.1666666716337204, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.6888889163732529, 'rewards/cosine_scaled_reward': -0.17193117612041534, 'reward': 0.6836243975907564, 'reward_std': 0.7369554199278354, 'completion_length': 892.0000122070312, 'kl': 0.00048828125, 'epoch': 0.0}
...
{'loss': 0.0001, 'grad_norm': 0.6114522070289464, 'learning_rate': 1.049144119271121e-06, 'rewards/accuracy_reward': 0.3000000089406967, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.7333333641290665, 'rewards/cosine_scaled_reward': -0.05265774726867676, 'reward': 0.9806756511330604, 'reward_std': 0.8146779596805572, 'completion_length': 926.8666748046875, 'kl': 0.001399993896484375, 'epoch': 0.01}
{'loss': 0.0001, 'grad_norm': 0.6375849273871735, 'learning_rate': 1.0767531750414136e-06, 'rewards/accuracy_reward': 0.1666666716337204, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.7111111462116242, 'rewards/cosine_scaled_reward': -0.14114616215229034, 'reward': 0.736631666123867, 'reward_std': 0.7692775622010231, 'completion_length': 937.2000122070312, 'kl': 0.001470184326171875, 'epoch': 0.01}
{'loss': 0.0001, 'grad_norm': 0.7375909133054507, 'learning_rate': 1.1043622308117063e-06, 'rewards/accuracy_reward': 0.36666667759418486, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.844444477558136, 'rewards/cosine_scaled_reward': 0.036993000144138935, 'reward': 1.2481041848659515, 'reward_std': 1.0289975732564927, 'completion_length': 829.4000122070313, 'kl': 0.0028339385986328124, 'epoch': 0.01}