Hugging Face TRL GRPO loss is always zero

I started a GRPO training run with this script:

from datasets import load_dataset
import json
dataset_id = "/data/cy/LLMlable/chat/train_data/train0513_grpo.json"


train_dataset = load_dataset("json",data_files=dataset_id,split="train")
def make_conversation(example):
    return {
        "prompt": [
            {"role": "user", "content": example["instruction"]},
        ],
    }
train_dataset = train_dataset.map(make_conversation)


train_dataset = train_dataset.remove_columns(["input", "instruction"])
print(train_dataset)


import torch
from transformers import AutoModelForCausalLM,AutoTokenizer
import os
os.environ["CUDA_HOME"] = '/usr/local/cuda-12.5'
# Add the CUDA bin directory to PATH
os.environ["PATH"] = f"/usr/local/cuda-12.5/bin:{os.environ['PATH']}"
model_id = "/data/cy/LLM/LLaMA-Factory-main/export/test0516"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    use_cache=False
)
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    
)

model.gradient_checkpointing_enable()
model = get_peft_model(model, lora_config)

model.print_trainable_parameters()

import re
import psycopg2
import json
import hashlib
import random
rw_path = '/data/cy/LLMlable/grpo/grpo/reward_stack.json'
with open(rw_path,'r') as f:
    rw_dict = json.load(f)
host = "127.0.0.1"
port = "9926"
dbname = "stack"
user = "cy"
password = "SDUcy-202215106"
connection = psycopg2.connect(
    host=host,
    port=port,
    dbname=dbname,
    user=user,
    password=password
)
cursor = connection.cursor()
hintdict = {
    'hash join': 'set enable_hashjoin=',
    'merge join': 'set enable_mergejoin=',
    'nested loop join': 'set enable_nestloop=',
    'index only scan': 'set enable_indexonlyscan=',
    'sequential scan': 'set enable_seqscan=', 
    'index scan': 'set enable_indexscan='
}
sqlpath = '/data/cy/LLMlable/chat/stack/all/'


def getCostPlan(sql,cur):
    cur.execute("explain (COSTS) "+sql)
    rows = cur.fetchall()
    return rows


def read_sql_file(file_path):
    with open(file_path, 'r') as file:
        sql = file.read()
    return sql

# def format_reward(completions, **kwargs):
#     """Reward function that checks if the completion has a specific format."""
#     pattern=r"<think>[\s\S]*?<\/think>\s*hint:\s*\{\s*[\s\S]*?\s*\}\s*"
#     completion_contents = [completion[0]["content"] for completion in completions]
#     for i in completion_contents:
#         print(i)
#     matches = [re.match(pattern, content) for content in completion_contents]
#     rewards_list = [1.0 if match else 0.0 for match in matches]
#     # rewards_list = [1.0 if match else 0.0+random.random() for match in matches]
#     print("f1:",rewards_list)
#     return [1.0 if match else 0.0 for match in matches]

def format_reward2(completions, **kwargs):
    completion_contents = [completion[0]["content"] for completion in completions]
    rw_list=[]
    for content in completion_contents:
        try:
            content=content.split('hint:')[1].replace('}','').replace('{','').replace("\n",'').split(',')
            for term in content:
                term = term.split(':')
                hint_str = hintdict[term[0].strip()]+' '+term[1].strip()+';'
            rw_list.append(1.0)
        except:
            rw_list.append(0.0)
    print("f2:",rw_list)
    return rw_list
            

def hint_reward(completions, **kwargs):
    completion_contents = [completion[0]["content"] for completion in completions]
    rewards_list=[]
    sqlnames = kwargs['sqlname']
    # {hash join: True, merge join: False, nested loop join: True, index scan: True, sequential scan: True, index only scan: False}\n"
    for content,sqlname in zip(completion_contents,sqlnames):
        connection = psycopg2.connect(
                        host=host,
                        port=port,
                        dbname=dbname,
                        user=user,
                        password=password
                    )
        cursor = connection.cursor()
        sql_txt = read_sql_file(sqlpath+sqlname)
        try:
            content=content.split('hint:')[1].replace('}','').replace('{','').replace("\n",'').split(',')
            for term in content:
                term = term.split(':')
                hint_str = hintdict[term[0].strip()]+' '+term[1].strip()+';'
                # print(hint_str)
                cursor.execute(hint_str)
            plan_hash = hashlib.md5((str(getCostPlan(sql_txt,cursor)).encode())).hexdigest()
            rewards_list.append(rw_dict[sqlname][plan_hash])
        except:
            rewards_list.append(0.0)
        cursor.close()
    print("f3:",rewards_list)
    return rewards_list
                
       

    
from trl import GRPOConfig

# Configure training arguments using GRPOConfig
training_args = GRPOConfig(
    output_dir="V2-GRPO-test",
    learning_rate=1e-5,
    remove_unused_columns=False,  # keep extra columns (e.g. sqlname) accessible to the reward functions
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    bf16=True,
    # Parameters that control the data preprocessing
    max_completion_length=1024,  # default: 256
    num_generations=4,  # default: 8
    max_prompt_length=3072,  # default: 512
    per_device_train_batch_size=1,
    # Parameters related to reporting and saving
    report_to=["wandb"],
    logging_steps=5,
    push_to_hub=False,
    save_strategy="steps",
    save_steps=2,
    gradient_checkpointing=True,
    # processing_class=tokenizer

    
)

from trl import GRPOTrainer


trainer = GRPOTrainer(
    model=model, reward_funcs=[format_reward2,hint_reward], args=training_args, train_dataset=train_dataset
)
trainer.train()
trainer.save_model(training_args.output_dir)

The log shows that although the reward and grad_norm are not zero, the training loss is always zero. But when training finishes, the final output reports a non-zero loss. Why does this happen? Has my model actually been trained?


The log:

Dataset({
  features: ['output', 'sqlname', 'prompt'],
  num_rows: 961
})
[2025-05-16 20:49:53,716] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
{'loss': 0.0, 'grad_norm': 0.09833737462759018, 'learning_rate': 9.833333333333333e-06, 'completion_length': 242.15, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.686456960439682, 'reward': 1.686456948518753, 'reward_std': 0.5303836800158024, 'kl': 0.0004268963893991895, 'epoch': 0.02}
{'loss': 0.0, 'grad_norm': 0.08386103063821793, 'learning_rate': 9.625e-06, 'completion_length': 244.425, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.5837401330471039, 'reward': 1.583740133047104, 'reward_std': 0.6751966059207917, 'kl': 0.0005165024515008554, 'epoch': 0.04}
{'loss': 0.0, 'grad_norm': 0.07626724988222122, 'learning_rate': 9.416666666666667e-06, 'completion_length': 240.85, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.7732090130448341, 'reward': 1.7732090175151825, 'reward_std': 0.7709747180342674, 'kl': 0.0005288927190122194, 'epoch': 0.06}
{'loss': 0.0, 'grad_norm': 0.10117252171039581, 'learning_rate': 9.208333333333333e-06, 'completion_length': 235.5875, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.7216249376535415, 'reward': 1.7216249585151673, 'reward_std': 0.6284362055361271, 'kl': 0.0005364336524507962, 'epoch': 0.08}
                                                 
{'loss': 0.0, 'grad_norm': 0.08160920441150665, 'learning_rate': 9e-06, 'completion_length': 240.0375, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.789138701558113, 'reward': 1.7891387045383453, 'reward_std': 0.6136982448399066, 'kl': 0.0005334435947588645, 'epoch': 0.1}
{'loss': 0.0, 'grad_norm': 0.08980654925107956, 'learning_rate': 8.791666666666667e-06, 'completion_length': 238.0375, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.7379942715168, 'reward': 1.7379942655563354, 'reward_std': 0.6109503719955682, 'kl': 0.0005178835795959458, 'epoch': 0.12}
{'loss': 0.0, 'grad_norm': 0.09846791625022888, 'learning_rate': 8.583333333333333e-06, 'completion_length': 238.1, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.8919083803892136, 'reward': 1.8919083714485168, 'reward_std': 0.7251449711620808, 'kl': 0.0005302477686200291, 'epoch': 0.15}
{'loss': 0.0, 'grad_norm': 0.06589101254940033, 'learning_rate': 8.375e-06, 'completion_length': 228.65, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.7236788332462311, 'reward': 1.7236788332462312, 'reward_std': 0.6145793333649635, 'kl': 0.0004996116273105145, 'epoch': 0.17}
{'loss': 0.0, 'grad_norm': 0.07223747670650482, 'learning_rate': 8.166666666666668e-06, 'completion_length': 240.2375, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.5457767516374588, 'reward': 1.545776754617691, 'reward_std': 0.5292002744972706, 'kl': 0.0005244471380137838, 'epoch': 0.19}
                                                   
{'loss': 0.0, 'grad_norm': 0.09219600260257721, 'learning_rate': 7.958333333333333e-06, 'completion_length': 242.2375, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.7113161787390709, 'reward': 1.7113161981105804, 'reward_std': 0.6662813112139702, 'kl': 0.0005246120665105991, 'epoch': 0.21}
{'loss': 0.0, 'grad_norm': 0.08651315420866013, 'learning_rate': 7.75e-06, 'completion_length': 236.2625, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.7821606770157814, 'reward': 1.7821606695652008, 'reward_std': 0.699380847811699, 'kl': 0.0005228802081546746, 'epoch': 0.23}
{'loss': 0.0, 'grad_norm': 0.08899950981140137, 'learning_rate': 7.541666666666667e-06, 'completion_length': 235.2875, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.6847546353936196, 'reward': 1.6847546219825744, 'reward_std': 0.6952019453048706, 'kl': 0.0005268092500045896, 'epoch': 0.25}
{'loss': 0.0, 'grad_norm': 0.10531917959451675, 'learning_rate': 7.333333333333333e-06, 'completion_length': 244.3, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.80422792583704, 'reward': 1.8042279362678528, 'reward_std': 0.5598676132038236, 'kl': 0.0005284514045342803, 'epoch': 0.27}
{'loss': 0.0, 'grad_norm': 0.09361914545297623, 'learning_rate': 7.125e-06, 'completion_length': 239.8625, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.6599106639623642, 'reward': 1.6599106550216676, 'reward_std': 0.5838687766343356, 'kl': 0.0005130083693074994, 'epoch': 0.29}
{'loss': 0.0, 'grad_norm': 0.08656957000494003, 'learning_rate': 6.916666666666667e-06, 'completion_length': 232.825, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.7171093255281449, 'reward': 1.7171093285083772, 'reward_std': 0.5700982138514519, 'kl': 0.0005371240986278281, 'epoch': 0.31}
{'loss': 0.0, 'grad_norm': 0.09666649252176285, 'learning_rate': 6.708333333333333e-06, 'completion_length': 237.0375, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.8844376325607299, 'reward': 1.8844376266002656, 'reward_std': 0.7487143039703369, 'kl': 0.0005386197401094251, 'epoch': 0.33}
{'loss': 0.0, 'grad_norm': 0.1058727279305458, 'learning_rate': 6.5000000000000004e-06, 'completion_length': 233.45, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.6797272339463234, 'reward': 1.6797272205352782, 'reward_std': 0.6311022289562971, 'kl': 0.0005338093222235329, 'epoch': 0.35}
{'loss': 0.0, 'grad_norm': 0.10363585501909256, 'learning_rate': 6.291666666666667e-06, 'completion_length': 235.6, 'rewards/format_reward2': 0.9875, 'rewards/hint_reward': 0.9810375615954399, 'reward': 1.968537563085556, 'reward_std': 0.786603014729917, 'kl': 0.0005278794225887396, 'epoch': 0.37}
                                                    
{'loss': 0.0, 'grad_norm': 0.09924133121967316, 'learning_rate': 6.083333333333333e-06, 'completion_length': 243.5625, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.6545234180986881, 'reward': 1.6545234203338623, 'reward_std': 0.6480654507875443, 'kl': 0.0005347360609448514, 'epoch': 0.4}
{'loss': 0.0, 'grad_norm': 0.09262462705373764, 'learning_rate': 5.8750000000000005e-06, 'completion_length': 235.1625, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.7519623577594757, 'reward': 1.7519623696804048, 'reward_std': 0.5781447453424334, 'kl': 0.0005310224223649129, 'epoch': 0.42}
{'loss': 0.0, 'grad_norm': 0.10632356256246567, 'learning_rate': 5.666666666666667e-06, 'completion_length': 241.025, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.9946554750204086, 'reward': 1.9946554541587829, 'reward_std': 0.6944329828023911, 'kl': 0.0005232438081293367, 'epoch': 0.44}
{'loss': 0.0, 'grad_norm': 0.0872630849480629, 'learning_rate': 5.458333333333333e-06, 'completion_length': 241.9125, 'rewards/format_reward2': 0.975, 'rewards/hint_reward': 0.875643989443779, 'reward': 1.8506439924240112, 'reward_std': 0.7810414537787438, 'kl': 0.000516696619160939, 'epoch': 0.46}
{'loss': 0.0, 'grad_norm': 0.11046557128429413, 'learning_rate': 5.2500000000000006e-06, 'completion_length': 244.7, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.6702523469924927, 'reward': 1.6702523291110993, 'reward_std': 0.6210360750555992, 'kl': 0.0005639222246827558, 'epoch': 0.48}
{'loss': 0.0, 'grad_norm': 0.10285649448633194, 'learning_rate': 5.041666666666667e-06, 'completion_length': 237.0, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.7387900158762932, 'reward': 1.7387900352478027, 'reward_std': 0.6748368714004755, 'kl': 0.000546219055831898, 'epoch': 0.5}
{'loss': 0.0, 'grad_norm': 0.09859994053840637, 'learning_rate': 4.833333333333333e-06, 'completion_length': 240.4, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.47963430136442187, 'reward': 1.4796343088150024, 'reward_std': 0.6493100047111511, 'kl': 0.0005311622546287254, 'epoch': 0.52}
{'loss': 0.0, 'grad_norm': 0.09000447392463684, 'learning_rate': 4.625000000000001e-06, 'completion_length': 247.2875, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.7482958018779755, 'reward': 1.7482958018779755, 'reward_std': 0.5870657205581665, 'kl': 0.0005441085188067519, 'epoch': 0.54}
{'loss': 0.0, 'grad_norm': 0.11419157683849335, 'learning_rate': 4.416666666666667e-06, 'completion_length': 244.825, 'rewards/format_reward2': 0.9875, 'rewards/hint_reward': 0.47554641366004946, 'reward': 1.4630464136600494, 'reward_std': 0.5691350907087326, 'kl': 0.0005496730314916931, 'epoch': 0.56}
{'loss': 0.0, 'grad_norm': 0.09484495967626572, 'learning_rate': 4.208333333333333e-06, 'completion_length': 239.05, 'rewards/format_reward2': 0.9875, 'rewards/hint_reward': 0.728359380364418, 'reward': 1.7158593893051148, 'reward_std': 0.6138381041586399, 'kl': 0.0005266572508844547, 'epoch': 0.58}
{'loss': 0.0, 'grad_norm': 0.09359879046678543, 'learning_rate': 4.000000000000001e-06, 'completion_length': 249.35, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.7664372086524963, 'reward': 1.7664372265338897, 'reward_std': 0.5865088984370231, 'kl': 0.00055030612857081, 'epoch': 0.6}
{'loss': 0.0, 'grad_norm': 0.11166156828403473, 'learning_rate': 3.7916666666666666e-06, 'completion_length': 240.1125, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.6625506669282913, 'reward': 1.662550675868988, 'reward_std': 0.5158184990286827, 'kl': 0.0005491750562214293, 'epoch': 0.62}
{'loss': 0.0, 'grad_norm': 0.08442356437444687, 'learning_rate': 3.5833333333333335e-06, 'completion_length': 244.85, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.6975038453936577, 'reward': 1.6975038468837738, 'reward_std': 0.6479316338896751, 'kl': 0.0005403956805821508, 'epoch': 0.65}
{'loss': 0.0, 'grad_norm': 0.10485571622848511, 'learning_rate': 3.3750000000000003e-06, 'completion_length': 235.9125, 'rewards/format_reward2': 0.9875, 'rewards/hint_reward': 0.6261389210820199, 'reward': 1.6136389076709747, 'reward_std': 0.7126941554248333, 'kl': 0.0005452550933114253, 'epoch': 0.67}
                                                    
{'loss': 0.0, 'grad_norm': 0.09902128577232361, 'learning_rate': 3.1666666666666667e-06, 'completion_length': 242.7375, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.6681374207139015, 'reward': 1.668137401342392, 'reward_std': 0.596645655343309, 'kl': 0.000601339059357997, 'epoch': 0.69}
{'loss': 0.0, 'grad_norm': 0.09568341821432114, 'learning_rate': 2.9583333333333335e-06, 'completion_length': 237.2625, 'rewards/format_reward2': 0.9875, 'rewards/hint_reward': 0.8848576739430427, 'reward': 1.8723577022552491, 'reward_std': 0.7480486467480659, 'kl': 0.0005603199097095057, 'epoch': 0.71}
{'loss': 0.0, 'grad_norm': 0.10015729814767838, 'learning_rate': 2.7500000000000004e-06, 'completion_length': 236.7125, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.7749321073293686, 'reward': 1.7749320983886718, 'reward_std': 0.6368187621235848, 'kl': 0.0005502376181539149, 'epoch': 0.73}
{'loss': 0.0, 'grad_norm': 0.11633292585611343, 'learning_rate': 2.5416666666666668e-06, 'completion_length': 241.0125, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.7991459637880325, 'reward': 1.7991459548473359, 'reward_std': 0.7132919415831566, 'kl': 0.0005318806826835499, 'epoch': 0.75}
{'loss': 0.0, 'grad_norm': 0.10906606912612915, 'learning_rate': 2.3333333333333336e-06, 'completion_length': 242.3875, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.7020174562931061, 'reward': 1.702017456293106, 'reward_std': 0.5429980456829071, 'kl': 0.0005515233700862154, 'epoch': 0.77}
{'loss': 0.0, 'grad_norm': 0.12483209371566772, 'learning_rate': 2.125e-06, 'completion_length': 242.7625, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.6960464447736741, 'reward': 1.6960464298725129, 'reward_std': 0.5978964403271675, 'kl': 0.0005584630722296424, 'epoch': 0.79}
{'loss': 0.0, 'grad_norm': 0.11350805312395096, 'learning_rate': 1.916666666666667e-06, 'completion_length': 239.95, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.632606404274702, 'reward': 1.6326064050197602, 'reward_std': 0.6396717444062233, 'kl': 0.0005443066140287556, 'epoch': 0.81}
{'loss': 0.0, 'grad_norm': 0.10226169228553772, 'learning_rate': 1.7083333333333334e-06, 'completion_length': 242.6, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.8669334009289742, 'reward': 1.8669333696365356, 'reward_std': 0.6857013538479805, 'kl': 0.0005572879337705672, 'epoch': 0.83}
{'loss': 0.0, 'grad_norm': 0.09106634557247162, 'learning_rate': 1.5e-06, 'completion_length': 240.65, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.7581478208303452, 'reward': 1.7581478118896485, 'reward_std': 0.6972749218344688, 'kl': 0.0005479031256982126, 'epoch': 0.85}

{'loss': 0.0, 'grad_norm': 0.12189562618732452, 'learning_rate': 1.2916666666666669e-06, 'completion_length': 232.925, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.8270958885550499, 'reward': 1.827095890045166, 'reward_std': 0.7235485404729843, 'kl': 0.0006022430883604102, 'epoch': 0.87}
{'loss': 0.0, 'grad_norm': 0.10620136559009552, 'learning_rate': 1.0833333333333335e-06, 'completion_length': 232.5375, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.881186357140541, 'reward': 1.8811863839626313, 'reward_std': 0.6683604184538126, 'kl': 0.000592767032503616, 'epoch': 0.89}
{'loss': 0.0, 'grad_norm': 0.08141123503446579, 'learning_rate': 8.75e-07, 'completion_length': 234.375, 'rewards/format_reward2': 0.9875, 'rewards/hint_reward': 0.5150951027870179, 'reward': 1.5025951206684112, 'reward_std': 0.5887993931770324, 'kl': 0.0005780345440143719, 'epoch': 0.92}
{'loss': 0.0, 'grad_norm': 0.11707420647144318, 'learning_rate': 6.666666666666667e-07, 'completion_length': 234.9375, 'rewards/format_reward2': 0.9875, 'rewards/hint_reward': 0.6512034311890602, 'reward': 1.6387034356594086, 'reward_std': 0.6193146544974297, 'kl': 0.0006213093263795599, 'epoch': 0.94}
{'loss': 0.0, 'grad_norm': 0.10141977667808533, 'learning_rate': 4.583333333333333e-07, 'completion_length': 237.45, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.6260499149560929, 'reward': 1.6260499119758607, 'reward_std': 0.7117581814527512, 'kl': 0.0005733551224693656, 'epoch': 0.96}
                                           
{'loss': 0.0, 'grad_norm': 0.10804533213376999, 'learning_rate': 2.5000000000000004e-07, 'completion_length': 235.6875, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.5845954522490502, 'reward': 1.5845954477787019, 'reward_std': 0.6889735788106919, 'kl': 0.0005797993973828853, 'epoch': 0.98}
{'loss': 0.0, 'grad_norm': 0.09732513874769211, 'learning_rate': 4.166666666666667e-08, 'completion_length': 239.2375, 'rewards/format_reward2': 1.0, 'rewards/hint_reward': 0.744280219078064, 'reward': 1.7442802131175994, 'reward_std': 0.6373357579112053, 'kl': 0.0005513303462066687, 'epoch': 1.0}
{'train_runtime': 32920.2439, 'train_samples_per_second': 0.029, 'train_steps_per_second': 0.007, 'train_loss': 2.1668560384568992e-05, 'epoch': 1.0}


env:

Package                   Version
------------------------- --------------
accelerate                1.6.0
aiohappyeyeballs          2.6.1
aiohttp                   3.11.18
aiosignal                 1.3.2
annotated-types           0.7.0
antlr4-python3-runtime    4.13.2
anyio                     4.9.0
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 3.0.0
async-lru                 2.0.5
attrs                     25.3.0
babel                     2.17.0
beautifulsoup4            4.13.4
bitsandbytes              0.45.3
bleach                    6.2.0
certifi                   2025.4.26
cffi                      1.17.1
charset-normalizer        3.4.2
click                     8.2.0
comm                      0.2.2
datasets                  3.6.0
debugpy                   1.8.14
decorator                 5.2.1
deepspeed                 0.16.7
defusedxml                0.7.1
dill                      0.3.8
docker-pycreds            0.4.0
einops                    0.8.1
executing                 2.2.0
fastjsonschema            2.21.1
filelock                  3.18.0
fqdn                      1.5.1
frozenlist                1.6.0
fsspec                    2025.3.0
gitdb                     4.0.12
GitPython                 3.1.44
h11                       0.16.0
hf-xet                    1.1.0
hjson                     3.1.0
httpcore                  1.0.9
httpx                     0.28.1
huggingface-hub           0.31.1
idna                      3.10
ipykernel                 6.29.5
ipython                   9.2.0
ipython_pygments_lexers   1.1.1
ipywidgets                8.1.7
isoduration               20.11.0
jedi                      0.19.2
Jinja2                    3.1.6
json5                     0.12.0
jsonpointer               3.0.0
jsonschema                4.23.0
jsonschema-specifications 2025.4.1
jupyter                   1.1.1
jupyter_client            8.6.3
jupyter-console           6.6.3
jupyter_core              5.7.2
jupyter-events            0.12.0
jupyter-lsp               2.2.5
jupyter_server            2.15.0
jupyter_server_terminals  0.5.3
jupyterlab                4.4.2
jupyterlab_pygments       0.3.0
jupyterlab_server         2.27.3
jupyterlab_widgets        3.0.15
latex2sympy2_extended     1.10.1
loralib                   0.1.2
markdown-it-py            3.0.0
MarkupSafe                3.0.2
math-verify               0.7.0
matplotlib-inline         0.1.7
mdurl                     0.1.2
mistune                   3.1.3
mo-dots                   10.672.25036
mo-future                 7.672.25036
mo-imports                7.672.25036
mo-parsing                8.675.25037
mo-sql-parsing            11.675.25037
mpmath                    1.3.0
msgpack                   1.1.0
multidict                 6.4.3
multiprocess              0.70.16
nbclient                  0.10.2
nbconvert                 7.16.6
nbformat                  5.10.4
nest-asyncio              1.6.0
networkx                  3.4.2
ninja                     1.11.1.4
notebook                  7.4.2
notebook_shim             0.2.4
numpy                     2.2.5
nvidia-cublas-cu12        12.4.5.8
nvidia-cuda-cupti-cu12    12.4.127
nvidia-cuda-nvrtc-cu12    12.4.127
nvidia-cuda-runtime-cu12  12.4.127
nvidia-cudnn-cu12         9.1.0.70
nvidia-cufft-cu12         11.2.1.3
nvidia-cufile-cu12        1.11.1.6
nvidia-curand-cu12        10.3.5.147
nvidia-cusolver-cu12      11.6.1.9
nvidia-cusparse-cu12      12.3.1.170
nvidia-cusparselt-cu12    0.6.2
nvidia-ml-py              12.570.86
nvidia-nccl-cu12          2.21.5
nvidia-nvjitlink-cu12     12.4.127
nvidia-nvtx-cu12          12.4.127
nvitop                    1.5.0
overrides                 7.7.0
packaging                 25.0
pandas                    2.2.3
pandocfilters             1.5.1
parso                     0.8.4
peft                      0.15.2
pexpect                   4.9.0
pip                       25.1
platformdirs              4.3.8
prometheus_client         0.21.1
prompt_toolkit            3.0.51
propcache                 0.3.1
protobuf                  6.30.2
psutil                    7.0.0
psycopg2                  2.9.10
ptyprocess                0.7.0
pure_eval                 0.2.3
py-cpuinfo                9.0.0
pyarrow                   20.0.0
pycparser                 2.22
pydantic                  2.11.4
pydantic_core             2.33.2
Pygments                  2.19.1
python-dateutil           2.9.0.post0
python-json-logger        3.3.0
pytz                      2025.2
PyYAML                    6.0.2
pyzmq                     26.4.0
referencing               0.36.2
regex                     2024.11.6
requests                  2.32.3
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rich                      14.0.0
rpds-py                   0.24.0
safetensors               0.5.3
Send2Trash                1.8.3
sentry-sdk                2.27.0
setproctitle              1.3.6
setuptools                78.1.1
six                       1.17.0
smmap                     5.0.2
sniffio                   1.3.1
soupsieve                 2.7
stack-data                0.6.3
sympy                     1.13.1
tensorboardX              2.6.2.2
terminado                 0.18.1
tinycss2                  1.4.0
tokenizers                0.21.1
torch                     2.6.0
tornado                   6.4.2
tqdm                      4.67.1
traitlets                 5.14.3
transformers              4.51.3
triton                    3.2.0
trl                       0.14.0
types-python-dateutil     2.9.0.20241206
typing_extensions         4.13.2
typing-inspection         0.4.0
tzdata                    2025.2
uri-template              1.3.0
urllib3                   2.4.0
wandb                     0.19.11
wcwidth                   0.2.13
webcolors                 24.11.1
webencodings              0.5.1
websocket-client          1.8.0
wheel                     0.45.1
widgetsnbextension        4.0.14
xxhash                    3.5.0
yarl                      1.20.0

There seems to be a significant peculiarity in GRPO loss.


Hi there! After taking a look at your logs and the TRL library’s GRPO implementation, I think I can help explain what’s happening with your “loss always zero” issue:

Your model IS being trained, and here’s why:

  1. Math checks out: I calculated 0.0005513303462066687 (your final KL) * 0.04 (default beta) = 2.205321384826675e-05, which almost exactly matches your reported final train_loss: 2.1668560384568992e-05. This confirms the loss is being calculated correctly as beta * KL.

  2. Non-zero gradients: Your grad_norm values are consistently non-zero (ranging ~0.06-0.12), which proves your model parameters are being updated throughout training.

  3. Working as designed: when num_iterations=1 (the default) in the GRPO trainer, the importance ratio equals 1 at the logged step and the group-normalized advantages average to zero, so the only contribution to the reported loss value is the KL term. This matches what qgallouedec explained in the GitHub thread John6666 referenced. FYI, the loss type defaults to bnpo in the trainer file (see the implementation); a toy sketch right below illustrates the effect.
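
To make point 3 concrete, here is a toy sketch, not TRL's actual code, of why the logged loss value collapses to roughly beta * KL even though gradients stay non-zero. All tensors and numbers are made up for illustration:

import torch

# Group-normalized advantages average to ~0 within a group by construction.
advantages = torch.tensor([1.2, -0.3, -0.9, 0.0])

# Stand-in per-token log-probs of the current policy.
logps = torch.randn(4, requires_grad=True)

beta = 0.04
kl = torch.tensor(5.5e-4)  # roughly the per-token KL your logs report

# With num_iterations=1 the "old" policy is the current policy detached,
# so the importance ratio is exactly 1 in value but still carries a gradient.
ratio = torch.exp(logps - logps.detach())

loss = -(ratio * advantages - beta * kl).mean()
loss.backward()

print(loss.item())  # ~= beta * kl = 2.2e-05, which displays as 0.0 once rounded to 4 decimals
print(logps.grad)   # non-zero: the policy term still drives parameter updates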

Some observations about the HF implementation:

  1. Logging inconsistency: your intermediate logs show loss: 0.0 while the final train_loss is non-zero. This is most likely a precision issue: the per-step loss is on the order of 2e-05 and the Trainer rounds logged values to four decimal places, whereas the final train_loss is reported at full precision.

  2. Reference model behavior: according to the DeepSeek paper (Algorithm 1, page 14), the reference model should be periodically updated to match the current policy model, which would reset the KL to zero. Your logs show the KL steadily increasing from ~0.0004 to ~0.0006 without any resets, which suggests the HF implementation might differ from the paper in this respect (see the hedged config sketch after this list).

  3. Default settings: HF’s TRL library uses num_iterations=1 by default, which simplifies the GRPO family objective function considerably.
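
On the reference-model point, I believe newer TRL releases (after 0.14.0) expose options on GRPOConfig to periodically sync the reference model with the policy. The parameter names below are my assumption based on those releases, so please verify them against your installed version:

from trl import GRPOConfig

# Hedged sketch: periodically refresh the reference model so the KL penalty is
# measured against a recent policy snapshot instead of the initial weights.
# These parameter names are assumed from newer TRL releases and may not exist in 0.14.0.
training_args = GRPOConfig(
    output_dir="V2-GRPO-test",
    sync_ref_model=True,        # enable periodic reference-model updates
    ref_model_sync_steps=64,    # illustrative sync interval
    ref_model_mixup_alpha=0.6,  # mix ratio between current policy and old reference
)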

My assessment:

Is your model being trained? Yes, definitely.
Is the training optimal? Probably not - there appear to be some differences between the TRL implementation and the full iterative approach described in the DeepSeek paper.

Suggestions:

  1. Try setting num_iterations=4, or some other value greater than 1, in your GRPOConfig to see if it improves training dynamics (a minimal sketch follows this list).

  2. Check out the insights from qgallouedec in the GitHub thread - their mathematical explanation of why loss starts at zero and increases during training is spot on.
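
For what it's worth, a minimal sketch of that change, assuming your installed TRL exposes num_iterations on GRPOConfig (this may require upgrading beyond 0.14.0):

from trl import GRPOConfig

# Minimal sketch of the suggested change (illustrative values, not tuned).
# With num_iterations > 1, each batch of generations is reused for several
# optimization steps, so the policy term no longer cancels out of the loss value.
training_args = GRPOConfig(
    output_dir="V2-GRPO-test",
    num_iterations=4,   # assumption: available in newer TRL releases
    beta=0.04,          # KL coefficient; 0.04 is the default used in the math above
    num_generations=4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
)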

Disclaimer: I am a fellow community member who has been studying the papers and the TRL library's implementation closely, not one of its authors. My intention is to help; if I've misunderstood anything, I welcome corrections from those more familiar with the codebase!

Hope this helps clarify things!


Thanks a lot!


Thanks for the detailed explanation! Appreciate it!
