I am using the distill-1.5b model, and since I only have 4 L20 GPUs, I modified some parameters and am training the GRPO model on the NuminaMath-TIR dataset. However, I noticed that the loss stays at 0, and I am not sure what is wrong with my configuration. I have made sure the software versions match those in setup.py, and I also updated TRL and transformers to the latest main branch. The specific logs and training configuration are below. Is this normal, and if not, how do I fix it?
train config:
```
# Model arguments
model_name_or_path: /home/base-model/deepseek-r1-distill-qwen-1.5b
model_revision: main
torch_dtype: bfloat16
# Num processes is less by 1 as vLLM is using 1 GPU
num_processes: 3
# GRPO trainer config
gradient_accumulation_steps: 2
num_generations: 3
```
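For context, here is a minimal sketch of how I understand the reported GRPO loss to come about (my own reading, not the actual TRL code; the tensor values, shapes, and `beta = 0.04` are placeholder assumptions). Since the policy/old-policy ratio is 1 in value at the first update and the advantages are group-normalized to roughly zero mean, the displayed loss is essentially just the small KL term, even though the gradient is nonzero:

```python
import torch

# Rough sketch of the GRPO objective as I understand it (not the real TRL code;
# values, shapes, and beta are placeholder assumptions for illustration).
torch.manual_seed(0)

policy_logps = torch.randn(6, 16, requires_grad=True)           # per-token log-probs, (batch, completion_len)
ref_logps = policy_logps.detach() + 0.001 * torch.randn(6, 16)   # reference model, near-identical early on

# Rewards for one group of completions, normalized to mean ~0 / std ~1 per group
rewards = torch.rand(6)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

# k3-style estimator of KL(policy || ref), per token
per_token_kl = torch.exp(ref_logps - policy_logps) - (ref_logps - policy_logps) - 1

# exp(logp - logp.detach()) equals 1 in value but still carries the gradient,
# so the *value* of the policy term is just the mean advantage (~0 after group
# normalization), and the reported loss is roughly beta * KL, which starts tiny.
ratio = torch.exp(policy_logps - policy_logps.detach())
beta = 0.04  # assumed default
per_token_loss = -(ratio * advantages.unsqueeze(1) - beta * per_token_kl)

loss = per_token_loss.mean(dim=1).mean()
loss.backward()
print(loss.item(), policy_logps.grad.norm().item())  # loss ~0, grad norm clearly nonzero
```

If that reading is right, a near-zero loss alongside a nonzero grad_norm and a slowly growing kl would be expected early in training, but I would appreciate confirmation that this is what I am seeing rather than a misconfiguration.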
train log:
```
[INFO|trainer.py:2348] 2025-02-08 12:02:29,782 >> ***** Running training *****
[INFO|trainer.py:2349] 2025-02-08 12:02:29,782 >> Num examples = 72,441
[INFO|trainer.py:2350] 2025-02-08 12:02:29,782 >> Num Epochs = 1
[INFO|trainer.py:2351] 2025-02-08 12:02:29,782 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2354] 2025-02-08 12:02:29,782 >> Total train batch size (w. parallel, distributed & accumulation) = 6
[INFO|trainer.py:2355] 2025-02-08 12:02:29,782 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2356] 2025-02-08 12:02:29,782 >> Total optimization steps = 36,220
[INFO|trainer.py:2357] 2025-02-08 12:02:29,783 >> Number of trainable parameters = 1,777,088,000
{'loss': 0.0, 'grad_norm': 0.72000175680703, 'learning_rate': 2.760905577029266e-08, 'rewards/accuracy_reward': 0.26666667461395266, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.6777778208255768, 'rewards/cosine_scaled_reward': -0.022902203630656003, 'reward': 0.921542277932167, 'reward_std': 0.871876309812069, 'completion_length': 876.4000122070313, 'kl': 0.00035610198974609373, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.8210723493263515, 'learning_rate': 5.521811154058532e-08, 'rewards/accuracy_reward': 0.10000000298023223, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.6333333641290665, 'rewards/cosine_scaled_reward': -0.23128306418657302, 'reward': 0.5020502872765065, 'reward_std': 0.43509662076830863, 'completion_length': 884.033349609375, 'kl': 0.0006114959716796875, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.6075981772711617, 'learning_rate': 8.282716731087798e-08, 'rewards/accuracy_reward': 0.1666666716337204, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.5555555850267411, 'rewards/cosine_scaled_reward': -0.16871370139997452, 'reward': 0.5535085469484329, 'reward_std': 0.6925141368061304, 'completion_length': 886.1666809082031, 'kl': 0.0005586624145507812, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.7033610775329348, 'learning_rate': 1.1043622308117064e-07, 'rewards/accuracy_reward': 0.1666666716337204, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.6888889163732529, 'rewards/cosine_scaled_reward': -0.17193117612041534, 'reward': 0.6836243975907564, 'reward_std': 0.7369554199278354, 'completion_length': 892.0000122070312, 'kl': 0.00048828125, 'epoch': 0.0}
...
{'loss': 0.0001, 'grad_norm': 0.6114522070289464, 'learning_rate': 1.049144119271121e-06, 'rewards/accuracy_reward': 0.3000000089406967, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.7333333641290665, 'rewards/cosine_scaled_reward': -0.05265774726867676, 'reward': 0.9806756511330604, 'reward_std': 0.8146779596805572, 'completion_length': 926.8666748046875, 'kl': 0.001399993896484375, 'epoch': 0.01}
{'loss': 0.0001, 'grad_norm': 0.6375849273871735, 'learning_rate': 1.0767531750414136e-06, 'rewards/accuracy_reward': 0.1666666716337204, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.7111111462116242, 'rewards/cosine_scaled_reward': -0.14114616215229034, 'reward': 0.736631666123867, 'reward_std': 0.7692775622010231, 'completion_length': 937.2000122070312, 'kl': 0.001470184326171875, 'epoch': 0.01}
{'loss': 0.0001, 'grad_norm': 0.7375909133054507, 'learning_rate': 1.1043622308117063e-06, 'rewards/accuracy_reward': 0.36666667759418486, 'rewards/format_reward': 0.0, 'rewards/reasoning_steps_reward': 0.844444477558136, 'rewards/cosine_scaled_reward': 0.036993000144138935, 'reward': 1.2481041848659515, 'reward_std': 1.0289975732564927, 'completion_length': 829.4000122070313, 'kl': 0.0028339385986328124, 'epoch': 0.01}