Issue with DeepSpeed + QLoRA + SFT: RuntimeError: grad can be implicitly created only for scalar outputs

I ran into a problem while running SFT on deepseekv2-prover 7B.
My script:

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
from trl import AutoModelForCausalLMWithValueHead
from datasets import load_dataset
import torch

# print(torch.cuda.device_count())

# Load the tokenizer
model_name = "/data/cy/LLMlable/grpo/v2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",              # Use NF4 (NormalFloat4) quantization
    bnb_4bit_use_double_quant=True,         # Enable double quantization for better memory efficiency
    bnb_4bit_compute_dtype=torch.bfloat16   # Compute dtype (bfloat16 is recommended)
)

# Configure LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # target_modules=["q_proj", "v_proj"]
)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    trust_remote_code=True,
    quantization_config=quantization_config,
    # peft_config=lora_config,
    # device_map="auto"
)

model.enable_input_require_grads()
# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
# Load the dataset
dataset = load_dataset("/data/cy/LLMlable/grpo/dataset", split="train")

# Training arguments
training_args = SFTConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    logging_steps=10,
    save_steps=500,
    bf16=True,
    gradient_checkpointing=True,
    output_dir="./results",
    num_train_epochs=1,
    logging_dir="./logs",
    # max_length=1024,
)

# Create the Trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
)

# Start training
trainer.train()

My DeepSpeed config (Accelerate YAML):

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The log:

(hf_trl) cy@amax:/data/cy/LLMlable/grpo/sft$ accelerate launch --config_file /data/cy/LLMlable/grpo/trl-0.14-release/examples/accelerate_configs/deepspeed_zero3.yaml sft.py 
[2025-05-13 21:21:19,495] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
W0513 21:21:21.928000 3461535 /data/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/run.py:792] 
W0513 21:21:21.928000 3461535 /data/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/run.py:792] *****************************************
W0513 21:21:21.928000 3461535 /data/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0513 21:21:21.928000 3461535 /data/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/run.py:792] *****************************************
[2025-05-13 21:21:25,798] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-13 21:21:25,917] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-13 21:21:28,891] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-05-13 21:21:29,109] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-05-13 21:21:29,109] [INFO] [comm.py:700:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
`rope_scaling`'s factor field must be a float >= 1, got 16
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
`rope_scaling`'s factor field must be a float >= 1, got 16
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.74s/it]
WARNING:root:A <class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'> model is loaded from '/data/cy/LLMlable/grpo/v2', and no v_head weight is found. This IS expected if you are not resuming PPO training.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.81s/it]
WARNING:root:A <class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'> model is loaded from '/data/cy/LLMlable/grpo/v2', and no v_head weight is found. This IS expected if you are not resuming PPO training.
trainable params: 3,932,160 || all params: 6,914,301,953 || trainable%: 0.0569
trainable params: 3,932,160 || all params: 6,914,301,953 || trainable%: 0.0569
/data/cy/LLMlable/grpo/sft/sft.py:64: FutureWarning: `tokenizer` is deprecated and removed starting from version 0.16.0 for `SFTTrainer.__init__`. Use `processing_class` instead.
  trainer = SFTTrainer(
[rank1]:[W513 21:21:45.901396698 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
/data/cy/LLMlable/grpo/sft/sft.py:64: FutureWarning: `tokenizer` is deprecated and removed starting from version 0.16.0 for `SFTTrainer.__init__`. Use `processing_class` instead.
  trainer = SFTTrainer(
[rank0]:[W513 21:21:45.059297766 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:300: UserWarning: You passed a processing_class with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `processing_class.padding_side = 'right'` to your code.
  warnings.warn(
WARNING:accelerate.utils.other:Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:300: UserWarning: You passed a processing_class with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `processing_class.padding_side = 'right'` to your code.
  warnings.warn(
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
[2025-05-13 21:21:46,625] [WARNING] [engine.py:1338:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
Parameter Offload: Total persistent parameters: 4186113 in 183 params
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.

wandb: Tracking run with wandb version 0.19.11
wandb: Run data is saved locally in /data/cy/LLMlable/grpo/sft/wandb/
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run ./results
wandb: ⭐️ View project at
wandb: 🚀 
  0%|                                                                                                                                    | 0/987 [00:00<?, ?it/s][rank1]: Traceback (most recent call last):
[rank1]:   File "/data/cy/LLMlable/grpo/sft/sft.py", line 73, in <module>
[rank1]:     trainer.train()
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank1]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 3782, in training_step
[rank1]:     self.accelerator.backward(loss, **kwargs)
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2446, in backward
[rank1]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 266, in backward
[rank1]:     self.engine.backward(loss, **kwargs)
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank1]:     ret_val = func(*args, **kwargs)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2216, in backward
[rank1]:     self._do_optimizer_backward(loss, retain_graph)
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2162, in _do_optimizer_backward
[rank1]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank1]:     ret_val = func(*args, **kwargs)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2280, in backward
[rank1]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank1]:     scaled_loss.backward(retain_graph=retain_graph)
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/_tensor.py", line 626, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 340, in backward
[rank1]:     grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
[rank1]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 198, in _make_grads
[rank1]:     raise RuntimeError(
[rank1]: RuntimeError: grad can be implicitly created only for scalar outputs
Traceback (most recent call last):
  File "/data/cy/LLMlable/grpo/sft/sft.py", line 73, in <module>
    trainer.train()
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 3782, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2446, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 266, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2216, in backward
    self._do_optimizer_backward(loss, retain_graph)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2162, in _do_optimizer_backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2280, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/_tensor.py", line 626, in backward
    torch.autograd.backward(
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 340, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 198, in _make_grads
    raise RuntimeError(
RuntimeError: grad can be implicitly created only for scalar outputs
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/cy/LLMlable/grpo/sft/sft.py", line 73, in <module>
[rank0]:     trainer.train()
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 3782, in training_step
[rank0]:     self.accelerator.backward(loss, **kwargs)
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2446, in backward
[rank0]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 266, in backward
[rank0]:     self.engine.backward(loss, **kwargs)
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2216, in backward
[rank0]:     self._do_optimizer_backward(loss, retain_graph)
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2162, in _do_optimizer_backward
[rank0]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2280, in backward
[rank0]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank0]:     scaled_loss.backward(retain_graph=retain_graph)
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/_tensor.py", line 626, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 340, in backward
[rank0]:     grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 198, in _make_grads
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: grad can be implicitly created only for scalar outputs
W0513 21:21:55.117000 3461535 /data/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3461734 closing signal SIGTERM
E0513 21:21:55.434000 3461535 /data/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 3461735) of binary: /home/cy/anaconda3/envs/hf_trl/bin/python
Traceback (most recent call last):
  File "/home/cy/anaconda3/envs/hf_trl/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
    args.func(args)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1196, in launch_command
    deepspeed_launcher(args)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 878, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-13_21:21:55
  host      : 
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3461735)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

My env:

8 × RTX 3080 (10 GB)
Package                   Version
------------------------- --------------
accelerate                1.6.0
aiohappyeyeballs          2.6.1
aiohttp                   3.11.18
aiosignal                 1.3.2
annotated-types           0.7.0
anyio                     4.9.0
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 3.0.0
async-lru                 2.0.5
attrs                     25.3.0
babel                     2.17.0
beautifulsoup4            4.13.4
bitsandbytes              0.45.3
bleach                    6.2.0
certifi                   2025.4.26
cffi                      1.17.1
charset-normalizer        3.4.2
click                     8.2.0
comm                      0.2.2
datasets                  3.6.0
debugpy                   1.8.14
decorator                 5.2.1
deepspeed                 0.16.7
defusedxml                0.7.1
dill                      0.3.8
docker-pycreds            0.4.0
einops                    0.8.1
executing                 2.2.0
fastjsonschema            2.21.1
filelock                  3.18.0
fqdn                      1.5.1
frozenlist                1.6.0
fsspec                    2025.3.0
gitdb                     4.0.12
GitPython                 3.1.44
h11                       0.16.0
hf-xet                    1.1.0
hjson                     3.1.0
httpcore                  1.0.9
httpx                     0.28.1
huggingface-hub           0.31.1
idna                      3.10
ipykernel                 6.29.5
ipython                   9.2.0
ipython_pygments_lexers   1.1.1
ipywidgets                8.1.7
isoduration               20.11.0
jedi                      0.19.2
Jinja2                    3.1.6
json5                     0.12.0
jsonpointer               3.0.0
jsonschema                4.23.0
jsonschema-specifications 2025.4.1
jupyter                   1.1.1
jupyter_client            8.6.3
jupyter-console           6.6.3
jupyter_core              5.7.2
jupyter-events            0.12.0
jupyter-lsp               2.2.5
jupyter_server            2.15.0
jupyter_server_terminals  0.5.3
jupyterlab                4.4.2
jupyterlab_pygments       0.3.0
jupyterlab_server         2.27.3
jupyterlab_widgets        3.0.15
loralib                   0.1.2
markdown-it-py            3.0.0
MarkupSafe                3.0.2
matplotlib-inline         0.1.7
mdurl                     0.1.2
mistune                   3.1.3
mpmath                    1.3.0
msgpack                   1.1.0
multidict                 6.4.3
multiprocess              0.70.16
nbclient                  0.10.2
nbconvert                 7.16.6
nbformat                  5.10.4
nest-asyncio              1.6.0
networkx                  3.4.2
ninja                     1.11.1.4
notebook                  7.4.2
notebook_shim             0.2.4
numpy                     2.2.5
nvidia-cublas-cu12        12.4.5.8
nvidia-cuda-cupti-cu12    12.4.127
nvidia-cuda-nvrtc-cu12    12.4.127
nvidia-cuda-runtime-cu12  12.4.127
nvidia-cudnn-cu12         9.1.0.70
nvidia-cufft-cu12         11.2.1.3
nvidia-cufile-cu12        1.11.1.6
nvidia-curand-cu12        10.3.5.147
nvidia-cusolver-cu12      11.6.1.9
nvidia-cusparse-cu12      12.3.1.170
nvidia-cusparselt-cu12    0.6.2
nvidia-ml-py              12.570.86
nvidia-nccl-cu12          2.21.5
nvidia-nvjitlink-cu12     12.4.127
nvidia-nvtx-cu12          12.4.127
nvitop                    1.5.0
overrides                 7.7.0
packaging                 25.0
pandas                    2.2.3
pandocfilters             1.5.1
parso                     0.8.4
peft                      0.15.2
pexpect                   4.9.0
pip                       25.1
platformdirs              4.3.8
prometheus_client         0.21.1
prompt_toolkit            3.0.51
propcache                 0.3.1
protobuf                  6.30.2
psutil                    7.0.0
ptyprocess                0.7.0
pure_eval                 0.2.3
py-cpuinfo                9.0.0
pyarrow                   20.0.0
pycparser                 2.22
pydantic                  2.11.4
pydantic_core             2.33.2
Pygments                  2.19.1
python-dateutil           2.9.0.post0
python-json-logger        3.3.0
pytz                      2025.2
PyYAML                    6.0.2
pyzmq                     26.4.0
referencing               0.36.2
regex                     2024.11.6
requests                  2.32.3
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rich                      14.0.0
rpds-py                   0.24.0
safetensors               0.5.3
Send2Trash                1.8.3
sentry-sdk                2.27.0
setproctitle              1.3.6
setuptools                78.1.1
six                       1.17.0
smmap                     5.0.2
sniffio                   1.3.1
soupsieve                 2.7
stack-data                0.6.3
sympy                     1.13.1
terminado                 0.18.1
tinycss2                  1.4.0
tokenizers                0.21.1
torch                     2.6.0
tornado                   6.4.2
tqdm                      4.67.1
traitlets                 5.14.3
transformers              4.51.3
triton                    3.2.0
trl                       0.14.0
types-python-dateutil     2.9.0.20241206
typing_extensions         4.13.2
typing-inspection         0.4.0
tzdata                    2025.2
uri-template              1.3.0
urllib3                   2.4.0
wandb                     0.19.11
wcwidth                   0.2.13
webcolors                 24.11.1
webencodings              0.5.1
websocket-client          1.8.0
wheel                     0.45.1
widgetsnbextension        4.0.14
xxhash                    3.5.0
yarl                      1.20.0

How can I figure out where it went wrong?
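One check I'm considering (just a sketch, assuming the problem is the loss shape rather than DeepSpeed itself): run a single batch through the model outside the Trainer and look at what comes back. If I read the TRL source correctly, AutoModelForCausalLMWithValueHead returns a plain (lm_logits, loss, value) tuple rather than a ModelOutput, so the Trainer might be picking up a non-scalar tensor as the loss.

import torch

# Sketch: inspect the model's raw outputs for one batch, on a single GPU,
# without the Trainer or DeepSpeed. `model` and `tokenizer` are the objects
# built in the script above.
batch = tokenizer("test input", return_tensors="pt")
batch = {k: v.to(next(model.parameters()).device) for k, v in batch.items()}
outputs = model(**batch, labels=batch["input_ids"])
for i, t in enumerate(outputs):
    if torch.is_tensor(t):
        # backward() can only be called implicitly on 0-dim (scalar) tensors
        print(i, tuple(t.shape))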


The dataset:

dataset_info:
  features:
  - name: source
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: num_turns
    dtype: int64
  splits:
  - name: train
    num_bytes: 71908734
    num_examples: 15806
  - name: test
    num_bytes: 929564
    num_examples: 200
  download_size: 37644679
  dataset_size: 72838298
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
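
So each row looks roughly like this (the values below are illustrative placeholders, not actual rows from my dataset):

{
    "source": "...",
    "messages": [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
    ],
    "num_turns": 2,
}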


It seems that this error is quite rare in Transformers.

https://stackoverflow.com/questions/63924567/gpt2-on-hugging-facepytorch-transformers-runtimeerror-grad-can-be-implicitly
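
The error itself just means that backward() was called on a tensor with more than one element; PyTorch can only create the initial gradient implicitly for a scalar. A minimal, model-free repro:

import torch

x = torch.ones(3, requires_grad=True)
loss = x * 2.0  # non-scalar "loss" of shape (3,)
try:
    loss.backward()  # RuntimeError: grad can be implicitly created only for scalar outputs
except RuntimeError as e:
    print(e)
loss.sum().backward()  # reducing to a scalar first works

So somewhere in your pipeline the loss handed to DeepSpeed is apparently not 0-dimensional. One thing that stands out in your script (an observation, not a verified fix): AutoModelForCausalLMWithValueHead is TRL's PPO-oriented wrapper, while SFTTrainer is normally given a plain AutoModelForCausalLM; it might be worth trying the base model class.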


Thanks a lot! I will try it.
