Issue with DeepSpeed + QLoRA + SFT: RuntimeError: grad can be implicitly created only for scalar outputs

I ran into a problem while running SFT on deepseekv2-prover 7B.
My script:

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
from trl import AutoModelForCausalLMWithValueHead
from datasets import load_dataset
import torch

# print(torch.cuda.device_count())

# Load the tokenizer
model_name = "/data/cy/LLMlable/grpo/v2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",              # Use NF4 (NormalFloat4) quantization
    bnb_4bit_use_double_quant=True,         # Enable double quantization for better memory efficiency
    bnb_4bit_compute_dtype=torch.bfloat16   # Compute dtype (bfloat16 is recommended)
)

# Configure LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # target_modules=["q_proj", "v_proj"]
)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    trust_remote_code=True,
    quantization_config=quantization_config,
    # peft_config=lora_config,
    # device_map="auto"
)

model.enable_input_require_grads()
# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
# Load the dataset
dataset = load_dataset("/data/cy/LLMlable/grpo/dataset", split="train")

# Training arguments
training_args = SFTConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    logging_steps=10,
    save_steps=500,
    bf16=True,
    gradient_checkpointing=True,
    output_dir="./results",
    num_train_epochs=1,
    logging_dir="./logs",
    # max_length=1024,
)

# Create the Trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
)

# Start training
trainer.train()

My DeepSpeed config (Accelerate YAML):

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The log:

(hf_trl) cy@amax:/data/cy/LLMlable/grpo/sft$ accelerate launch --config_file /data/cy/LLMlable/grpo/trl-0.14-release/examples/accelerate_configs/deepspeed_zero3.yaml sft.py 
[2025-05-13 21:21:19,495] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
W0513 21:21:21.928000 3461535 /data/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/run.py:792] 
W0513 21:21:21.928000 3461535 /data/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/run.py:792] *****************************************
W0513 21:21:21.928000 3461535 /data/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0513 21:21:21.928000 3461535 /data/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/run.py:792] *****************************************
[2025-05-13 21:21:25,798] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-13 21:21:25,917] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-05-13 21:21:28,891] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-05-13 21:21:29,109] [INFO] [comm.py:669:init_distributed] cdb=None
[2025-05-13 21:21:29,109] [INFO] [comm.py:700:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
`rope_scaling`'s factor field must be a float >= 1, got 16
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
`rope_scaling`'s factor field must be a float >= 1, got 16
`rope_scaling`'s beta_fast field must be a float, got 32
`rope_scaling`'s beta_slow field must be a float, got 1
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.74s/it]
WARNING:root:A <class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'> model is loaded from '/data/cy/LLMlable/grpo/v2', and no v_head weight is found. This IS expected if you are not resuming PPO training.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.81s/it]
WARNING:root:A <class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'> model is loaded from '/data/cy/LLMlable/grpo/v2', and no v_head weight is found. This IS expected if you are not resuming PPO training.
trainable params: 3,932,160 || all params: 6,914,301,953 || trainable%: 0.0569
trainable params: 3,932,160 || all params: 6,914,301,953 || trainable%: 0.0569
/data/cy/LLMlable/grpo/sft/sft.py:64: FutureWarning: `tokenizer` is deprecated and removed starting from version 0.16.0 for `SFTTrainer.__init__`. Use `processing_class` instead.
  trainer = SFTTrainer(
[rank1]:[W513 21:21:45.901396698 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
/data/cy/LLMlable/grpo/sft/sft.py:64: FutureWarning: `tokenizer` is deprecated and removed starting from version 0.16.0 for `SFTTrainer.__init__`. Use `processing_class` instead.
  trainer = SFTTrainer(
[rank0]:[W513 21:21:45.059297766 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:300: UserWarning: You passed a processing_class with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `processing_class.padding_side = 'right'` to your code.
  warnings.warn(
WARNING:accelerate.utils.other:Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:300: UserWarning: You passed a processing_class with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `processing_class.padding_side = 'right'` to your code.
  warnings.warn(
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
[2025-05-13 21:21:46,625] [WARNING] [engine.py:1338:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
Parameter Offload: Total persistent parameters: 4186113 in 183 params
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.

wandb: Tracking run with wandb version 0.19.11
wandb: Run data is saved locally in /data/cy/LLMlable/grpo/sft/wandb/
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run ./results
wandb: ⭐️ View project at
wandb: 🚀 
  0%|                                                                                                                                    | 0/987 [00:00<?, ?it/s][rank1]: Traceback (most recent call last):
[rank1]:   File "/data/cy/LLMlable/grpo/sft/sft.py", line 73, in <module>
[rank1]:     trainer.train()
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank1]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 3782, in training_step
[rank1]:     self.accelerator.backward(loss, **kwargs)
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2446, in backward
[rank1]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 266, in backward
[rank1]:     self.engine.backward(loss, **kwargs)
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank1]:     ret_val = func(*args, **kwargs)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2216, in backward
[rank1]:     self._do_optimizer_backward(loss, retain_graph)
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2162, in _do_optimizer_backward
[rank1]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank1]:     ret_val = func(*args, **kwargs)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2280, in backward
[rank1]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank1]:     scaled_loss.backward(retain_graph=retain_graph)
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/_tensor.py", line 626, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 340, in backward
[rank1]:     grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
[rank1]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 198, in _make_grads
[rank1]:     raise RuntimeError(
[rank1]: RuntimeError: grad can be implicitly created only for scalar outputs
Traceback (most recent call last):
  File "/data/cy/LLMlable/grpo/sft/sft.py", line 73, in <module>
    trainer.train()
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 3782, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2446, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 266, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2216, in backward
    self._do_optimizer_backward(loss, retain_graph)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2162, in _do_optimizer_backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2280, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/_tensor.py", line 626, in backward
    torch.autograd.backward(
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 340, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 198, in _make_grads
    raise RuntimeError(
RuntimeError: grad can be implicitly created only for scalar outputs
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/cy/LLMlable/grpo/sft/sft.py", line 73, in <module>
[rank0]:     trainer.train()
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/transformers/trainer.py", line 3782, in training_step
[rank0]:     self.accelerator.backward(loss, **kwargs)
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2446, in backward
[rank0]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 266, in backward
[rank0]:     self.engine.backward(loss, **kwargs)
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2216, in backward
[rank0]:     self._do_optimizer_backward(loss, retain_graph)
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2162, in _do_optimizer_backward
[rank0]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2280, in backward
[rank0]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank0]:     scaled_loss.backward(retain_graph=retain_graph)
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/_tensor.py", line 626, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 340, in backward
[rank0]:     grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 198, in _make_grads
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: grad can be implicitly created only for scalar outputs
W0513 21:21:55.117000 3461535 /data/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3461734 closing signal SIGTERM
E0513 21:21:55.434000 3461535 /data/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 3461735) of binary: /home/cy/anaconda3/envs/hf_trl/bin/python
Traceback (most recent call last):
  File "/home/cy/anaconda3/envs/hf_trl/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
    args.func(args)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1196, in launch_command
    deepspeed_launcher(args)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 878, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cy/anaconda3/envs/hf_trl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-13_21:21:55
  host      : 
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3461735)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

My env:

8 × RTX 3080 (10 GB)
Package                   Version
------------------------- --------------
accelerate                1.6.0
aiohappyeyeballs          2.6.1
aiohttp                   3.11.18
aiosignal                 1.3.2
annotated-types           0.7.0
anyio                     4.9.0
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 3.0.0
async-lru                 2.0.5
attrs                     25.3.0
babel                     2.17.0
beautifulsoup4            4.13.4
bitsandbytes              0.45.3
bleach                    6.2.0
certifi                   2025.4.26
cffi                      1.17.1
charset-normalizer        3.4.2
click                     8.2.0
comm                      0.2.2
datasets                  3.6.0
debugpy                   1.8.14
decorator                 5.2.1
deepspeed                 0.16.7
defusedxml                0.7.1
dill                      0.3.8
docker-pycreds            0.4.0
einops                    0.8.1
executing                 2.2.0
fastjsonschema            2.21.1
filelock                  3.18.0
fqdn                      1.5.1
frozenlist                1.6.0
fsspec                    2025.3.0
gitdb                     4.0.12
GitPython                 3.1.44
h11                       0.16.0
hf-xet                    1.1.0
hjson                     3.1.0
httpcore                  1.0.9
httpx                     0.28.1
huggingface-hub           0.31.1
idna                      3.10
ipykernel                 6.29.5
ipython                   9.2.0
ipython_pygments_lexers   1.1.1
ipywidgets                8.1.7
isoduration               20.11.0
jedi                      0.19.2
Jinja2                    3.1.6
json5                     0.12.0
jsonpointer               3.0.0
jsonschema                4.23.0
jsonschema-specifications 2025.4.1
jupyter                   1.1.1
jupyter_client            8.6.3
jupyter-console           6.6.3
jupyter_core              5.7.2
jupyter-events            0.12.0
jupyter-lsp               2.2.5
jupyter_server            2.15.0
jupyter_server_terminals  0.5.3
jupyterlab                4.4.2
jupyterlab_pygments       0.3.0
jupyterlab_server         2.27.3
jupyterlab_widgets        3.0.15
loralib                   0.1.2
markdown-it-py            3.0.0
MarkupSafe                3.0.2
matplotlib-inline         0.1.7
mdurl                     0.1.2
mistune                   3.1.3
mpmath                    1.3.0
msgpack                   1.1.0
multidict                 6.4.3
multiprocess              0.70.16
nbclient                  0.10.2
nbconvert                 7.16.6
nbformat                  5.10.4
nest-asyncio              1.6.0
networkx                  3.4.2
ninja                     1.11.1.4
notebook                  7.4.2
notebook_shim             0.2.4
numpy                     2.2.5
nvidia-cublas-cu12        12.4.5.8
nvidia-cuda-cupti-cu12    12.4.127
nvidia-cuda-nvrtc-cu12    12.4.127
nvidia-cuda-runtime-cu12  12.4.127
nvidia-cudnn-cu12         9.1.0.70
nvidia-cufft-cu12         11.2.1.3
nvidia-cufile-cu12        1.11.1.6
nvidia-curand-cu12        10.3.5.147
nvidia-cusolver-cu12      11.6.1.9
nvidia-cusparse-cu12      12.3.1.170
nvidia-cusparselt-cu12    0.6.2
nvidia-ml-py              12.570.86
nvidia-nccl-cu12          2.21.5
nvidia-nvjitlink-cu12     12.4.127
nvidia-nvtx-cu12          12.4.127
nvitop                    1.5.0
overrides                 7.7.0
packaging                 25.0
pandas                    2.2.3
pandocfilters             1.5.1
parso                     0.8.4
peft                      0.15.2
pexpect                   4.9.0
pip                       25.1
platformdirs              4.3.8
prometheus_client         0.21.1
prompt_toolkit            3.0.51
propcache                 0.3.1
protobuf                  6.30.2
psutil                    7.0.0
ptyprocess                0.7.0
pure_eval                 0.2.3
py-cpuinfo                9.0.0
pyarrow                   20.0.0
pycparser                 2.22
pydantic                  2.11.4
pydantic_core             2.33.2
Pygments                  2.19.1
python-dateutil           2.9.0.post0
python-json-logger        3.3.0
pytz                      2025.2
PyYAML                    6.0.2
pyzmq                     26.4.0
referencing               0.36.2
regex                     2024.11.6
requests                  2.32.3
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rich                      14.0.0
rpds-py                   0.24.0
safetensors               0.5.3
Send2Trash                1.8.3
sentry-sdk                2.27.0
setproctitle              1.3.6
setuptools                78.1.1
six                       1.17.0
smmap                     5.0.2
sniffio                   1.3.1
soupsieve                 2.7
stack-data                0.6.3
sympy                     1.13.1
terminado                 0.18.1
tinycss2                  1.4.0
tokenizers                0.21.1
torch                     2.6.0
tornado                   6.4.2
tqdm                      4.67.1
traitlets                 5.14.3
transformers              4.51.3
triton                    3.2.0
trl                       0.14.0
types-python-dateutil     2.9.0.20241206
typing_extensions         4.13.2
typing-inspection         0.4.0
tzdata                    2025.2
uri-template              1.3.0
urllib3                   2.4.0
wandb                     0.19.11
wcwidth                   0.2.13
webcolors                 24.11.1
webencodings              0.5.1
websocket-client          1.8.0
wheel                     0.45.1
widgetsnbextension        4.0.14
xxhash                    3.5.0
yarl                      1.20.0

How can I figure out where it went wrong?
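One check I'm considering (just a sketch, assuming the problem is the loss shape rather than DeepSpeed itself): run a single batch through the model outside the Trainer and look at what comes back. If I read the TRL source correctly, AutoModelForCausalLMWithValueHead returns a plain (lm_logits, loss, value) tuple rather than a ModelOutput, so the Trainer might be picking up a non-scalar tensor as the loss.

import torch

# Sketch: inspect the model's raw outputs for one batch, on a single GPU,
# without the Trainer or DeepSpeed. `model` and `tokenizer` are the objects
# built in the script above.
batch = tokenizer("test input", return_tensors="pt")
batch = {k: v.to(next(model.parameters()).device) for k, v in batch.items()}
outputs = model(**batch, labels=batch["input_ids"])
for i, t in enumerate(outputs):
    if torch.is_tensor(t):
        # backward() can only be called implicitly on 0-dim (scalar) tensors
        print(i, tuple(t.shape))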


The dataset:

dataset_info:
  features:
  - name: source
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: num_turns
    dtype: int64
  splits:
  - name: train
    num_bytes: 71908734
    num_examples: 15806
  - name: test
    num_bytes: 929564
    num_examples: 200
  download_size: 37644679
  dataset_size: 72838298
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
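
So each row looks roughly like this (the values below are illustrative placeholders, not actual rows from my dataset):

{
    "source": "...",
    "messages": [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
    ],
    "num_turns": 2,
}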


It seems that this error is quite rare in Transformers.

https://stackoverflow.com/questions/63924567/gpt2-on-hugging-facepytorch-transformers-runtimeerror-grad-can-be-implicitly
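
The error itself just means that backward() was called on a tensor with more than one element; PyTorch can only create the initial gradient implicitly for a scalar. A minimal, model-free repro:

import torch

x = torch.ones(3, requires_grad=True)
loss = x * 2.0  # non-scalar "loss" of shape (3,)
try:
    loss.backward()  # RuntimeError: grad can be implicitly created only for scalar outputs
except RuntimeError as e:
    print(e)
loss.sum().backward()  # reducing to a scalar first works

So somewhere in your pipeline the loss handed to DeepSpeed is apparently not 0-dimensional. One thing that stands out in your script (an observation, not a verified fix): AutoModelForCausalLMWithValueHead is TRL's PPO-oriented wrapper, while SFTTrainer is normally given a plain AutoModelForCausalLM; it might be worth trying the base model class.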


Thanks a lot! I will try it.
