I get the following error when running TRL's SFTTrainer for a supervised fine-tuning task:
thread: [74,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [75,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [76,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [77,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [78,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [79,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [80,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [81,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [82,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [83,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [84,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [85,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [86,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [87,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "/home/zy1130/agentscope/examples/small_llms_nscc/small_llms_finetuning_ToolBenchPlanner.py", line 127, in <module>
main()
File "/home/zy1130/agentscope/examples/small_llms_nscc/small_llms_finetuning_ToolBenchPlanner.py", line 82, in main
dialog_agent = Finetune_DialogAgent(
File "/home/zy1130/agentscope/src/agentscope/agents/agent.py", line 82, in __call__
instance = super().__call__(*args, **kwargs)
File "/home/zy1130/agentscope/examples/small_llms_nscc/finetune_dialogagent.py", line 49, in __init__
super().__init__(
File "/home/zy1130/agentscope/src/agentscope/agents/dialog_agent.py", line 45, in __init__
super().__init__(
File "/home/zy1130/agentscope/src/agentscope/agents/agent.py", line 195, in __init__
self.model = load_model_by_config_name(model_config_name)
File "/home/zy1130/agentscope/src/agentscope/models/__init__.py", line 117, in load_model_by_config_name
return _get_model_wrapper(model_type=model_type)(**kwargs)
File "/home/zy1130/agentscope/examples/small_llms_nscc/huggingface_model.py", line 107, in __init__
self.model = self.fine_tune_training(
File "/home/zy1130/agentscope/examples/small_llms_nscc/huggingface_model.py", line 664, in fine_tune_training
trainer.train()
File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 440, in train
output = super().train(*args, **kwargs)
File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
return inner_training_loop(
File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/transformers/trainer.py", line 3238, in training_step
loss = self.compute_loss(model, inputs)
File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/transformers/trainer.py", line 3264, in compute_loss
outputs = model(**inputs)
File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1164, in forward
outputs = self.model(
File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 934, in forward
cache_position = torch.arange(
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
By default, SFTTrainer uses multiple GPUs if they are available. If I force it to use only one GPU with
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
the error goes away. If I set per_device_train_batch_size
to 1, the error also goes away. I tried two different fine-tuning datasets from Hugging Face, and both gave this error. Any idea why this occurs and how to address it? Debugging is hard because when I set os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
the training simply won't start and does not give any error. In addition, the error occurs inside SFTTrainer's code, which is not exposed to users.
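
Since the assertion `srcIndex < srcSelectDimSize` comes from an index-select kernel, one possible cause is a token id that falls outside the embedding table (for example a padding id that only appears when rows are batched together). This is only a guess, not a confirmed diagnosis. Below is a minimal CPU-only sketch of how such ids could be located; the helper name `find_out_of_range_ids` and the toy data are hypothetical and not part of TRL or transformers.

```python
from typing import Iterable, List, Sequence, Tuple


def find_out_of_range_ids(
    batches: Iterable[Sequence[Sequence[int]]],
    num_embeddings: int,
) -> List[Tuple[int, int, int, int]]:
    """Return (batch_idx, row_idx, position, token_id) for every input id
    that could not be looked up in an embedding table with
    `num_embeddings` rows. Intended for input ids only, not labels."""
    bad = []
    for b, batch in enumerate(batches):
        for r, row in enumerate(batch):
            for p, tok in enumerate(row):
                if tok < 0 or tok >= num_embeddings:
                    bad.append((b, r, p, tok))
    return bad


# Toy data (made up): an embedding table with 32000 rows and a pad id of
# 32000, as could happen if a pad token were added to the tokenizer
# without resizing the model's embeddings.
toy_batches = [
    [[1, 15, 27, 2], [1, 99, 2, 32000]],  # second row is padded
    [[1, 5, 2]],
]
print(find_out_of_range_ids(toy_batches, num_embeddings=32000))
```

With real data, the batches would be the tokenized `input_ids` converted to plain lists, and `num_embeddings` would be the row count of the model's input embedding layer (assuming a transformers model, `model.get_input_embeddings().num_embeddings`). Running the check on CPU avoids the asynchronous CUDA error reporting mentioned in the log.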