CUDA error that only occurs on multiple GPUs when doing batched training

I get the following error when running TRL's SFTTrainer for a supervised fine-tuning task:

../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [20,0,0], thread: [74,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[... the same assertion is repeated for threads [75,0,0] through [95,0,0] ...]
Traceback (most recent call last):
  File "/home/zy1130/agentscope/examples/small_llms_nscc/small_llms_finetuning_ToolBenchPlanner.py", line 127, in <module>
    main()
  File "/home/zy1130/agentscope/examples/small_llms_nscc/small_llms_finetuning_ToolBenchPlanner.py", line 82, in main
    dialog_agent = Finetune_DialogAgent(
  File "/home/zy1130/agentscope/src/agentscope/agents/agent.py", line 82, in __call__
    instance = super().__call__(*args, **kwargs)
  File "/home/zy1130/agentscope/examples/small_llms_nscc/finetune_dialogagent.py", line 49, in __init__
    super().__init__(
  File "/home/zy1130/agentscope/src/agentscope/agents/dialog_agent.py", line 45, in __init__
    super().__init__(
  File "/home/zy1130/agentscope/src/agentscope/agents/agent.py", line 195, in __init__
    self.model = load_model_by_config_name(model_config_name)
  File "/home/zy1130/agentscope/src/agentscope/models/__init__.py", line 117, in load_model_by_config_name
    return _get_model_wrapper(model_type=model_type)(**kwargs)
  File "/home/zy1130/agentscope/examples/small_llms_nscc/huggingface_model.py", line 107, in __init__
    self.model = self.fine_tune_training(
  File "/home/zy1130/agentscope/examples/small_llms_nscc/huggingface_model.py", line 664, in fine_tune_training
    trainer.train()
  File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 440, in train
    output = super().train(*args, **kwargs)
  File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/transformers/trainer.py", line 3238, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/transformers/trainer.py", line 3264, in compute_loss
    outputs = model(**inputs)
  File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1164, in forward
    outputs = self.model(
  File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zy1130/anaconda3/envs/agentscope/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 934, in forward
    cache_position = torch.arange(
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

By default, SFTTrainer will use multiple GPUs if they are available. If I force it to use only one GPU with

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

the error goes away. The error also goes away if I set per_device_train_batch_size to 1. I tried two different fine-tuning datasets from Hugging Face and both gave this error. Any idea why this occurs and how to address it? Debugging has been hard: with os.environ['CUDA_LAUNCH_BLOCKING'] = '1' set, training simply won't start and gives no error at all, and the failure happens inside SFTTrainer's code, which isn't exposed to users.
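
For context, here is a stripped-down sketch of how I apply the two workarounds. The model name, dataset name, and text column are placeholders, not my actual configuration, and my real code builds the trainer inside an AgentScope wrapper:

import os

# Workaround 1: restrict training to a single GPU.
# This has to run before torch/transformers are imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder for the model I actually fine-tune
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = load_dataset("some/finetuning-dataset", split="train")  # placeholder dataset

training_args = TrainingArguments(
    output_dir="./sft_output",
    # Workaround 2: the error also disappears with a per-device batch size of 1.
    per_device_train_batch_size=1,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",  # placeholder column name
)
trainer.train()

With either workaround in place training runs to completion, so I suspect the problem only surfaces when batches are sharded across GPUs.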