Getting torch.cuda.HalfTensor error while using DeepSpeed with Accelerate

I am trying to train a GPT-2 large language model using the DeepSpeed plugin of Accelerate. Every time the script reaches this line,
loss = model(torch.stack(batch['input_ids']).T, labels=torch.stack(batch['input_ids']).T).loss
I get the following error:
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding) inputs_embeds = self.wte(input_ids)
I also checked the data type of torch.stack(batch['input_ids']).T right before the call; it is torch.long (int64).
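
A minimal sketch of the relevant training step is below. The data pipeline shown here is a toy stand-in for illustration, not my actual one, but the training loop has the same shape:

    import torch
    from torch.utils.data import DataLoader
    from accelerate import Accelerator
    from transformers import GPT2TokenizerFast, GPT2LMHeadModel

    accelerator = Accelerator()                      # picks up the DeepSpeed / fp16 settings from the accelerate config
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
    tokenizer.pad_token = tokenizer.eos_token        # GPT-2 has no pad token by default
    model = GPT2LMHeadModel.from_pretrained("gpt2-large")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # toy stand-in data: each item is a dict with a fixed-length list of token ids
    texts = ["an example sentence"] * 8
    items = [tokenizer(t, padding="max_length", max_length=32, truncation=True) for t in texts]
    train_dataloader = DataLoader(items, batch_size=4)

    model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

    model.train()
    for batch in train_dataloader:
        # batch['input_ids'] is collated as a list of per-position tensors, hence the stack + transpose
        loss = model(torch.stack(batch['input_ids']).T,
                     labels=torch.stack(batch['input_ids']).T).loss   # <-- line that raises the error
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()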

For reference, I am providing my Accelerate config file and a more detailed error stack trace.
Config file:

deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero_stage: 2
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
use_cpu: false
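
The script is launched with the standard Accelerate CLI against this config (the file names here are placeholders for my actual ones):

    accelerate launch --config_file accelerate_config.yaml train.py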

Detailed error:

  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-48d0be80-7221-441b-911a-c6bb2ef50dc5/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", in forward
    transformer_outputs = self.transformer(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-48d0be80-7221-441b-911a-c6bb2ef50dc5/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-48d0be80-7221-441b-911a-c6bb2ef50dc5/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 843, in forward
    inputs_embeds = self.wte(input_ids)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-48d0be80-7221-441b-911a-c6bb2ef50dc5/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-48d0be80-7221-441b-911a-c6bb2ef50dc5/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 160, in forward
    return F.embedding(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-48d0be80-7221-441b-911a-c6bb2ef50dc5/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)
  • Can you share your code or a reusable example so this can be tested?
  • Also, can you print what is actually being passed as input when you run it, and check its dtype? (A quick check is sketched below.)
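
For example, a hypothetical helper along these lines, called inside the training loop, would show what the model actually receives (the names mirror the snippet posted above):

    import torch

    def check_batch(batch):
        # hypothetical helper: report what will be fed to the embedding layer
        input_ids = torch.stack(batch['input_ids']).T
        print(input_ids.dtype, input_ids.device, input_ids.shape)   # expect torch.int64, i.e. a LongTensor
        return input_ids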

Hi @hsuyab, thanks for replying. Yes, I checked by printing: the input is torch.long. I think that because mixed precision is set to fp16, DeepSpeed is converting the input_ids to torch.cuda.HalfTensor somewhere. I cannot share the code as it is my company's internal code, but I can share the output of ds_report. Could you check whether this looks okay:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-484d6d7c-f559-4582-aa77-2c164373abce/lib/python3.9/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-484d6d7c-f559-4582-aa77-2c164373abce/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.8.3, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

I'm having this exact problem. Were you able to solve it?

I solved it by changing the accelerate config to not use fp16 (still ZeRO stage 2).
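
Concretely, that is a one-line change to the accelerate config posted above (the rest of the file stays the same):

    mixed_precision: 'no'   # previously fp16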

Yeah, it works after removing fp16, but then out-of-memory errors start appearing; fp16 would be required for larger batch sizes. I am also stuck and unable to find a solution. I will probably stop using Accelerate and use the DeepSpeed library directly with PyTorch.
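
For anyone considering the same route, the direct DeepSpeed integration looks roughly like this. This is only a sketch: ds_config.json is a placeholder file that would hold the ZeRO stage 2 / fp16 / gradient clipping settings, and train_dataloader is the same dataloader as in the earlier snippet:

    import torch
    import deepspeed
    from transformers import GPT2LMHeadModel

    model = GPT2LMHeadModel.from_pretrained("gpt2-large")

    # DeepSpeed wraps the model and optimizer and reads ZeRO/fp16 settings from the JSON config
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        optimizer=torch.optim.AdamW(model.parameters(), lr=1e-5),
        config="ds_config.json",
    )

    for batch in train_dataloader:
        input_ids = torch.stack(batch['input_ids']).T.to(model_engine.device)
        loss = model_engine(input_ids, labels=input_ids).loss
        model_engine.backward(loss)   # DeepSpeed handles fp16 loss scaling
        model_engine.step()

The script is then started with the DeepSpeed launcher (deepspeed train.py) rather than accelerate launch.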

Hi, I fixed this problem by changing my DeepSpeed config file. The key is to set auto_cast to false in the fp16 section; after that, the LongTensor inputs are no longer converted to HalfTensor. See the example below.
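
In a raw DeepSpeed JSON config the relevant part would look something like this (a sketch: the file name and the batch-size value are placeholders, and only the parts relevant here are shown):

    {
      "train_micro_batch_size_per_gpu": 4,
      "gradient_accumulation_steps": 1,
      "gradient_clipping": 1.0,
      "zero_optimization": { "stage": 2 },
      "fp16": {
        "enabled": true,
        "auto_cast": false
      }
    }

With fp16.enabled the model still trains in half precision, but with auto_cast set to false DeepSpeed does not cast the forward inputs, so the LongTensor input_ids reach the embedding layer unchanged, which matches the behaviour described above. When going through Accelerate, one way to use such a file is to point the accelerate config at a custom DeepSpeed JSON config instead of relying only on the plugin defaults.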


You can solve it as described in my answer above (setting auto_cast to false in the DeepSpeed fp16 config).

Is this still an issue?