Huggingface Data Collator: Index put requires the source and destination dtypes match, got Float for the destination and Long for the source

I am trying to run the standard Huggingface pipeline to pre-train BERT on my dataset. Here is the error I get when I call trainer.train():

[Error message]

The following columns in the training set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: text, special_tokens_mask. If text, special_tokens_mask are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
/usr/local/lib/python3.8/dist-packages/transformers/ FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
***** Running training *****
  Num examples = 5596
  Num Epochs = 5
  Instantaneous batch size per device = 10
  Total train batch size (w. parallel, distributed & accumulation) = 80
  Gradient Accumulation steps = 8
  Total optimization steps = 350
  Number of trainable parameters = 109514298
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
RuntimeError                              Traceback (most recent call last)
<ipython-input-27-2820d0f34efe> in <module>
      1 # train the model
----> 2 trainer.train()

7 frames
/usr/local/lib/python3.8/dist-packages/transformers/data/ in torch_mask_tokens(self, inputs, special_tokens_mask)
    776         indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    777         random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
--> 778         inputs[indices_random] = random_words[indices_random]
    780         # The rest of the time (10% of the time) we keep the masked input tokens unchanged

RuntimeError: Index put requires the source and destination dtypes match, got Float for the destination and Long for the source.

These are my Data Collator and Training arguments:

# initialize the data collator, randomly masking 20% (default is 15%) of the tokens for the Masked Language
# Modeling (MLM) task
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2
)
training_args = TrainingArguments(
    output_dir=model_path,          # output directory where model checkpoints are saved
    evaluation_strategy="steps",    # evaluate every `logging_steps` steps
    num_train_epochs=5,             # number of training epochs, feel free to tweak
    per_device_train_batch_size=10, # the training batch size, set it as high as your GPU memory allows
    gradient_accumulation_steps=8,  # accumulate gradients before updating the weights
    per_device_eval_batch_size=64,  # evaluation batch size
    logging_steps=1000,             # evaluate, log and save model checkpoints every 1000 steps
    load_best_model_at_end=True,    # whether to load the best model (in terms of loss) at the end of training
    #save_total_limit=3,            # if disk space is limited, keep only 3 sets of model weights on disk
)

I am using BertWordPieceTokenizer.


  • Python 3.8
  • Pytorch 1.13.1+cu117
  • Ubuntu 20.04
  • Transformers 4.26.1

I used the same code to pre-train BERT three months ago and everything seemed to work perfectly. Is this issue caused by a recent update to the Huggingface library?

I have also tried with the OSCAR dataset provided by Huggingface, but the issue persists. Type-casting the tensors so that both are long (or float) leads to other errors. Does anyone know how to solve this?


I'm facing the same issue. Were you able to find a solution?

Not really. Still waiting for a fix!

Would you happen to remember the last version number for which this worked?

Unfortunately, I don’t. However, it worked in early November 2022 with the latest version of transformers at that time.

I’m also facing the same issue, any solution?

Still bugging me!

Hi all, I had the exact same error. In my case it had to do with the input length of my training samples, which were greater than 512 tokens. I used this as a workaround:

def encode_with_truncation(examples):
    # tokenize, then truncate/pad every sample to exactly max_length tokens
    return tokenizer(examples["text"], truncation=True, padding="max_length",
                     max_length=max_length, return_special_tokens_mask=True)

def encode_without_truncation(examples):
    # tokenize only, leaving samples at their original length
    return tokenizer(examples["text"], return_special_tokens_mask=True)

encode = encode_with_truncation if truncate_longer_samples else encode_without_truncation

train_dataset = d["train"].map(encode, batched=True)
test_dataset = d["test"].map(encode, batched=True)

if truncate_longer_samples:
    # keep only input_ids and attention_mask, returned as PyTorch tensors
    train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
    test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
else:
    # keep the columns as plain Python lists
    test_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])
    train_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])

and set truncate_longer_samples=True
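
If you want to check whether over-length samples are actually present before training, here is a minimal sketch in plain Python. It assumes each encoded example is a dict with an `input_ids` list, like the tokenizer output above. The helper name `find_overlong` and the toy data are made up for illustration and are not part of transformers or datasets.

```python
from typing import Dict, List, Sequence


def find_overlong(examples: Sequence[Dict[str, List[int]]],
                  max_length: int = 512) -> List[int]:
    """Return the indices of examples whose input_ids exceed max_length.

    Hypothetical diagnostic helper, not a library function.
    """
    return [
        idx
        for idx, example in enumerate(examples)
        if len(example["input_ids"]) > max_length
    ]


# Toy data standing in for a tokenized dataset (not real tokenizer output).
toy_examples = [
    {"input_ids": [101, 7592, 102]},      # short sample
    {"input_ids": list(range(512))},      # exactly at the limit
    {"input_ids": list(range(600))},      # longer than the limit
]

overlong = find_overlong(toy_examples)
print(f"{len(overlong)} of {len(toy_examples)} samples exceed 512 tokens: {overlong}")
```

If the list comes back non-empty, truncating (as above) or splitting those samples is worth trying first.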


@gabriead I would be interested in seeing how this issue was caused by sequences not being truncated. I’m having this issue myself with certain choices of hyperparameters, but I’m having a bit of difficulty reproducing it, so I’m just trying to gather all the information possible :).

It seems what gabriead said was the issue, at least in my case. I was following this article, set truncate_longer_samples to True, and the problem went away.

BERT has a 512-token limit, so you need to truncate or split the input.

From the Bert docs…

The only constrain is that the result with the two “sentences” has a combined length of less than 512 tokens.
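
For the "split" option, a rough sketch of cutting one long list of token ids into pieces of at most 512 tokens could look like the following. `split_into_chunks` is a hypothetical helper written for this post, not a transformers function, and it does not re-add special tokens such as [CLS]/[SEP] to each piece, so treat it as a starting point only.

```python
from typing import List


def split_into_chunks(token_ids: List[int], max_length: int = 512) -> List[List[int]]:
    """Split a list of token ids into consecutive chunks of at most max_length.

    Hypothetical helper: special tokens are NOT re-inserted per chunk.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]


# Toy ids standing in for one over-long tokenized sample.
long_sample = list(range(1200))
chunks = split_into_chunks(long_sample)
print([len(chunk) for chunk in chunks])  # → [512, 512, 176]
```

Every chunk then fits within the limit, at the cost of the last chunk being shorter than the others.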