Huggingface Data Collator: Index put requires the source and destination dtypes match, got Float for the destination and Long for the source

xencoder · March 2, 2023, 9:42pm

I am trying to run the standard Huggingface pipeline to pre-train BERT on my dataset. Here is the error I get when I try to train the model by calling trainer.train():

[Error message]

The following columns in the training set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: text, special_tokens_mask. If text, special_tokens_mask are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
/usr/local/lib/python3.8/dist-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 5596
  Num Epochs = 5
  Instantaneous batch size per device = 10
  Total train batch size (w. parallel, distributed & accumulation) = 80
  Gradient Accumulation steps = 8
  Total optimization steps = 350
  Number of trainable parameters = 109514298
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-27-2820d0f34efe> in <module>
      1 # train the model
----> 2 trainer.train()

7 frames
/usr/local/lib/python3.8/dist-packages/transformers/data/data_collator.py in torch_mask_tokens(self, inputs, special_tokens_mask)
    776         indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    777         random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
--> 778         inputs[indices_random] = random_words[indices_random]
    779 
    780         # The rest of the time (10% of the time) we keep the masked input tokens unchanged

RuntimeError: Index put requires the source and destination dtypes match, got Float for the destination and Long for the source.

These are my Data Collator and Training arguments:

# initialize the data collator, randomly masking 20% (default is 15%) of the tokens for the Masked Language
# Modeling (MLM) task
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2
)

training_args = TrainingArguments(
    output_dir=model_path,          # directory where model checkpoints are saved
    evaluation_strategy="steps",    # evaluate every `logging_steps` steps
    overwrite_output_dir=True,
    num_train_epochs=5,             # number of training epochs, feel free to tweak
    per_device_train_batch_size=10, # training batch size, set as high as your GPU memory allows
    gradient_accumulation_steps=8,  # accumulate gradients over 8 batches before updating the weights
    per_device_eval_batch_size=64,  # evaluation batch size
    logging_steps=1000,             # log and evaluate every 1000 steps
    save_steps=5000,                # save a checkpoint every 5000 steps
    load_best_model_at_end=True,    # load the best model (in terms of loss) at the end of training
    # save_total_limit=3,           # keep only 3 checkpoints on disk if space is limited
)
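
As a side check, the 350 optimization steps in the log follow from these arguments. This is a rough sketch that assumes a single device and that the last partial batch is kept; the formula is my understanding of how the Trainer counts steps, not something confirmed in this thread.

```python
import math

# Values taken from the training log and TrainingArguments above.
num_examples = 5596
per_device_train_batch_size = 10
gradient_accumulation_steps = 8
num_train_epochs = 5
logging_steps = 1000

# Assumption: one device, last partial batch is not dropped.
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
# One optimizer update per `gradient_accumulation_steps` batches.
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = updates_per_epoch * num_train_epochs

print(batches_per_epoch, updates_per_epoch, total_steps)
# If this is True, training ends before the first logging/evaluation step.
print(total_steps < logging_steps)
```

With these numbers the run is shorter than `logging_steps`, so a smaller value may be needed to see any evaluation output.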

I am using BertWordPieceTokenizer.

System:

Python 3.8
Pytorch 1.13.1+cu117
Ubuntu 20.04
Transformers 4.26.1

I used the same code to pre-train BERT three months ago and everything worked. Is this caused by a recent update to the Huggingface library?

I have also tried the OSCAR dataset provided by Huggingface, but the issue persists. Type-casting the tensors in data_collator.py so that both are long (or both float) runs into other errors. Does anyone know how to solve this?
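
For reference, the same kind of masked assignment can be tried in plain PyTorch, outside the Trainer. The sketch below only assumes that `inputs` reaches `torch_mask_tokens` as a float tensor; the shapes, the vocabulary size of 100, and the mask are made up for illustration. The last lines show one known way a float tensor can appear: `torch.tensor` on an empty list defaults to float32. Whether that is what happens in this pipeline is not established here.

```python
import torch

# Hypothetical stand-ins for the tensors inside torch_mask_tokens.
inputs = torch.zeros(2, 4)  # float32 by default, like the "destination"
random_words = torch.randint(100, (2, 4), dtype=torch.long)  # like the "source"
indices_random = torch.tensor([[True, False, False, True],
                               [False, True, False, False]])

error_message = None
try:
    # Same kind of masked assignment as line 778 in the traceback.
    inputs[indices_random] = random_words[indices_random]
except RuntimeError as err:
    error_message = str(err)
print(error_message)

# With an integer destination the assignment goes through.
inputs_long = inputs.long()
inputs_long[indices_random] = random_words[indices_random]
print(inputs_long.dtype)

# An empty list of token ids becomes a float tensor by default.
empty_ids_dtype = torch.tensor([]).dtype
print(empty_ids_dtype)
```

Casting to long only hides the symptom; if `input_ids` arrive as float, it is probably worth checking how the dataset was tokenized and formatted before it reaches the collator.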

Thanks.

aneeshjain · March 9, 2023, 9:02pm

I'm facing the same issue. Were you able to find a solution?

xencoder · March 9, 2023, 9:41pm

Not really. Still waiting for a fix!

aneeshjain · March 9, 2023, 9:43pm

Would you happen to remember the last version number for which this worked?

xencoder · March 10, 2023, 2:11am

Unfortunately, I don’t. However, it worked in early November 2022 with the latest version of transformers at that time.

Gurjot · May 18, 2023, 8:24am

I’m also facing the same issue, any solution?

xencoder · May 19, 2023, 11:04am

Still bugging me!

gabriead · June 1, 2023, 8:40am

Hi all, I had the exact same error. In my case it had to do with the input length of my training samples, which were greater than 512. I used this as a workaround:

def encode_with_truncation(examples):
    # tokenize, truncating and padding every sample to max_length
    return tokenizer(examples["text"], truncation=True, padding="max_length",
                     max_length=max_length, return_special_tokens_mask=True)

def encode_without_truncation(examples):
    # tokenize without truncation or padding
    return tokenizer(examples["text"], return_special_tokens_mask=True)

encode = encode_with_truncation if truncate_longer_samples else encode_without_truncation

train_dataset = d["train"].map(encode, batched=True)
test_dataset = d["test"].map(encode, batched=True)

if truncate_longer_samples:
    train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
    test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
else:
    test_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])
    train_dataset.set_format(columns=["input_ids", "attention_mask", "special_tokens_mask"])

and set truncate_longer_samples=True
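
To check whether over-long samples are present before training, something like the following could be used. It is a minimal sketch in plain Python: `find_bad_lengths` is a hypothetical helper (not part of transformers or datasets), the toy token ids are made up, and flagging empty samples is an extra assumption about what is worth ruling out.

```python
from typing import Dict, List

MAX_LENGTH = 512  # BERT's token limit discussed in this thread


def find_bad_lengths(input_ids: List[List[int]],
                     max_length: int = MAX_LENGTH) -> Dict[str, List[int]]:
    """Return the indices of samples that are empty or longer than max_length."""
    report: Dict[str, List[int]] = {"empty": [], "too_long": []}
    for idx, ids in enumerate(input_ids):
        if len(ids) == 0:
            report["empty"].append(idx)
        elif len(ids) > max_length:
            report["too_long"].append(idx)
    return report


# Toy token-id lists: a short sample, an empty one, and one with 602 tokens.
samples = [[101, 2023, 102], [], [101] + [2054] * 600 + [102]]
report = find_bad_lengths(samples)
print(report)  # {'empty': [1], 'too_long': [2]}
```

In practice the input would be the un-truncated token lists, for example `train_dataset["input_ids"]`, assuming the dataset returns plain Python lists at that point.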

lkurlandski · December 1, 2023, 3:40am

@gabriead I would be interested in seeing how this issue was caused by sequences not being truncated. I’m having this issue myself with certain choices of hyperparameters, but I’m having a bit of difficulty reproducing it, so I’m just trying to gather all the information possible :).

Mayukh · December 10, 2023, 7:54am

It seems what gabriead said was the issue, at least in my case. I was following this article and set truncate_longer_samples to True, and the problem went away.

panigrah · December 10, 2023, 10:16am

BERT has a 512-token limit, so you need to truncate or split the input.

From the Bert docs…

The only constrain is that the result with the two “sentences” has a combined length of less than 512 tokens.
