I am trying to run the standard Hugging Face pipeline to pre-train BERT on my own dataset. Here is the error I get when I attempt to train the model by calling trainer.train():
[Error message]
The following columns in the training set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: text, special_tokens_mask. If text, special_tokens_mask are not expected by `BertForMaskedLM.forward`, you can safely ignore this message.
/usr/local/lib/python3.8/dist-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
***** Running training *****
Num examples = 5596
Num Epochs = 5
Instantaneous batch size per device = 10
Total train batch size (w. parallel, distributed & accumulation) = 80
Gradient Accumulation steps = 8
Total optimization steps = 350
Number of trainable parameters = 109514298
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-27-2820d0f34efe> in <module>
1 # train the model
----> 2 trainer.train()
7 frames
/usr/local/lib/python3.8/dist-packages/transformers/data/data_collator.py in torch_mask_tokens(self, inputs, special_tokens_mask)
776 indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
777 random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
--> 778 inputs[indices_random] = random_words[indices_random]
779
780 # The rest of the time (10% of the time) we keep the masked input tokens unchanged
RuntimeError: Index put requires the source and destination dtypes match, got Float for the destination and Long for the source.
These are my Data Collator and Training arguments:
from transformers import DataCollatorForLanguageModeling, TrainingArguments

# initialize the data collator, randomly masking 20% of the tokens (the default is 15%)
# for the Masked Language Modeling (MLM) task
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2
)

training_args = TrainingArguments(
    output_dir=model_path,            # output directory where model checkpoints are saved
    evaluation_strategy="steps",      # evaluate every `logging_steps` steps
    overwrite_output_dir=True,
    num_train_epochs=5,               # number of training epochs, feel free to tweak
    per_device_train_batch_size=10,   # training batch size, set as high as your GPU memory allows
    gradient_accumulation_steps=8,    # accumulate gradients before updating the weights
    per_device_eval_batch_size=64,    # evaluation batch size
    logging_steps=1000,               # evaluate, log and save model checkpoints every 1000 steps
    save_steps=5000,
    load_best_model_at_end=True,      # load the best model (in terms of loss) at the end of training
    # save_total_limit=3,             # uncomment to keep only the 3 most recent checkpoints if disk space is limited
)
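For completeness, the rest of the setup (tokenized dataset and Trainer) looks roughly like the sketch below; the dataset object d, the split names, and max_length are simplified placeholders rather than my exact code, but return_special_tokens_mask=True corresponds to the special_tokens_mask column mentioned in the warning above.

from transformers import BertConfig, BertForMaskedLM, Trainer

# model initialised from scratch (config values simplified here)
model_config = BertConfig(vocab_size=tokenizer.vocab_size, max_position_embeddings=512)
model = BertForMaskedLM(config=model_config)

# tokenize the raw "text" column; return_special_tokens_mask=True produces the
# special_tokens_mask column that the data collator consumes
def encode(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
        return_special_tokens_mask=True,
    )

# "d" is the tokenized DatasetDict (placeholder name for my dataset)
train_dataset = d["train"].map(encode, batched=True)
eval_dataset = d["test"].map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)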
I am using BertWordPieceTokenizer.
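Roughly, the tokenizer side looks like the reconstruction below (the corpus file, vocab_size, and output directory are placeholders); the WordPiece vocabulary is reloaded as the BertTokenizerFast that the Trainer note above refers to.

from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizerFast

# train a WordPiece vocabulary on the raw corpus (file name is a placeholder)
wp_tokenizer = BertWordPieceTokenizer()
wp_tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30522,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
wp_tokenizer.save_model("tokenizer_dir")  # writes vocab.txt

# reload the vocabulary as a fast tokenizer for the Trainer / data collator
tokenizer = BertTokenizerFast.from_pretrained("tokenizer_dir")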
System:
- Python 3.8
- PyTorch 1.13.1+cu117
- Ubuntu 20.04
- Transformers 4.26.1
I used the same code to pre-train BERT three months ago and everything worked perfectly. Could this be caused by a recent update to the Hugging Face libraries?
I have also tried the OSCAR dataset provided by Hugging Face, but the issue persists. Type-casting the tensors in data_collator.py so that both source and destination are long (or float) only runs into other errors (the sketch below shows essentially what I tried). Does anyone know how to solve this?
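Concretely, the edit amounts to the following, shown as a subclass so it can be reproduced without patching the installed package (the class name is mine, and my actual in-place edit may have differed slightly):

from transformers import DataCollatorForLanguageModeling

class CastingDataCollator(DataCollatorForLanguageModeling):
    """Same as DataCollatorForLanguageModeling, but forces the inputs to long
    before masking, mirroring the edit I tried inside data_collator.py."""

    def torch_mask_tokens(self, inputs, special_tokens_mask=None):
        inputs = inputs.long()  # avoid the Float/Long mismatch at line 778
        return super().torch_mask_tokens(inputs, special_tokens_mask=special_tokens_mask)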
Thanks.