I am trying to run the standard Hugging Face pipeline to pre-train BERT on my dataset. Here is the error I get when I attempt to train the model by calling `trainer.train()`:
```
The following columns in the training set don't have a corresponding argument in
`BertForMaskedLM.forward` and have been ignored: text, special_tokens_mask. If text,
special_tokens_mask are not expected by `BertForMaskedLM.forward`, you can safely
ignore this message.
/usr/local/lib/python3.8/dist-packages/transformers/optimization.py:306: FutureWarning:
This implementation of AdamW is deprecated and will be removed in a future version.
Use the PyTorch implementation torch.optim.AdamW instead, or set
`no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 5596
  Num Epochs = 5
  Instantaneous batch size per device = 10
  Total train batch size (w. parallel, distributed & accumulation) = 80
  Gradient Accumulation steps = 8
  Total optimization steps = 350
  Number of trainable parameters = 109514298
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer,
using the `__call__` method is faster than using a method to encode the text followed
by a call to the `pad` method to get a padded encoding.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-27-2820d0f34efe> in <module>
      1 # train the model
----> 2 trainer.train()

7 frames
/usr/local/lib/python3.8/dist-packages/transformers/data/data_collator.py in torch_mask_tokens(self, inputs, special_tokens_mask)
    776         indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    777         random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
--> 778         inputs[indices_random] = random_words[indices_random]
    779
    780         # The rest of the time (10% of the time) we keep the masked input tokens unchanged

RuntimeError: Index put requires the source and destination dtypes match, got Float for the destination and Long for the source.
```
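The failing line is a masked index-put where the source (`random_words`) is `torch.long`, so the error implies the destination (`inputs`, the batch of token ids) is somehow a float tensor. The same `RuntimeError` can be reproduced in isolation with plain PyTorch, independent of `transformers` (a minimal sketch):

```python
import torch

# Minimal reproduction of the collator's failing index-put:
# the destination (`inputs`) is float, the source (`random_words`) is long.
inputs = torch.full((8,), 0.5)                            # float destination
masked = torch.tensor([True, False] * 4)                  # deterministic mask
random_words = torch.randint(100, (8,), dtype=torch.long) # long source

try:
    inputs[masked] = random_words[masked]
    raised = False
except RuntimeError as err:
    raised = True
    print(err)
```

This raises the same "Index put requires the source and destination dtypes match" error, which is why I suspect the problem is the dtype of the tensors reaching the collator rather than the collator itself.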
```python
# initialize the data collator, randomly masking 20% (default is 15%)
# of the tokens for the Masked Language Modeling (MLM) task
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2
)
```
```python
training_args = TrainingArguments(
    output_dir=model_path,           # output directory where model checkpoints are saved
    evaluation_strategy="steps",     # evaluate every `logging_steps` steps
    overwrite_output_dir=True,
    num_train_epochs=5,              # number of training epochs, feel free to tweak
    per_device_train_batch_size=10,  # training batch size; set as high as your GPU memory allows
    gradient_accumulation_steps=8,   # accumulate gradients before updating the weights
    per_device_eval_batch_size=64,   # evaluation batch size
    logging_steps=1000,              # evaluate, log and save model checkpoints every 1000 steps
    save_steps=5000,
    load_best_model_at_end=True,     # load the best model (in terms of loss) at the end of training
    # save_total_limit=3,            # if you don't have much disk space, keep only 3 checkpoints
)
```
I am using `BertWordPieceTokenizer`. My environment:
- Python 3.8
- PyTorch 1.13.1+cu117
- Ubuntu 20.04
- Transformers 4.26.1
I used the same code to pre-train BERT three months ago and everything worked perfectly. Is this an issue caused by a recent update to the Hugging Face libraries?
I have also tried the OSCAR dataset provided by Hugging Face, but the issue persists. Type-casting the tensors in `data_collator.py` so that both are long (or float) runs into other errors. Does anyone know how to solve this?
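For reference, the same index-put succeeds when the destination is already an integer tensor, which suggests the `input_ids` reaching the collator should be `torch.long` but are arriving as float in my case (a minimal sketch mirroring the collator's masking step, using plain PyTorch rather than the actual `transformers` code):

```python
import torch

labels_shape = (2, 8)
# what DataCollatorForLanguageModeling expects: integer token ids
inputs = torch.randint(100, labels_shape, dtype=torch.long)
indices_random = torch.tensor([[True, False] * 4] * 2)
random_words = torch.randint(100, labels_shape, dtype=torch.long)

# no error here: source and destination are both torch.long
inputs[indices_random] = random_words[indices_random]
```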