Question about tokenizer truncation length

I'm using the RoBERTa-large model to train a masked language model. Generally, there are '<mask>' tokens in the input of an MLM. But what if the input is too long and the tokenizer cuts the '<mask>' token off? Does this cause a problem? In my opinion, during training, cutting off '<mask>' means this input doesn't contribute to the loss. During inference, if there is no '<mask>' in the input, the MLM doesn't know what to predict, so it just outputs the same as the input. Am I right?
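
For concreteness, here is a small sketch of the situation I mean, with a made-up text and an artificially small max length so that the '<mask>' near the end gets truncated away:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")

# A long input whose '<mask>' sits near the end, plus a tiny max_length
# so that truncation removes it.
text = "The capital of France is " * 20 + "<mask>."
encoding = tokenizer(text, truncation=True, max_length=16)

# The mask token id is no longer in the truncated input.
print(tokenizer.mask_token_id in encoding["input_ids"])  # False
```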
Please forgive my language; I'm not a native speaker.

The Hugging Face example scripts will usually not truncate the texts and will instead group them. If your max length is 512 and your examples have sequence lengths of 100, 200, 300, 700, 800, and 900, they will be concatenated and regrouped into chunks of 512 rather than each example being truncated on its own. Doing it this way will result in no truncated tokens.
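
As a rough sketch, the grouping step in the example scripts (e.g. run_mlm.py) looks something like this; the block size and exact field handling may differ slightly from the current scripts:

```python
def group_texts(examples, block_size=512):
    """Concatenate already-tokenized texts and split them into fixed-size chunks."""
    # Concatenate all sequences in the batch, field by field
    # (input_ids, attention_mask, ...).
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder so the total is a multiple of block_size
    # (see the edit below).
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
```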

Edit: My original statement was not quite correct. If the total number of tokens is not a multiple of the chunk size, the remainder will be dropped. For example, 4,096 tokens would get chunked into 8 chunks of 512 with nothing dropped, while 5,000 tokens would get chunked into 9 chunks of 512 and the remaining 392 tokens would be dropped.
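
Using the grouping sketch above, the arithmetic is easy to check with dummy token ids:

```python
# 5,000 dummy token ids with a chunk size of 512.
dummy = {"input_ids": [list(range(5000))]}
chunks = group_texts(dummy, block_size=512)["input_ids"]

print(len(chunks))       # 9 chunks of 512 tokens each
print(5000 - 9 * 512)    # 392 leftover tokens are dropped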

Since 15% of the tokens are masked, it would be very rare for a chunk to contain no mask tokens. Even if it did, there wouldn't be any loss for that step, because there wouldn't be any predictions or labels.
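
Here is a minimal sketch of that mechanism with the standard data collator (15% masking, the default): positions that are not masked get the label -100, which the loss function ignores, so a chunk with no masked positions contributes nothing to the loss.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Collate a single tokenized example; the collator applies random masking
# and builds the labels tensor.
batch = collator([tokenizer("A short example sentence for masked language modeling.")])

# Unmasked positions have label -100, which cross-entropy ignores.
print(batch["labels"])
```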