Using truncated fragments as input samples in training

lesscomfortable · June 18, 2021, 6:27pm

Hi!

I am using the tokenizers library, roughly following the run_mlm.py script to train a Masked Language Model (MobileBert) from scratch.

Since I am training an unsupervised model on truncated sentences, I was wondering whether the truncated (left-out) fragments are included in the training dataset by default, since they would be valid examples for my use case (MLM setting). If they are not (which I believe to be the case), is there an easy way to include them in my training dataset (maybe by using return_overflowing_tokens and stride in a smart way)?

As an additional related question, I would like to know if there is any native way of sorting by length before batching to reduce the dataset size to the minimum. Something along these lines: pommedeterresautee gist and McCormickML blogpost.

EDIT: The best way I have found to do the smart batching is to create a ‘sample_length’ column and use the .sort method to sort by that column before tokenizing.
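
To see why sorting by a length column helps, here is a small pure-Python sketch (not taken from the thread). It assumes each batch is padded to its longest sample; the lengths are made-up random numbers standing in for the ‘sample_length’ column, and padded_size is a hypothetical helper name.

```python
import random

def padded_size(lengths, batch_size):
    """Total token slots when each batch is padded to its longest sample."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total

random.seed(0)
# Hypothetical per-sample token counts, standing in for a 'sample_length' column.
sample_length = [random.randint(5, 128) for _ in range(1000)]

real_tokens = sum(sample_length)
unsorted_slots = padded_size(sample_length, batch_size=32)
sorted_slots = padded_size(sorted(sample_length), batch_size=32)

print("real tokens:          ", real_tokens)
print("slots, original order:", unsorted_slots)
print("slots, sorted order:  ", sorted_slots)
```

The gap between the last two numbers is the padding saved by batching samples of similar length together.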

Thanks in advance!

sgugger · June 21, 2021, 1:06pm

By default, those are not included (unless you run the script without the --line_by_line option, in which case it concatenates all the samples and then creates blocks of the size you picked). Using return_overflowing_tokens is definitely an option to get those truncated parts! stride is only needed if you want some overlap between the two parts of a long sentence, which is useful for question answering, but not necessarily for masked language modeling pretraining.
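
As a rough illustration of what keeping the overflow means, the sketch below splits a list of token ids into chunks of at most max_length, with stride tokens of overlap between consecutive chunks. It is a simplified pure-Python stand-in, not the tokenizer's actual implementation: split_with_overflow is a hypothetical helper, and special tokens are ignored.

```python
def split_with_overflow(ids, max_length, stride=0):
    """Split ids into chunks of at most max_length tokens.

    stride is the number of tokens shared by two consecutive chunks.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    if not ids:
        return []
    step = max_length - stride  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(ids[start:start + max_length])
        if start + max_length >= len(ids):
            break  # this chunk already reaches the end of the sequence
        start += step
    return chunks

token_ids = list(range(10))  # made-up ids for one long sentence
no_overlap = split_with_overflow(token_ids, max_length=4)
with_overlap = split_with_overflow(token_ids, max_length=4, stride=1)
print(no_overlap)
print(with_overlap)
```

With a fast tokenizer from transformers, the corresponding call would be along the lines of tokenizer(texts, truncation=True, max_length=..., return_overflowing_tokens=True). Since one input can then yield several rows, a datasets map over it would presumably need batched=True and the original columns removed.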

For the sorting by length before batching, we have the --group_by_length option in the Trainer, though it’s for the dataset so it happens after tokenization, which may not be what you are looking for.

lesscomfortable · June 30, 2021, 9:27pm

Thanks Sylvain! As per your video, I understand that the --group_by_length option is compatible with the DataCollatorWithPadding. Is it also compatible with the DataCollatorForLanguageModeling?
I understand it is, since according to the docs the DataCollatorForLanguageModeling dynamically pads to make batches even.

sgugger · July 1, 2021, 12:38pm

Yes, it’s compatible with any data collator, it changes the sampler of the dataset only.
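
The separation between sampler and collator can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Trainer's actual code: length_grouped_batches, mega_factor and pad_collate are hypothetical names, and the stand-in collator only pads (the real DataCollatorForLanguageModeling also applies masking).

```python
import random

def length_grouped_batches(lengths, batch_size, mega_factor=50, seed=0):
    """Return batches of indices in which samples have similar lengths.

    Indices are shuffled, cut into large groups, and sorted by length
    inside each group, so some randomness between groups is kept.
    """
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)
    mega = batch_size * mega_factor
    batches = []
    for start in range(0, len(indices), mega):
        group = sorted(indices[start:start + mega],
                       key=lambda i: lengths[i], reverse=True)
        for b in range(0, len(group), batch_size):
            batches.append(group[b:b + batch_size])
    return batches

def pad_collate(batch, pad_id=0):
    """Stand-in collator: pad every sample to the longest one in the batch."""
    width = max(len(sample) for sample in batch)
    return [sample + [pad_id] * (width - len(sample)) for sample in batch]

# Made-up tokenized samples of different lengths.
samples = [[1] * n for n in (3, 9, 4, 8, 2, 10, 5, 7)]
lengths = [len(s) for s in samples]

# The sampler only decides which indices go together ...
batches = length_grouped_batches(lengths, batch_size=2, mega_factor=2)
# ... and any collator can then be applied to each batch independently.
padded = [pad_collate([samples[i] for i in batch]) for batch in batches]
for rows in padded:
    print([len(row) for row in rows])
```

Because the grouping step only produces index batches, swapping pad_collate for a different collator needs no change to the sampler.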
