Using truncated fragments as input samples in training

Hi!

I am using the tokenizers library, roughly following the run_mlm.py script to train a Masked Language Model (MobileBert) from scratch.

Since I am training an unsupervised model using truncated sentences, I was wondering if the truncated (left-out) fragments are included by default in the dataset for training since they would be valid examples for my use case (MLM setting). If they are not used (which I believe to be the case) I wanted to ask if there is any easy way in which I might include them in my training dataset (maybe by using return_overflowing_tokens and stride in a smart way?).

As an additional related question, I would like to know if there is any native way of sorting by length before batching, to minimize padding (and so the effective size of the dataset). Something along these lines: pommedeterresautee gist and McCormickML blogpost.

EDIT: The best way I have found to do the smart batching is to create a 'sample_length' column and use the .sort method to sort by that column before tokenizing.

Thanks in advance!

By default, those are not included: with truncation, the tokenizer simply drops whatever exceeds the maximum length. (In run_mlm.py, truncation only happens with the --line_by_line option; without it, the script concatenates all the samples and then creates blocks of the size you picked, so long samples are split across blocks rather than truncated.) Using return_overflowing_tokens is definitely an option to get those truncated parts! stride is only needed if you want some overlap between the two parts of a long sentence, which is useful for question answering, but not necessarily for masked language modeling pretraining.
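To make the effect of return_overflowing_tokens and stride concrete, here is a minimal pure-Python sketch of the chunking. It is an illustration, not the library's implementation: the helper name split_with_overflow is hypothetical, and special tokens such as [CLS] and [SEP] are ignored. With a real tokenizer the equivalent call would be along the lines of tokenizer(texts, truncation=True, max_length=..., return_overflowing_tokens=True), and when it is used inside Dataset.map with batched=True the original columns typically need to be removed, because one input row can produce several output rows.

```python
from typing import List


def split_with_overflow(token_ids: List[int], max_length: int, stride: int = 0) -> List[List[int]]:
    """Split one tokenized sample into chunks of at most `max_length` tokens.

    Hypothetical helper: it mimics what keeping the overflowing tokens gives
    you, ignoring special tokens. Consecutive chunks share `stride` tokens.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")

    step = max_length - stride  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this chunk reached the end of the sample
        start += step
    return chunks


sample = list(range(10))  # stand-in for the token ids of one long sentence
no_overlap = split_with_overflow(sample, max_length=4)
with_overlap = split_with_overflow(sample, max_length=4, stride=1)
print(no_overlap)
print(with_overlap)
```

For MLM pretraining the no-overlap variant (stride left at 0) is usually enough: every fragment becomes its own training sample and no token is seen twice.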

For the sorting by length before batching, we have the --group_by_length option in the Trainer, though it's for the dataset so it happens after tokenization, which may not be what you are looking for.
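As a rough illustration of why grouping by length helps, the sketch below counts how many tokens a dataset occupies once every batch is padded to its longest sample, first in the original order and then sorted by length. The helper padded_tokens and the sample lengths are made up for the example. Note that --group_by_length groups samples of roughly the same length rather than doing a strict sort, so the real saving will sit somewhere between the two numbers.

```python
from typing import Sequence


def padded_tokens(lengths: Sequence[int], batch_size: int) -> int:
    """Total token count after padding each batch to its longest sample."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total


# Made-up tokenized sample lengths: a mix of short and long sentences.
lengths = [5, 120, 8, 110, 6, 128, 7, 115]

real_tokens = sum(lengths)
in_order = padded_tokens(lengths, batch_size=4)
by_length = padded_tokens(sorted(lengths), batch_size=4)

print(f"real tokens:            {real_tokens}")
print(f"padded, original order: {in_order}")
print(f"padded, sorted:         {by_length}")
```

Dynamic padding in the data collator is what turns the grouping into an actual saving: if every sample were padded to a fixed maximum length during tokenization, the order of the samples would make no difference.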

Thanks Sylvain! As per your video, I understand that the --group_by_length option is compatible with the DataCollatorWithPadding. Is it also compatible with the DataCollatorForLanguageModeling?
I understand it is, since according to the docs the DataCollatorForLanguageModeling dynamically pads the samples in a batch to the same length.

Yes, it's compatible with any data collator; it only changes the sampler of the dataset.
