Using truncated fragments as input samples in training

Hi!

I am using the tokenizers library, roughly following the run_mlm.py script to train a Masked Language Model (MobileBert) from scratch.

Since I am training an unsupervised model on truncated sentences, I was wondering whether the truncated (left-out) fragments are included in the training dataset by default, since they would be valid examples for my use case (MLM setting). If they are not (which I believe to be the case), is there an easy way to include them in my training dataset, maybe by using return_overflowing_tokens and stride in a smart way?

As an additional related question, I would like to know if there is a native way of sorting by length before batching, to keep the amount of padding (and so the padded size of the dataset) to a minimum. Something along these lines: pommedeterresautee gist and McCormickML blog post.

EDIT: The best way I have found to do the smart batching is to create a ‘sample_length’ column and use the .sort method to sort by that column before tokenizing.
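
To illustrate why sorting by length helps, here is a minimal plain-Python sketch (not the datasets API itself). The helpers make_batches and padded_size are hypothetical names made up for this example, and the samples are dummy token lists. It only compares how many token slots the batches occupy once each batch is padded to its longest sample.

```python
def make_batches(samples, batch_size):
    """Cut the list of samples into consecutive batches (hypothetical helper)."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def padded_size(batches):
    """Total token slots once every batch is padded to its longest sample."""
    return sum(len(batch) * max(len(s) for s in batch) for batch in batches)


# Dummy "tokenized" samples: only their lengths matter here.
samples = [[0] * n for n in (12, 3, 11, 2, 10, 4)]
real_tokens = sum(len(s) for s in samples)

# Batching in the original order mixes long and short samples.
unsorted_cost = padded_size(make_batches(samples, batch_size=2))

# Sorting by length first puts samples of similar length in the same batch.
sorted_cost = padded_size(make_batches(sorted(samples, key=len), batch_size=2))

print(real_tokens, unsorted_cost, sorted_cost)
```

One caveat of a plain sort is that the batches are no longer random, so it is probably worth shuffling the order of the batches themselves before training.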

Thanks in advance!

Those are not included when you use the --line_by_line option, since each line is then truncated on its own. If you leave that option out, the script concatenates all the samples and then creates blocks of the size you picked, so nothing is dropped. Using return_overflowing_tokens is definitely an option to get those truncated parts! stride is only needed if you want some overlap between the two parts of a long sentence, which is useful for question answering, but not necessarily for masked language modeling pretraining.
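
As a rough illustration of what keeping the overflowing tokens means, here is a plain-Python sketch that works on a list of token ids. The function split_with_overflow is a hypothetical stand-in written for this example, not the tokenizer's own implementation, and it ignores special tokens. Its max_length and stride arguments are assumed to play the same role as the tokenizer arguments of the same name, with stride being the number of tokens shared by two consecutive chunks.

```python
def split_with_overflow(token_ids, max_length, stride=0):
    """Split one tokenized sample into chunks of at most max_length tokens.

    Hypothetical helper: instead of dropping what does not fit in the
    first chunk, the remaining tokens are returned as extra chunks.
    stride is the overlap (in tokens) between consecutive chunks.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")

    step = max_length - stride
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the current chunk already reaches the end of the sample
        start += step
    return chunks


sample = list(range(10))  # pretend these are 10 token ids

# No overlap: every token ends up in exactly one training example.
no_overlap = split_with_overflow(sample, max_length=4)

# With stride=2, consecutive chunks share two tokens.
with_overlap = split_with_overflow(sample, max_length=4, stride=2)

print(no_overlap)
print(with_overlap)
```

For MLM pretraining the version without overlap is probably enough, since every token is already covered once.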

For the sorting by length before batching, we have the --group_by_length option in the Trainer, though it works on the dataset that is passed to the Trainer, so it happens after tokenization, which may not be what you are looking for.

Thanks Sylvain! As per your video, I understand that the --group_by_length option is compatible with the DataCollatorWithPadding. Is it also compatible with the DataCollatorForLanguageModeling?
I understand it is, since according to the docs the DataCollatorForLanguageModeling dynamically pads the samples so that every batch has an even length.

Yes, it’s compatible with any data collator, since it only changes the sampler of the dataset.
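
To make the "sampler only" point concrete, here is a simplified plain-Python sketch of length-grouped sampling. It is an assumption about the general idea, not the Trainer's actual code: length_grouped_indices and mega_batch_mult are names chosen for this example. The indices are shuffled, cut into large groups, and sorted by length inside each group, so batches taken in order hold samples of similar length while some randomness is kept. The data collator never appears here; it just receives whatever samples the sampler picks.

```python
import random


def length_grouped_indices(lengths, batch_size, mega_batch_mult=4, seed=0):
    """Return dataset indices ordered so that neighbours have similar lengths.

    Hypothetical sketch: shuffle, split into groups of
    batch_size * mega_batch_mult indices, sort each group by length
    (longest first), then concatenate the groups.
    """
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)

    mega_size = batch_size * mega_batch_mult
    ordered = []
    for start in range(0, len(indices), mega_size):
        mega = indices[start:start + mega_size]
        mega.sort(key=lambda i: lengths[i], reverse=True)
        ordered.extend(mega)
    return ordered


lengths = [5, 50, 7, 48, 6, 52, 8, 49]  # token counts of 8 dummy samples
order = length_grouped_indices(lengths, batch_size=2, mega_batch_mult=2)

# Consecutive pairs of indices form the batches handed to the collator.
batches = [order[i:i + 2] for i in range(0, len(order), 2)]
print([[lengths[i] for i in batch] for batch in batches])
```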
