Hi!
I am using the `tokenizers` library, roughly following the `run_mlm.py` script, to train a Masked Language Model (MobileBERT) from scratch.
Since I am training an unsupervised model on truncated sentences, I was wondering whether the truncated (left-out) fragments are included in the training dataset by default, since they would be valid examples for my use case (MLM setting). If they are not used (which I believe is the case), is there an easy way to include them in my training dataset (maybe by using `return_overflowing_tokens` and `stride` in a smart way)?
As an additional related question, I would like to know if there is a native way of sorting examples by length before batching, to keep padding to a minimum. Something along these lines: pommedeterresautee's gist and the McCormickML blog post on smart batching.
EDIT: The best way I have found to do the smart batching is to create a `sample_length` column and use the `.sort` method to sort by that column before tokenizing.
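To illustrate why sorting by a length column helps, here is a plain-Python sketch of the effect (the function name and toy data are mine, not from any library): once examples are ordered by length, each batch groups similar lengths, so padding waste per batch is minimal.

```python
def smart_batches(examples, batch_size):
    """Sort examples by length (the `sample_length` column idea),
    then slice consecutive batches so each batch mixes only
    similar-length sequences and needs little padding."""
    ordered = sorted(examples, key=len)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

# toy "tokenised" examples of lengths 5, 2, 9, 3
examples = [[1] * 5, [1] * 2, [1] * 9, [1] * 3]
for batch in smart_batches(examples, batch_size=2):
    print([len(x) for x in batch])
# → [2, 3] then [5, 9]: short sequences are no longer padded up
#   to the length of the longest sequence in the whole dataset
```

With the `datasets` API this corresponds to adding the length column via `.map` and then calling `.sort("sample_length")` before batching.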
Thanks in advance!