I am using the tokenizers library and roughly following the run_mlm.py script to train a masked language model (MobileBERT) from scratch.
Since I am training an unsupervised model on truncated sentences, I was wondering whether the truncated (left-out) fragments are included in the training dataset by default, since they would be valid examples for my use case (MLM setting). If they are not used (which I believe is the case), is there an easy way to include them in my training dataset (maybe by using stride in a smart way?).
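Just to illustrate what I mean by using stride: I believe the fast tokenizers can return the overflow via `return_overflowing_tokens=True` together with a `stride` argument, but here is a minimal, library-free sketch of the idea (the `chunk_with_stride` helper is hypothetical, not a library function):

```python
def chunk_with_stride(ids, max_length, stride):
    # Split a token-id sequence into windows of at most max_length,
    # stepping by (max_length - stride) so consecutive windows overlap
    # by `stride` tokens and the tail fragment is not thrown away.
    step = max_length - stride
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + max_length])
        if start + max_length >= len(ids):
            break
    return chunks

# Example: a 10-token sequence, max_length=4, stride=1
print(chunk_with_stride(list(range(10)), max_length=4, stride=1))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With plain truncation only the first window would survive; with stride-based chunking the leftover fragments become extra MLM examples.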
As an additional related question, I would like to know if there is any native way of sorting examples by length before batching, so that padding is kept to a minimum. Something along these lines: pommedeterresautee gist and McCormickML blogpost.
EDIT: The best way I have found to do the smart batching is to create a ‘sample_length’ column and use the .sort method to sort by that column before tokenizing.
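For anyone curious why sorting helps, here is a small self-contained sketch of the padding savings (plain Python, no datasets API; the helper names are mine):

```python
def smart_batches(lengths, batch_size):
    # Sort example indices by length so each batch groups
    # similarly sized examples together.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def padded_tokens(lengths, batches):
    # Total token slots once each batch is padded to its own longest example.
    return sum(max(lengths[i] for i in b) * len(b) for b in batches)

lengths = [5, 50, 7, 48, 6, 52]            # sample lengths in tokens
naive = [[0, 1], [2, 3], [4, 5]]           # batching in dataset order
smart = smart_batches(lengths, 2)          # batching after length sort
print(padded_tokens(lengths, naive), padded_tokens(lengths, smart))
# → 300 212
```

Each batch only pads up to its own longest sample, so grouping short samples together cuts the wasted padding tokens considerably.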
Thanks in advance!