Customizing the ordering of training samples

I am using BERT to classify texts that, in many cases, greatly exceed 512 tokens. I split each text into segments of 512 tokens and use those segments as the training samples instead. However, segments from the same sequence will of course be quite similar to one another, so some bias towards longer sequences is introduced. And assuming that the huggingface library samples randomly from the training set, segments from longer sequences are more likely to be chosen.

What I want to do is override the ordering such that a segment from each sequence is used before sampling another segment from a sequence that has already been used. For shorter sequences, it is okay if the same segment needs to be sampled multiple times.

For example, if I have three sequences, each composed of three segments, say [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']], then I would want a potential training order to be something like:
'b', 'f', 'i', 'a', 'd', 'h', 'c', 'e', 'g'

How can I accomplish such a task?


What you can do is follow the approach explained in this tutorial: a sliding window. You create multiple training examples for a given text by sliding a window (with some overlap) across it, and you label each training example with the label of the text it came from.

You just need to pass return_overflowing_tokens=True when calling the tokenizer (together with truncation and a max_length; the stride argument controls the overlap between windows).
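To make the windowing concrete, here is a plain-Python sketch of roughly what the overflow splitting amounts to. It is not the tokenizer's own implementation: the function name sliding_windows is made up for illustration, and it ignores the special tokens ([CLS], [SEP]) that the real tokenizer counts towards max_length. With the library itself, the equivalent call would look something like tokenizer(text, truncation=True, max_length=512, stride=128, return_overflowing_tokens=True).

```python
def sliding_windows(token_ids, max_length=512, stride=128):
    """Split token_ids into windows of at most max_length tokens.

    Consecutive windows share `stride` tokens. Hypothetical helper for
    illustration only; special tokens are not accounted for.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride  # how far the window advances each time
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the text
        start += step
    return windows


def build_examples(texts_as_ids, labels, max_length=512, stride=128):
    """Give every window the label of the text it came from.

    Also records the index of the source text, which is useful if you
    later want to control the sampling order per text.
    """
    examples = []
    for text_idx, (ids, label) in enumerate(zip(texts_as_ids, labels)):
        for window in sliding_windows(ids, max_length, stride):
            examples.append({"input_ids": window, "label": label, "text_idx": text_idx})
    return examples
```

Keeping text_idx next to each window means you can always tell which windows belong to the same original text.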

Next, you can create a standard PyTorch DataLoader with shuffle=True. This shuffles all training examples on every epoch, regardless of which text they came from.
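If plain shuffling is not enough and you really want the ordering from the question (one segment from every sequence before any sequence is revisited), one option is to compute the index order yourself. The sketch below is only an illustration under that assumption: round_robin_order and to_flat_indices are hypothetical helper names, and the examples are assumed to be stored sequence by sequence in one flat dataset.

```python
import random


def round_robin_order(segments_per_sequence, seed=None):
    """Return (sequence, segment) pairs in rounds.

    Each round visits every non-empty sequence once, in random order.
    Sequences that run out of unseen segments reuse a random one.
    """
    rng = random.Random(seed)
    # Shuffle the segment indices inside each sequence once up front.
    pools = [rng.sample(range(n), n) for n in segments_per_sequence]
    rounds = max(segments_per_sequence, default=0)
    order = []
    for r in range(rounds):
        seq_ids = list(range(len(pools)))
        rng.shuffle(seq_ids)  # random sequence order within the round
        for s in seq_ids:
            pool = pools[s]
            if not pool:
                continue  # sequence without segments, nothing to sample
            seg = pool[r] if r < len(pool) else rng.choice(pool)
            order.append((s, seg))
    return order


def to_flat_indices(order, segments_per_sequence):
    """Map (sequence, segment) pairs to positions in a flat dataset."""
    offsets, total = [], 0
    for n in segments_per_sequence:
        offsets.append(total)
        total += n
    return [offsets[s] + seg for s, seg in order]


order = round_robin_order([3, 3, 3], seed=0)
flat = to_flat_indices(order, [3, 3, 3])
```

The resulting list of flat indices can then be handed to the DataLoader through its sampler argument, with shuffle left at its default of False (the two options are mutually exclusive). Regenerate the order with a different seed at the start of each epoch if you want it to vary.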