I am using BERT to classify text that (in many cases) greatly exceeds 512 tokens. What I am doing is splitting the text into segments of 512 tokens and using those as the training samples instead. However, the segments belonging to the same sequence will of course be quite similar to one another, so a bias towards longer sequences is introduced. And assuming that the Hugging Face library samples randomly from the training set, segments from the longer sequences are more likely to be chosen.
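For concreteness, here is a sketch of one way the splitting could be done with a fast tokenizer (the model name, texts, and labels are just placeholders); I mention it mainly because `overflow_to_sample_mapping` keeps track of which document each segment came from, which is relevant to the sampling question below:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

texts = ["first long document ...", "second long document ..."]  # raw training texts
labels = [0, 1]                                                   # one label per document

enc = tokenizer(
    texts,
    max_length=512,
    truncation=True,
    return_overflowing_tokens=True,  # emit every 512-token window, not just the first
    stride=0,                        # no overlap between consecutive segments
    padding="max_length",
)

# Maps each segment back to the index of the document it was cut from,
# and repeats the document label for every one of its segments.
sequence_ids = enc["overflow_to_sample_mapping"]
segment_labels = [labels[i] for i in sequence_ids]
```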
What I want to do is override the sampling order so that one segment from each sequence is used before a second segment is drawn from any sequence that has already contributed one. For shorter sequences, it is okay if the same segment needs to be sampled multiple times.
For example, if I have three sequences, each composed of three segments, say [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']], then I would want a potential training order to be something like:
'b', 'f', 'i', 'a', 'd', 'h', 'c', 'e', 'g'
How can I accomplish such a task?
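For what it's worth, here is a rough sketch of the kind of custom sampler I have in mind (the class name and the `sequence_ids` list are my own placeholders). I am guessing I would need to subclass `Trainer` and override `get_train_dataloader` or `_get_train_sampler` to actually plug it in, but I am not sure that is the intended way:

```python
import random
from collections import defaultdict

from torch.utils.data import Sampler


class RoundRobinSequenceSampler(Sampler):
    """Yield one segment per original sequence before any sequence is revisited.

    `sequence_ids[i]` identifies which original sequence the i-th training
    sample (segment) was cut from.
    """

    def __init__(self, sequence_ids, seed=None):
        self.seed = seed
        # Group dataset indices by the sequence they belong to.
        self.groups = defaultdict(list)
        for idx, seq_id in enumerate(sequence_ids):
            self.groups[seq_id].append(idx)
        # One epoch ends once every segment of the longest sequence has been seen.
        self.num_rounds = max(len(g) for g in self.groups.values())

    def __iter__(self):
        rng = random.Random(self.seed)
        # Shuffle the segments within each sequence independently.
        shuffled = {k: rng.sample(g, len(g)) for k, g in self.groups.items()}
        order = []
        for r in range(self.num_rounds):
            # Take one segment from every sequence; shorter sequences wrap
            # around and therefore repeat some of their segments.
            this_round = [segs[r % len(segs)] for segs in shuffled.values()]
            rng.shuffle(this_round)
            order.extend(this_round)
        return iter(order)

    def __len__(self):
        return self.num_rounds * len(self.groups)


# For the toy example above ([['a','b','c'], ['d','e','f'], ['g','h','i']])
# the sequence ids would be [0, 0, 0, 1, 1, 1, 2, 2, 2], and iterating the
# sampler produces an ordering like the one I described.
```

Is subclassing `Trainer` and returning a `DataLoader` built with this sampler the right approach, or is there a more idiomatic hook in the library for controlling the training order like this?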