Customizing the ordering of training samples

I am using BERT to classify text that (in many cases) greatly exceeds 512 tokens. What I am doing is splitting the text into segments of 512 tokens and using those as the training samples instead. However, segments belonging to the same sequence will of course be quite similar to one another, which introduces a bias towards longer sequences. And assuming that the huggingface library randomly samples from the training set, segments from the longer sequences are more likely to be chosen.

What I want to do is override the ordering so that one segment from each sequence is used before another segment is sampled from a sequence that has already been used. For shorter sequences, it is okay if the same segment needs to be sampled multiple times.

For example, if I have three sequences, each composed of three segments, say: [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']], then I would want a potential training order to be something like:
'b', 'f', 'i', 'a', 'd', 'h', 'c', 'e', 'g'
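
Concretely, something like this plain-Python sketch (using the toy segments above) produces the kind of order I am after; for sequences of unequal length I would re-sample from the shorter ones:

```python
import random

# Toy data: three sequences, each split into three segments
sequences = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]

# Shuffle the segments within each sequence, then interleave them so that
# one segment from every sequence is used before any sequence repeats
shuffled = [random.sample(seq, len(seq)) for seq in sequences]
order = [seg for round_ in zip(*shuffled) for seg in round_]
print(order)  # e.g. ['b', 'f', 'i', 'a', 'd', 'h', 'c', 'e', 'g']
```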

How can I accomplish such a task?

Hi,

What you can do is the same as what is explained in this tutorial: use a sliding window approach. This means that you create multiple training examples for a given text by sliding a window (with some overlap) across the text, and label each of those examples with the label of the text. In this way, you get multiple training examples per text.

You just need to pass return_overflowing_tokens=True when calling the tokenizer.
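
For example, with a fast tokenizer the call could look roughly like this (the model name, texts and labels are placeholders; stride controls the overlap between windows):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encodings = tokenizer(
    texts,                           # list of full-length documents
    max_length=512,
    truncation=True,
    stride=128,                      # overlap between consecutive windows
    return_overflowing_tokens=True,  # emit one chunk per window
    padding="max_length",
)

# With a fast tokenizer, this maps each chunk back to its source text,
# so you can copy the document-level label onto every chunk
sample_map = encodings["overflow_to_sample_mapping"]
labels_per_chunk = [labels[i] for i in sample_map]
```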

Next, you can create a standard PyTorch dataloader with shuffle=True. This will randomly shuffle all training examples, whether they come from the same text or not.
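
A minimal sketch of that step, reusing the (hypothetical) encodings and labels_per_chunk from above:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ChunkDataset(Dataset):
    """Wraps the chunk-level encodings and labels."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Skip the sample mapping, which is only needed to assign labels
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()
                if k != "overflow_to_sample_mapping"}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_dataset = ChunkDataset(encodings, labels_per_chunk)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
```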