I am working on an ASR project, where I use a model from HuggingFace (wav2vec2). My goal for now is to move the training process to PyTorch, so I am trying to recreate everything that HuggingFace's Trainer() class offers.
One of these utilities is the ability to group batches by length and combine it with dynamic padding (via a data collator). To be honest, however, I am not sure how to even begin implementing this in PyTorch.
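For reference, this is roughly what I mean by dynamic padding: a collate_fn that pads every waveform in a batch to the length of the longest one in that same batch. The "input_values"/"attention_mask" naming follows wav2vec2's usual inputs, but the structure of my dataset items here is just an assumption:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    # Assumes each dataset item is a dict holding a 1-D waveform
    # under "input_values" (this key is an assumption on my part).
    waveforms = [torch.as_tensor(item["input_values"], dtype=torch.float32)
                 for item in batch]
    lengths = torch.tensor([w.size(0) for w in waveforms])

    # Pad to the longest waveform in *this* batch only, so the amount
    # of padding is decided per batch (dynamic) rather than globally.
    padded = pad_sequence(waveforms, batch_first=True, padding_value=0.0)

    # Attention mask: 1 for real samples, 0 for padding.
    mask = torch.arange(padded.size(1))[None, :] < lengths[:, None]
    return {"input_values": padded, "attention_mask": mask.long()}
```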
The inputs in my case are 1-D arrays that represent the raw waveform of a .wav file, so before training I need to ensure that arrays of similar size are batched together. Do I need to create a custom DataLoader class and alter it so that each batch it yields contains samples whose lengths are as close as possible?
An idea I had was to somehow sort the data from shortest to longest (or the opposite) and, each time, extract batch_size samples from them. This way, the first batch would consist of the longest samples, the second batch of the second-longest, and so on.
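To make this concrete, here is a minimal sketch of a batch sampler along those lines, assuming I can precompute one length per dataset item up front (the class name and details are my own invention, not an existing PyTorch utility):

```python
import random
from torch.utils.data import Sampler

class LengthGroupedBatchSampler(Sampler):
    """Sort indices by waveform length, chunk them into batches,
    then shuffle the order of the batches each epoch."""

    def __init__(self, lengths, batch_size, shuffle=True):
        self.lengths = lengths        # one precomputed length per item
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        # Indices sorted from longest waveform to shortest.
        order = sorted(range(len(self.lengths)),
                       key=lambda i: self.lengths[i], reverse=True)
        batches = [order[i:i + self.batch_size]
                   for i in range(0, len(order), self.batch_size)]
        if self.shuffle:
            # Each batch stays homogeneous in length; only the order of
            # batches varies, so training still sees some randomness.
            random.shuffle(batches)
        yield from batches

    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size
```

I would then combine it with the collator above via DataLoader(dataset, batch_sampler=LengthGroupedBatchSampler(lengths, batch_size), collate_fn=collate_fn).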
Nevertheless, I am not sure whether this is the right way to approach the implementation, or whether there is a more idiomatic solution. Any advice will be greatly appreciated.
Thanks in advance.