How to implement Trainer's 'group_by_length' in PyTorch?

I am working on an ASR project, where I use a model from HuggingFace (wav2vec2). My goal for now is to move the training process to PyTorch, so I am trying to recreate everything that HuggingFace’s Trainer() class offers.

One of these utilities is the ability to group batches by length and combine this with dynamic padding (via a data collator). To be honest, however, I am not sure how to even begin implementing this in PyTorch.

The inputs in my case are 1-D arrays that represent the raw waveform of a .wav file. So before training I need to ensure that arrays of similar size are batched together. Do I need to create a custom DataLoader class and alter it, so that each batch it returns contains samples of lengths as close as possible?

An idea I had was to sort the data from shortest to longest (or the reverse) and then extract batch_size samples at a time. This way, the first batch would consist of the samples with the largest lengths, the second batch of the next largest, and so on.

Nevertheless, I am not sure how to approach this implementation. Any advice will be greatly appreciated.
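
To make the idea concrete, here is a rough, untested sketch of what I have in mind. It assumes `lengths` is a precomputed list with one entry per sample, and `train_dataset` / `data_collator` are placeholders for my dataset and padding collator:

import random
from torch.utils.data import Sampler

class SortedBatchSampler(Sampler):
    """Yields batches of indices whose samples have similar lengths."""

    def __init__(self, lengths, batch_size, shuffle_batches=True):
        # Sort sample indices from longest to shortest waveform.
        sorted_indices = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
        # Slice the sorted indices into consecutive batches of similar length.
        self.batches = [
            sorted_indices[i:i + batch_size]
            for i in range(0, len(sorted_indices), batch_size)
        ]
        self.shuffle_batches = shuffle_batches

    def __iter__(self):
        batches = list(self.batches)
        if self.shuffle_batches:
            # Shuffle the order of the batches so training does not always
            # see the longest samples first, while each batch stays homogeneous.
            random.shuffle(batches)
        return iter(batches)

    def __len__(self):
        return len(self.batches)

# Would be passed as batch_sampler (not sampler), so batch_size/shuffle
# must then be omitted from the DataLoader:
# loader = DataLoader(train_dataset,
#                     batch_sampler=SortedBatchSampler(lengths, 8),
#                     collate_fn=data_collator)

Is something along these lines the right direction?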

Thanks in advance.

I know it’s a very late reply, but I wanted to implement the same thing, and this is how I did it:

from torch.utils.data import DataLoader
from transformers.trainer_pt_utils import LengthGroupedSampler

# LengthGroupedSampler reorders the sample indices so that samples of
# similar length end up in the same batch.
train_sampler = LengthGroupedSampler(
    dataset=dataset['train'],
    batch_size=batch_size,
)

train_loader = DataLoader(
    dataset['train'],
    collate_fn=data_collator,  # dynamic padding within each batch
    sampler=train_sampler,
    batch_size=batch_size,
)
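
One caveat: if your dataset entries don't expose an `input_ids` key (for wav2vec2 the audio typically lives under `input_values`), the sampler can't infer the lengths on its own, so you may need to pass precomputed lengths (or set `model_input_name` accordingly). A sketch, assuming `dataset['train']['input_values']` holds the raw waveforms:

# Precompute one length per training sample so the sampler does not
# have to look up the default "input_ids" key in the dataset.
lengths = [len(x) for x in dataset['train']['input_values']]

train_sampler = LengthGroupedSampler(
    dataset=dataset['train'],
    batch_size=batch_size,
    lengths=lengths,
)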

Hope someone finds it helpful.
