I am working on an ASR project, where I use a model from HuggingFace (wav2vec2). My goal for now is to move the training process to PyTorch, so I am trying to recreate everything that HuggingFace's Trainer() class offers.
One of these utilities is the ability to group batches by length and combine it with dynamic padding (via a data collator). To be honest, however, I am not sure how to even begin implementing this in PyTorch.
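For reference, this is roughly what I mean by dynamic padding: a collate_fn that pads every waveform in a batch to the length of the longest one in that same batch. The "input_values"/"attention_mask" naming follows wav2vec2's usual inputs, but the structure of my dataset items here is just an assumption:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    # Assumes each dataset item is a dict holding a 1-D waveform
    # under "input_values" (this key is an assumption on my part).
    waveforms = [torch.as_tensor(item["input_values"], dtype=torch.float32)
                 for item in batch]
    lengths = torch.tensor([w.size(0) for w in waveforms])

    # Pad to the longest waveform in *this* batch only, so the amount
    # of padding is decided per batch (dynamic) rather than globally.
    padded = pad_sequence(waveforms, batch_first=True, padding_value=0.0)

    # Attention mask: 1 for real samples, 0 for padding.
    mask = torch.arange(padded.size(1))[None, :] < lengths[:, None]
    return {"input_values": padded, "attention_mask": mask.long()}
```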
The inputs in my case are 1-D arrays that represent the raw waveform of a .wav file, so before training I need to ensure that arrays of similar size are batched together. Do I need to create a custom DataLoader class and alter it so that each batch it yields contains samples whose lengths are as close as possible?
An idea I had was to somehow sort the data from shortest to longest (or the opposite) and, each time, extract batch_size samples from them. This way, the first batch would consist of the longest samples, the second batch of the second-longest, and so on.
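To make this concrete, here is a minimal sketch of a batch sampler along those lines, assuming I can precompute one length per dataset item up front (the class name and details are my own invention, not an existing PyTorch utility):

```python
import random
from torch.utils.data import Sampler

class LengthGroupedBatchSampler(Sampler):
    """Sort indices by waveform length, chunk them into batches,
    then shuffle the order of the batches each epoch."""

    def __init__(self, lengths, batch_size, shuffle=True):
        self.lengths = lengths        # one precomputed length per item
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        # Indices sorted from longest waveform to shortest.
        order = sorted(range(len(self.lengths)),
                       key=lambda i: self.lengths[i], reverse=True)
        batches = [order[i:i + self.batch_size]
                   for i in range(0, len(order), self.batch_size)]
        if self.shuffle:
            # Each batch stays homogeneous in length; only the order of
            # batches varies, so training still sees some randomness.
            random.shuffle(batches)
        yield from batches

    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size
```

I would then combine it with the collator above via DataLoader(dataset, batch_sampler=LengthGroupedBatchSampler(lengths, batch_size), collate_fn=collate_fn).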
Nevertheless, I am not sure whether this is the right way to approach the implementation, or whether there is a more idiomatic solution. Any advice will be greatly appreciated.
Thanks in advance.