Wav2Vec2Processor.pad ignores padding=True?

Hello everyone,

I want to fine-tune a Wav2Vec2 model, but I can't get my inputs padded.
One entry of my dataset looks something like this:

{'path': '/share/datasets/voxforge_todo/anonymhatschie-20140526-wth/wav/de3-47.wav',
 'sentence': 'das netzwerkkabel weist die notwendige qualität auf für schnelle datenübertragung',
 'sampling_rate': 16000,
 'speech': array([-0.00012025, -0.00104889, -0.00085019, ..., -0.00066073,
        -0.00037126, -0.00053336], dtype=float32),
 'input_values': tensor([[-0.0019, -0.0166, -0.0134,  ..., -0.0104, -0.0059, -0.0084]]),
 'labels': tensor([[ 58,  54,  24,  37, 151,  51,  18,  43,  52,  51,  47,  61,  61,  54,
           27,  51, 110,  37,  52,  51,  39,  24,  18,  37,  58,  39,  51,  37,
          151,  67,  18,  52,  51, 151,  58,  39,  60,  51,  37,  30, 104,  54,
          110,  39,  18,  17,  18,  37,  54, 104,  19,  37,  19,  23,  47,  37,
           24, 113,  92, 151,  51, 110, 110,  51,  37,  58,  54,  18,  51, 151,
           23,  27,  51,  47,  18,  47,  54,  60, 104, 151,  60]])}

Here speech is the audio file read in with torchaudio.load, sentence is the target transcription, and the tensor fields are produced by:

batch["input_values"] = self.processor(batch["speech"], return_tensors="pt", 
    sampling_rate=16_000, padding=True).input_values
batch["labels"] = self.processor(batch["sentence"], return_tensors="pt", padding=True).input_ids
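To illustrate the shapes involved (a small sketch with dummy data of the same sizes as the entry above, no processor needed): speech is a plain 1-D array, while input_values comes back from the processor with an extra batch dimension because of return_tensors="pt":

```python
import numpy as np
import torch

# Dummy stand-ins with the same shapes as in my dataset entry.
speech = np.random.rand(112000).astype(np.float32)  # 1-D raw audio
input_values = torch.rand([1, 112000])              # 2-D: [1, N] from return_tensors="pt"

print(speech.ndim)         # 1
print(input_values.dim())  # 2
```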

If I use this dataset to fine-tune the maxidl/wav2vec2-large-xlsr-german model, I get the following exception:

ValueError: Unable to create tensor, you should probably activate padding with 'padding=True' to have batched tensors with the same length.

This exception is raised by the following self.processor.pad call within the
DataCollatorCTCWithPadding class, as copied from: https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_Tune_XLSR_Wav2Vec2_on_Turkish_ASR_with_🤗_Transformers.ipynb#scrollTo=lbQf5GuZyQ4_

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods

        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

To reproduce my problem with a minimal amount of code, I did this:

import torch
from transformers import Wav2Vec2Processor
model_name = 'maxidl/wav2vec2-large-xlsr-german'
processor = Wav2Vec2Processor.from_pretrained(model_name)

fake_batch = [{'input_values' : torch.rand([1, 112000])}, {'input_values' : torch.rand([1, 86000])}]
processor.pad(fake_batch, padding=True)

And I still get the following exception, even though I passed padding=True:

ValueError: Unable to create tensor, you should probably activate padding with 'padding=True' to have batched tensors with the same length.
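From the dataset dump above, I suspect the nested [1, N] shape is the problem: each input_values entry is already a batched 2-D tensor, so the padder presumably sees sequences of sequences instead of plain 1-D sequences. Squeezing away the extra dimension before padding would look like this sketch (using torch's pad_sequence as a stand-in for processor.pad, so no model download is needed):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Same shapes as in my repro: each input carries an extra leading batch dim.
fake_batch = [torch.rand([1, 112000]), torch.rand([1, 86000])]

# Squeeze away the batch dimension so each entry is a plain 1-D sequence,
# then pad all entries to the longest length.
padded = pad_sequence([x.squeeze(0) for x in fake_batch], batch_first=True)
print(padded.shape)  # torch.Size([2, 112000])
```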

Maybe you guys can help me :slight_smile:

Ty in advance