Error with DataCollator for SpeechT5

I am trying to do voice emotion conversion by fine-tuning a SpeechT5 voice conversion model. I'm getting this error with a custom DataCollator I defined (input_values are the audio inputs processed with the SpeechT5 processor, and labels are the spectrogram targets):

Cell In [18], line 19, in DataCollatorWithPadding.__call__(self, features)
     16 speaker_features = [feature["speaker_embeddings"] for feature in features]
     18 # collate the inputs and targets into a batch
---> 19 batch = processor.pad(
     20     input_values=input_values,
     21     labels=labels,
     22     return_tensors="pt",
     23     padding=True
     24 )        
     26 # replace padding with -100 to ignore loss correctly
     27 batch["labels"] = batch["labels"].masked_fill(
     28     batch.decoder_attention_mask.unsqueeze(-1).ne(1), -100
     29 )

File /usr/local/lib/python3.9/dist-packages/transformers/models/speecht5/processing_speecht5.py:139, in SpeechT5Processor.pad(self, *args, **kwargs)
    134     raise ValueError(
    135         "You need to specify either an `input_values`, `input_ids`, or `labels` input to be padded."
    136     )
    138 if input_values is not None:
--> 139     inputs = self.feature_extractor.pad(input_values, *args, **kwargs)
    140 elif input_ids is not None:
    141     inputs = self.tokenizer.pad(input_ids, **kwargs)

File /usr/local/lib/python3.9/dist-packages/transformers/feature_extraction_sequence_utils.py:224, in SequenceFeatureExtractor.pad(self, processed_features, padding, max_length, truncation, pad_to_multiple_of, return_attention_mask, return_tensors)
    221             value = value.astype(np.float32)
    222         batch_outputs[key].append(value)
--> 224 return BatchFeature(batch_outputs, tensor_type=return_tensors)

File /usr/local/lib/python3.9/dist-packages/transformers/feature_extraction_utils.py:78, in BatchFeature.__init__(self, data, tensor_type)
     76 def __init__(self, data: Optional[Dict[str, Any]] = None, tensor_type: Union[None, str, TensorType] = None):
     77     super().__init__(data)
---> 78     self.convert_to_tensors(tensor_type=tensor_type)

File /usr/local/lib/python3.9/dist-packages/transformers/feature_extraction_utils.py:181, in BatchFeature.convert_to_tensors(self, tensor_type)
    179         if key == "overflowing_values":
    180             raise ValueError("Unable to create tensor returning overflowing values of different lengths. ")
--> 181         raise ValueError(
    182             "Unable to create tensor, you should probably activate padding "
    183             "with 'padding=True' to have batched tensors with the same length."
    184         )
    186 return self

ValueError: Unable to create tensor, you should probably activate padding with 'padding=True' to have batched tensors with the same length.

Here is the relevant code snippet throwing the error:

        input_values = [{"input_values": feature["input_values"]} for feature in features]
        labels = [{"labels": feature["labels"]} for feature in features]
        speaker_features = [feature["speaker_embeddings"] for feature in features]

        # collate the inputs and targets into a batch - error occurs here
        batch = processor.pad(
            input_values=input_values,
            labels=labels,
            return_tensors="pt",
            padding=True
        )

Would anyone have an idea why this error could be happening?

Hi, I am currently having much the same error when I try to use processor.pad. Have you found a solution to this problem?

Can you provide reproducible code?
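
While putting together a reproducible example, it may help to check the shape of each feature before it reaches processor.pad. One possible cause, which is only a guess and is not confirmed by the traceback, is that some examples still carry a leading batch dimension of size 1 from the preprocessing step. In that case every example looks like it has length 1, no padding is applied, and the tensors cannot be stacked. The sketch below is a rough check under that assumption. The helpers nesting_depth and check_features are hypothetical names written for this post, not part of transformers, and they assume the features are plain Python lists (for NumPy arrays or tensors, printing .shape gives the same information).

```python
def nesting_depth(value):
    """Count how many list/tuple levels wrap the innermost scalars."""
    depth = 0
    while isinstance(value, (list, tuple)) and len(value) > 0:
        depth += 1
        value = value[0]
    return depth


def check_features(features, key="input_values", expected_depth=1):
    """Return (index, depth, outer_length) for each feature with unexpected nesting."""
    problems = []
    for i, feature in enumerate(features):
        depth = nesting_depth(feature[key])
        if depth != expected_depth:
            problems.append((i, depth, len(feature[key])))
    return problems


# Toy data (made up): the second example still has a leading batch dimension.
features = [
    {"input_values": [0.1, 0.2, 0.3]},
    {"input_values": [[0.1, 0.2, 0.3, 0.4, 0.5]]},
]
print(check_features(features))
```

An empty list means every example has the expected nesting. For the spectrogram targets the same check could be run with key="labels" and expected_depth=2 (frames by mel bins), again assuming that layout matches your preprocessing.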