I am trying to do voice emotion conversion by fine-tuning a SpeechT5 voice conversion model. I'm getting an error from a custom DataCollator I defined (input_values are the audio inputs processed with the SpeechT5 processor, labels are the spectrogram targets).
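For context, each example is prepared roughly like this (a minimal sketch: prepare_dataset, the checkpoint name, and the column names are illustrative, but input_values and labels do come out of the SpeechT5 processor):

from transformers import SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")

def prepare_dataset(example):
    # source waveform -> input_values; target waveform -> log-mel spectrogram labels
    processed = processor(
        audio=example["source_audio"]["array"],
        audio_target=example["target_audio"]["array"],
        sampling_rate=16000,
    )
    # the processor returns batched outputs, so strip the batch dimension
    processed["input_values"] = processed["input_values"][0]
    processed["labels"] = processed["labels"][0]
    # precomputed x-vector for the target speaker/emotion
    processed["speaker_embeddings"] = example["speaker_embeddings"]
    return processed

This is the traceback: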
Cell In [18], line 19, in DataCollatorWithPadding.__call__(self, features)
16 speaker_features = [feature["speaker_embeddings"] for feature in features]
18 # collate the inputs and targets into a batch
---> 19 batch = processor.pad(
20 input_values=input_values,
21 labels=labels,
22 return_tensors="pt",
23 padding=True
24 )
26 # replace padding with -100 to ignore loss correctly
27 batch["labels"] = batch["labels"].masked_fill(
28 batch.decoder_attention_mask.unsqueeze(-1).ne(1), -100
29 )
File /usr/local/lib/python3.9/dist-packages/transformers/models/speecht5/processing_speecht5.py:139, in SpeechT5Processor.pad(self, *args, **kwargs)
134 raise ValueError(
135 "You need to specify either an `input_values`, `input_ids`, or `labels` input to be padded."
136 )
138 if input_values is not None:
--> 139 inputs = self.feature_extractor.pad(input_values, *args, **kwargs)
140 elif input_ids is not None:
141 inputs = self.tokenizer.pad(input_ids, **kwargs)
File /usr/local/lib/python3.9/dist-packages/transformers/feature_extraction_sequence_utils.py:224, in SequenceFeatureExtractor.pad(self, processed_features, padding, max_length, truncation, pad_to_multiple_of, return_attention_mask, return_tensors)
221 value = value.astype(np.float32)
222 batch_outputs[key].append(value)
--> 224 return BatchFeature(batch_outputs, tensor_type=return_tensors)
File /usr/local/lib/python3.9/dist-packages/transformers/feature_extraction_utils.py:78, in BatchFeature.__init__(self, data, tensor_type)
76 def __init__(self, data: Optional[Dict[str, Any]] = None, tensor_type: Union[None, str, TensorType] = None):
77 super().__init__(data)
---> 78 self.convert_to_tensors(tensor_type=tensor_type)
File /usr/local/lib/python3.9/dist-packages/transformers/feature_extraction_utils.py:181, in BatchFeature.convert_to_tensors(self, tensor_type)
179 if key == "overflowing_values":
180 raise ValueError("Unable to create tensor returning overflowing values of different lengths. ")
--> 181 raise ValueError(
182 "Unable to create tensor, you should probably activate padding "
183 "with 'padding=True' to have batched tensors with the same length."
184 )
186 return self
ValueError: Unable to create tensor, you should probably activate padding with 'padding=True' to have batched tensors with the same length.
Here is the relevant code snippet throwing the error:
input_values = [{"input_values": feature["input_values"]} for feature in features]
labels = [{"labels": feature["labels"]} for feature in features]
speaker_features = [feature["speaker_embeddings"] for feature in features]
# collate the inputs and targets into a batch - error occurs here
batch = processor.pad(
input_values=input_values,
labels=labels,
return_tensors="pt",
padding=True
)
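In case the calling context matters, the collator is invoked on a list of mapped examples along these lines (again a sketch; raw_dataset and the variable names are illustrative):

processed_dataset = raw_dataset.map(prepare_dataset, remove_columns=raw_dataset.column_names)

data_collator = DataCollatorWithPadding()  # my custom collator from above
batch = data_collator([processed_dataset[0], processed_dataset[1]])  # the ValueError surfaces here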
Does anyone have an idea why this error might be happening?