Hi @sanchit-gandhi , thanks so much for the reply!!
First, I think there are two modifications needed in your function:
```python
# compute input length
batch["input_length"] = len(batch["audio"]["array"])
```
and
```python
return 0 < labels_length
```
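For context, after those two changes my preprocessing looks roughly like this (just a sketch to show where the pieces go; the checkpoint, the `sentence` column name, and the commented-out `map`/`filter` calls are my own assumptions following the usual Whisper fine-tuning setup):

```python
from transformers import WhisperFeatureExtractor, WhisperTokenizer

# checkpoint is just an example on my side
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="en", task="transcribe"
)

def prepare_dataset(batch):
    audio = batch["audio"]
    # log-Mel features for the encoder
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # modification 1: measure length on the raw waveform, in samples
    batch["input_length"] = len(audio["array"])
    # target token ids for the decoder
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    batch["labels_length"] = len(
        tokenizer(batch["sentence"], add_special_tokens=False).input_ids
    )
    return batch

def filter_labels(labels_length):
    # modification 2: drop examples whose label sequence is empty
    return 0 < labels_length

# dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)
# dataset = dataset.filter(filter_labels, input_columns=["labels_length"])
```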
May I ask why `labels_length` is not simply `len(batch["labels"])`, but is instead computed with `add_special_tokens=False`? And why `input_length` is not `len(batch["input_features"])`?
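To make sure I'm asking the right thing, here is the kind of length check I mean (a sketch; `openai/whisper-small` is just an example checkpoint):

```python
import numpy as np
from transformers import WhisperFeatureExtractor, WhisperTokenizer

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="en", task="transcribe"
)

audio = np.zeros(5 * 16000, dtype=np.float32)  # 5 s of dummy audio at 16 kHz
features = feature_extractor(audio, sampling_rate=16000).input_features[0]

print(len(audio))      # number of samples -> varies with the clip
print(features.shape)  # spectrogram is always padded/truncated to the fixed 30 s window
print(len(features))   # first dimension is the Mel bins, so it says nothing about audio length

text = "some transcription"
print(len(tokenizer(text).input_ids))                            # includes the special tokens the tokenizer adds
print(len(tokenizer(text, add_special_tokens=False).input_ids))  # only the text tokens
```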
So I did look into the preprocessing steps, and it turns out this is indeed caused by a sequence length mismatch:
```
Token indices sequence length is longer than the specified maximum sequence length for this model (459 > 448). Running this sequence through the model will result in indexing errors
```
But I thought truncation was supposed to be handled by the feature extractor. Isn't it?
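Related to that: is explicitly truncating the labels at tokenization time the intended fix, i.e. something like the sketch below? (The 448 comes from the warning above, and I'm assuming it matches the model's maximum target length.)

```python
max_label_length = 448  # from the warning above; presumably model.config.max_length for Whisper

def prepare_labels(batch):
    # tokenizer as above; explicitly cap the target ids so they never exceed the model's max length
    batch["labels"] = tokenizer(
        batch["sentence"], truncation=True, max_length=max_label_length
    ).input_ids
    return batch
```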
I also applied the filter as you suggested, but the dataset doesn't seem to contain any empty inputs; instead, the filter removed 7 entries that were too long. However, according to the preprocessing warnings, there should be 10 problematic sequences.
And the same error still occurs even after filtering.
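To pin down the 7 vs. 10 discrepancy, would counting the offending label sequences directly be the right check? Something like this (a sketch; assuming the tokenized targets are stored in a `labels` column):

```python
max_label_length = 448  # from the warning above

# dataset is the processed Dataset after .map(prepare_dataset)
too_long = dataset.filter(
    lambda labels: len(labels) > max_label_length,
    input_columns=["labels"],
)
print(f"{len(too_long)} examples exceed {max_label_length} label tokens")

# and drop them before training
dataset = dataset.filter(
    lambda labels: len(labels) <= max_label_length,
    input_columns=["labels"],
)
```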