RuntimeError: The size of tensor a (553) must match the size of tensor b (448) at non-singleton dimension 1

I was fine-tuning whisper-small on tarteel-ai/everyayah, following the excellent Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers blog. The blog uses Common Voice 11.0, but I changed the dataset and took all the necessary steps.

For example:

from datasets import Audio

everyayah = everyayah.cast_column("audio", Audio(sampling_rate=16000))

def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

everyayah = everyayah.map(prepare_dataset, remove_columns=everyayah.column_names["train"])
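
For completeness, the feature_extractor, tokenizer, and processor used in prepare_dataset come from the blog setup, roughly like this (assuming Arabic for the language/task, since that matches the dataset):

from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor

# checkpoint and language/task reflect my setup (whisper-small, Arabic transcription)
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Arabic", task="transcribe")
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Arabic", task="transcribe")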

Here I print a sample from the dataset:

print(everyayah['train'][0])

OUTPUT:

{'audio': {'path': None, 'array': array([ 0.        ,  0.        ,  0.        , ..., -0.00057983,
       -0.00085449, -0.00061035]), 'sampling_rate': 16000}, 'sentence': 'بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ'}

As you can see, the sampling_rate is already 16000, but I still took this step:

from datasets import Audio
everyayah = everyayah.cast_column("audio", Audio(sampling_rate=16000))

OUTPUT after taking the above step:

{'audio': {'path': None, 'array': array([ 0.        ,  0.        ,  0.        , ..., -0.00057983,
       -0.00085449, -0.00061035]), 'sampling_rate': 16000}, 'sentence': 'بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ'}
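
A broader check over a few samples (run on the dataset before mapping, the same way as the prints above) would be something like this sketch:

# spot-check the sampling rate of the first few examples; every value should be 16000
for i in range(5):
    print(everyayah["train"][i]["audio"]["sampling_rate"])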

My training part is:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-ar-v2",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)
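
For completeness, compute_metrics is also taken from the blog; it returns the "wer" key that metric_for_best_model refers to. Roughly (from memory):

import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id so the labels can be decoded
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # decode predictions and references, then compute word error rate
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}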

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=everyayah["train"],
    eval_dataset=everyayah["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
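
And data_collator is the padding collator from the blog, roughly this (a sketch from memory, built on the processor and model defined earlier):

import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # pad the log-Mel input features into a batch tensor
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # pad the tokenized label sequences to the longest in the batch
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 so it is ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # drop the leading BOS token if it was added during tokenization; the model adds it back itself
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)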

I think I have an issue with padding or truncation.

ERROR:

RuntimeError Traceback (most recent call last)
Cell In[40], line 1
----> 1 trainer.train()

File N:\NOUMAN\Finetune Whisper\WhisperEnv\lib\site-packages\transformers\trainer.py:1828, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1825 try:
   1826     # Disable progress bars when uploading models during checkpoints to avoid polluting stdout
   1827     hf_hub_utils.disable_progress_bars()
-> 1828     return inner_training_loop(
   1829         args=args,
   1830         resume_from_checkpoint=resume_from_checkpoint,
   1831         trial=trial,
   1832         ignore_keys_for_eval=ignore_keys_for_eval,
   1833     )
   1834 finally:
   1835     hf_hub_utils.enable_progress_bars()
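
One thing I notice is that 448 matches Whisper's decoder limit (model.config.max_target_positions), so maybe some of my tokenized labels are longer than that? A quick check would be something like this sketch, reusing the mapped everyayah from above:

# count how many tokenized transcripts exceed the decoder's maximum target length
max_label_length = model.config.max_target_positions  # 448 for whisper-small

def label_fits(labels):
    return len(labels) <= max_label_length

kept = everyayah["train"].filter(label_fits, input_columns=["labels"])
print(f"{len(everyayah['train']) - len(kept)} of {len(everyayah['train'])} training examples have labels longer than {max_label_length}")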

For a quick check, try with batch_size = 1.
If it runs, then the issue is likely due to inconsistent padding within batches.
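
Something along these lines should be enough for the check (a sketch reusing the objects you already defined; the output_dir name is just an example):

debug_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-ar-v2-debug",  # throwaway directory, name is just an example
    per_device_train_batch_size=1,  # one example per batch, so no intra-batch padding
    per_device_eval_batch_size=1,
    max_steps=50,                   # a handful of steps is enough to see if the error goes away
    fp16=True,
    report_to=[],
)

debug_trainer = Seq2SeqTrainer(
    args=debug_args,
    model=model,
    train_dataset=everyayah["train"],
    eval_dataset=everyayah["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
debug_trainer.train()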

Ok, I will give it a try!
Thank you