model.generate produces outputs that are way too long

Hi, I trained a SpeechT5 model from scratch on an ASR task and now want to use it for inference with model.generate, but the generated ids are much longer than the labeled output. When I computed metrics during training with compute_metrics, the length of the text predicted by the model was nearly always the same as the labeled output, which worked fine. With model.generate, however, I only get outputs that are far too long. Here is an example from model.generate:

predicted output: [‘tempo:120 s1:f0 s2:f3 s3:f2 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f2 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f3 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 
480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f2 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f3 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f3 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f3 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f3 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f3 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f3 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f0 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f3 480 s1:f0 s2:f1 s3:f0 s4:f0 s5:f3 480 s1:f0 s2:f1 s3:f0 s4:f0 s5:f3 480 s1:f0 s2:f1 s3:f0 s4:f0 s5:f3 480 s1:f0 s2:f0 s3:f0 s4:f0 s5:f3 480’],

labeled output: tempo:120 s2:f3 s3:f2 s4:f0 960 s4:f3 480 s4:f4 480 s2:f3 s3:f2 s4:f0 1920 s2:f0 s3:f0 s4:f0 3840 s6:f3 480 s4:f0 480 s3:f0 480 s5:f0 960 s3:f0 480 s5:f2 480 s3:f0 480 s5:f3 480 s3:f0 480 s3:f0 480 s5:f2 960 s3:f0 480 s6:f3 480 s3:f0 480 s3:f2 s4:f0 960 s3:f2 s4:f0 480 s3:f4 s4:f0 480 rest 480 s3:f2 s4:f0 480 rest 480 s3:f4 s4:f0 480

As you can see, the text predicted by the model is much longer than the output it should have predicted (the labeled output), even though during training the lengths nearly always matched. This is the code I use for inference:

import torch
from datasets import Audio, load_dataset
from transformers import (
    SpeechT5FeatureExtractor,
    SpeechT5ForSpeechToText,
    SpeechT5Processor,
    SpeechT5Tokenizer,
)

tokenizer = SpeechT5Tokenizer(vocab_file=r"/home/ec2-user/SageMaker/Tokenizer/word_nur_Zahlen.model")

feature_extractor = SpeechT5FeatureExtractor(return_attention_mask=False, do_normalize=True)

processor = SpeechT5Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

test_dataset = load_dataset("audiofolder", data_dir=r"/home/ec2-user/SageMaker/Datasets/Dataset Folder 2 Splitted/Test")
test_dataset = test_dataset["train"]
print("Test Data loaded:", test_dataset)

test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=16000))
print("Test Data features:", test_dataset.features)

model = SpeechT5ForSpeechToText.from_pretrained(r"/home/ec2-user/SageMaker/SpeechT5/Model 3 Word 10Sek 16Ep/Model Save")

print(model.can_generate())

input_ids = processor(audio=test_dataset["audio"][0]["array"], sampling_rate=16000, return_tensors="pt")

print(len(test_dataset["audio"][0]["array"]))

print(input_ids)

predicted_ids = model.generate(**input_ids, max_length=500)

print(predicted_ids)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print(transcription)

print(test_dataset[0]["transcription"])
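My suspicion is that the decoder never emits (or generate never stops at) an end-of-sequence token, so decoding just runs until max_length. I could also try passing eos_token_id=tokenizer.eos_token_id explicitly to model.generate. As a workaround I can at least truncate the predicted ids at the first EOS myself; the eos_id value below is just a placeholder, not my real one:

```python
def truncate_at_eos(ids, eos_id):
    """Keep tokens up to and including the first EOS; drop everything after."""
    out = []
    for t in ids:
        out.append(t)
        if t == eos_id:
            break
    return out

# placeholder eos_id=2; in my setup it would be tokenizer.eos_token_id
print(truncate_at_eos([5, 9, 4, 2, 7, 7, 7], eos_id=2))  # -> [5, 9, 4, 2]
```

If the truncated sequence matches the labeled output much better, that would confirm the stopping criterion (and not the model itself) is the problem.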

Any help would be really appreciated!