Hi!
I am currently trying to train a Speech2TextModel from scratch, but I can’t seem to find a complete example of how to do this.
I’ve started working through it myself, but it is turning into a trial & error kind of thing. For example, I don’t know how to create a Speech2TextTokenizer. I have my spm_file, but what exactly should the vocab_file look like? Do I generate it from the SentencePieceProcessor? Why can’t I use the vocab file created by sentencepiece? And so on…
Is there a comprehensive guide I overlooked?
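For reference, here is a minimal sketch of one way the vocab_file could be produced. It assumes the tokenizer expects vocab_file to be a JSON mapping from token string to integer id, and that the special tokens should sit at ids 0 to 3 in the order bos, pad, eos, unk; both points are my reading of the library, not something confirmed in this thread. The helper names (build_vocab, pieces_from_spm, write_vocab) are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List

# Assumed special-token order: <s>=0, <pad>=1, </s>=2, <unk>=3.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>"]


def build_vocab(pieces: Iterable[str]) -> Dict[str, int]:
    """Map every token to a unique id, special tokens first, no duplicates."""
    vocab: Dict[str, int] = {}
    for token in list(SPECIAL_TOKENS) + list(pieces):
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def pieces_from_spm(spm_file: str) -> List[str]:
    """Read all pieces from a trained SentencePiece model (needs sentencepiece)."""
    import sentencepiece as spm  # imported lazily so the rest runs without it

    sp = spm.SentencePieceProcessor(model_file=spm_file)
    return [sp.id_to_piece(i) for i in range(sp.get_piece_size())]


def write_vocab(vocab: Dict[str, int], path: str) -> None:
    """Write the token-to-id mapping as UTF-8 JSON."""
    Path(path).write_text(
        json.dumps(vocab, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

With a trained model, usage would look like write_vocab(build_vocab(pieces_from_spm(spm_file)), "vocab.json"), and the resulting file would then be passed as vocab_file alongside the spm_file.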
Update:
I was able to proceed a bit further but only via debugging the code and some trial & error.
I am able to start the training now, but I am not sure whether I am using all the right pieces here. I don’t know whether I need to use the Trainer or the Seq2SeqTrainer. My biggest problem is the data_collator, as I have no clue what it’s supposed to return.
Right now I am returning the following:
from dataclasses import dataclass
from typing import Dict, List, Union

import torch
from torch.nn.utils.rnn import pad_sequence
from transformers import Speech2TextProcessor


@dataclass
class Speech2TextCollator:
    def __init__(self, processor: Speech2TextProcessor):
        self.processor = processor

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        inputs = [torch.Tensor(f["inputs"]) for f in features]
        targets = [torch.Tensor(f["targets"]) for f in features]

        # Create batches
        inputs_batch = pad_sequence(inputs, batch_first=True)
        targets_batch = pad_sequence(targets, batch_first=True).long()
        attention_mask = pad_sequence([f["attention_mask"] for f in features], batch_first=True).long()

        return dict(
            input_features=inputs_batch,
            decoder_input_ids=targets_batch,
            attention_mask=attention_mask,
            labels=targets_batch
        )
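For comparison, here is a hedged sketch of what a collator for this setup might return instead. It assumes each feature has "inputs" of shape (frames, feature_dim) and "targets" as a list of token ids, as in the code above. It also assumes the sequence-to-sequence model builds decoder_input_ids itself by shifting labels to the right, so only labels are passed, and that label padding should be -100 so padded positions are ignored by the loss. The function name collate_speech2text is hypothetical.

```python
from typing import Dict, List

import torch
from torch.nn.utils.rnn import pad_sequence


def collate_speech2text(features: List[Dict[str, list]]) -> Dict[str, torch.Tensor]:
    """Pad a list of examples into one batch of tensors."""
    inputs = [torch.tensor(f["inputs"], dtype=torch.float32) for f in features]
    labels = [torch.tensor(f["targets"], dtype=torch.long) for f in features]

    # Pad the audio features along the time axis with zeros: (batch, max_frames, dim).
    input_features = pad_sequence(inputs, batch_first=True, padding_value=0.0)

    # 1 marks a real frame, 0 marks padding: (batch, max_frames).
    attention_mask = pad_sequence(
        [torch.ones(x.shape[0], dtype=torch.long) for x in inputs],
        batch_first=True,
        padding_value=0,
    )

    # Pad labels with -100 (assumed ignore index of the loss): (batch, max_target_len).
    labels_batch = pad_sequence(labels, batch_first=True, padding_value=-100)

    # decoder_input_ids is deliberately omitted; see the assumption in the lead-in.
    return {
        "input_features": input_features,
        "attention_mask": attention_mask,
        "labels": labels_batch,
    }
```

The main differences from the version above are that labels and decoder inputs are no longer the same unshifted tensor, and that padded label positions no longer count as token id 0.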
Depending on whether I set label_smoothing_factor=1 for the TrainingArguments, I get either a KeyError: 'logits' or a KeyError: 'loss'.
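Two things may be worth checking here. First, as far as I know the bare Speech2TextModel returns hidden states only, while Speech2TextForConditionalGeneration adds the language-modeling head that produces logits and, when labels are passed, a loss; that would be consistent with both key errors. Second, label_smoothing_factor is normally a small value such as 0.1. The pure-Python sketch below shows the mix I understand label smoothing to compute, (1 - epsilon) * nll + epsilon * uniform term; the function names are hypothetical and this is not the Trainer's actual code. With epsilon=1 the target token no longer influences the loss at all.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def label_smoothed_loss(logits: List[float], target: int, epsilon: float) -> float:
    """Mix the usual negative log-likelihood with a uniform-target term."""
    log_probs = log_softmax(logits)
    nll = -log_probs[target]                     # standard cross-entropy term
    uniform = -sum(log_probs) / len(log_probs)   # cross-entropy against a uniform target
    return (1.0 - epsilon) * nll + epsilon * uniform
```

For example, label_smoothed_loss(logits, target, 0.0) reduces to plain cross-entropy, whereas with epsilon set to 1.0 every choice of target gives the same value.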
Can somebody help me out here?