Is there a complete Speech2Text example?


I am currently trying to train a Speech2TextModel from scratch but I can’t seem to find a complete example on how to do this.

I’ve started working through it myself, but it has turned into a trial-and-error exercise. For example, I don’t know how to create a Speech2TextTokenizer. I have my spm_file, but what exactly should the vocab_file look like? Do I generate it from the SentencePieceProcessor? Why can’t I pass in the vocab file created by sentencepiece directly?
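To make the vocab_file question concrete, here is a minimal sketch of one way it could be produced. It assumes that Speech2TextTokenizer loads vocab_file as a JSON object mapping each token string to an integer id, and that the special tokens come first in the fairseq-style order `<s>`, `<pad>`, `</s>`, `<unk>`; that order is an assumption, not something confirmed in this thread. The helper names (pieces_from_spm, build_vocab, write_vocab) are made up for illustration.

```python
import json
from typing import Dict, Iterable, List

# Assumed special-token order (fairseq-style); adjust if your setup differs.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>"]


def pieces_from_spm(spm_file: str) -> List[str]:
    """Read every piece from a trained SentencePiece model, in id order."""
    import sentencepiece as spm  # imported lazily so the rest runs without it

    sp = spm.SentencePieceProcessor(model_file=spm_file)
    return [sp.id_to_piece(i) for i in range(sp.get_piece_size())]


def build_vocab(pieces: Iterable[str]) -> Dict[str, int]:
    """Map tokens to consecutive ids: special tokens first, then the pieces.

    Pieces that duplicate a special token (SentencePiece usually defines its
    own <unk>, <s> and </s>) are skipped so every token gets exactly one id.
    """
    vocab: Dict[str, int] = {}
    for token in list(SPECIAL_TOKENS) + list(pieces):
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def write_vocab(vocab: Dict[str, int], vocab_file: str) -> None:
    """Write the token-to-id mapping as JSON."""
    with open(vocab_file, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False, indent=2)


# Hypothetical usage (file names are placeholders):
#   write_vocab(build_vocab(pieces_from_spm("sp.model")), "vocab.json")
#   tokenizer = Speech2TextTokenizer(vocab_file="vocab.json", spm_file="sp.model")
```

If this assumption holds, it would also explain why the .vocab file written by sentencepiece cannot be used directly: that file is tab-separated text of pieces and scores, not a JSON token-to-id mapping.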

Is there a comprehensive guide I overlooked?


I was able to proceed a bit further but only via debugging the code and some trial & error.

I am able to start the training now but I am not sure if I am using all the right pieces here. I don’t know if I need to use the Trainer or the Seq2SeqTrainer. My biggest problem is the data_collator as I have no clue what it’s supposed to return.

Right now I am returning the following:

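from typing import Dict, List, Union

import torch
from torch.nn.utils.rnn import pad_sequence
from transformers import Speech2TextProcessor


class Speech2TextCollator:

    def __init__(self, processor: Speech2TextProcessor):
        self.processor = processor

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Convert each example to a tensor (accepts lists or tensors)
        inputs = [torch.as_tensor(f["inputs"], dtype=torch.float32) for f in features]
        targets = [torch.as_tensor(f["targets"], dtype=torch.long) for f in features]
        masks = [torch.as_tensor(f["attention_mask"], dtype=torch.long) for f in features]
        # Create batches, zero-padded to the longest example
        inputs_batch = pad_sequence(inputs, batch_first=True)
        targets_batch = pad_sequence(targets, batch_first=True)
        attention_mask = pad_sequence(masks, batch_first=True)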
        return dict(

Depending on whether I set label_smoothing_factor=1 in the TrainingArguments, I get either a KeyError: 'logits' or a KeyError: 'loss'.
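For reference, here is a minimal sketch of what a collator's output could look like. It assumes the model is Speech2TextForConditionalGeneration, whose forward accepts input_features, attention_mask and labels, and it assumes each example carries "inputs" (a list of feature frames) and "targets" (a list of token ids) as in the collator above. Label padding uses -100, the value PyTorch's cross-entropy loss ignores by default. To keep the sketch dependency-free it returns nested Python lists; a real collator would wrap each value in torch.tensor before returning.

```python
from typing import Any, Dict, List

IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default


def collate(features: List[Dict[str, Any]]) -> Dict[str, list]:
    """Pad a list of examples to a rectangular batch.

    Returns the three keys Speech2TextForConditionalGeneration.forward
    accepts by name: input_features, attention_mask, labels.
    """
    max_frames = max(len(f["inputs"]) for f in features)
    max_labels = max(len(f["targets"]) for f in features)
    feat_dim = len(features[0]["inputs"][0])

    input_features, attention_mask, labels = [], [], []
    for f in features:
        n_frames = len(f["inputs"])
        pad_frames = max_frames - n_frames
        # Zero-pad the feature frames along the time axis.
        frames = [list(frame) for frame in f["inputs"]]
        frames += [[0.0] * feat_dim for _ in range(pad_frames)]
        input_features.append(frames)
        # 1 marks a real frame, 0 marks padding.
        attention_mask.append([1] * n_frames + [0] * pad_frames)
        # Pad labels with IGNORE_INDEX so padding does not contribute to the loss.
        n_pad_labels = max_labels - len(f["targets"])
        labels.append(list(f["targets"]) + [IGNORE_INDEX] * n_pad_labels)

    return {
        "input_features": input_features,
        "attention_mask": attention_mask,
        "labels": labels,
    }
```

The key names matter because the Trainer passes the returned dict to the model as keyword arguments. As far as I can tell from the Trainer code, it reads outputs["loss"] when label smoothing is off and outputs["logits"] when it is on, so a model that gets no labels, or a base model without a language-modeling head, could raise exactly these KeyErrors.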

Can somebody help me out here?