Audio Course. Data collator for ASR fine-tuning

The Unit 5 (ASR) lesson "Fine-tuning the ASR model" is quite similar to the blog post on fine-tuning Whisper.

But there is a difference in the DataCollatorSpeechSeq2SeqWithPadding definition:


import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths
        # and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [
            {"input_features": feature["input_features"][0]} for feature in features
        ]
        # ... I have omitted the rest of __call__ here; as far as I can tell it
        # is identical in the course and the blog.
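
For comparison, the corresponding lines in the blog version, as I read them, look like this (the same list comprehension, but without the [0]):

# blog version of the same lines, quoted from memory: no [0] at the end
input_features = [
    {"input_features": feature["input_features"]} for feature in features
]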

So the course version has [0] at the end of the list comprehension, while the blog version does not, and I wonder which one is correct. (I have trouble with both: one does not converge and the other fails, so it would help to know which one to fix.)
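In case it helps with the diagnosis, here is the quick shape check I ran on my side. It is only a sketch: processor and the mapped dataset common_voice are the names from my notebook (following the course) and may differ in yours.

import numpy as np

# Probe what prepare_dataset stored for one example (common_voice is the
# dataset already mapped with prepare_dataset in my notebook).
sample = common_voice["train"][0]
feats = np.asarray(sample["input_features"])
print(feats.shape)
# If this prints something like (1, 80, 3000), the [0] in the collator just
# strips a leading batch dimension; if it prints (80, 3000), the [0] would
# instead slice off a mel channel, which might explain why one of the two
# versions misbehaves for me.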

I would also be grateful for an explanation of what the arguments of this function are at runtime, i.e. what are feature and features in "feature in features"?
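
For context, this is how I tried to inspect those arguments myself by mimicking what I assume the Trainer does (again only a sketch; common_voice is the mapped dataset from my notebook), but I would like to confirm this matches what actually happens at runtime:

import numpy as np

# Mimic what (I assume) the Trainer passes to the collator: a list of rows
# from the mapped dataset, one dict per example.
features = [common_voice["train"][i] for i in range(2)]

print(type(features), len(features))  # <class 'list'> 2
print(features[0].keys())             # e.g. dict_keys(['input_features', 'labels', 'input_length'])

# So each `feature` in "for feature in features" would be one of these dicts:
print(np.asarray(features[0]["input_features"]).shape)
print(len(features[0]["labels"]))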