Long audio input for training?

I’m using Whisper for ASR training. Our ASR system needs to take long audio inputs (more than 30 seconds) during training. I tried your ASR pipeline and found it useful for inference, but I could not find anything related to training, such as an output that exposes a loss. How can we use the Whisper model in a training process with long audio, even if we freeze its layers?
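
To make the question concrete, here is a minimal sketch of one workaround that might apply: splitting a long recording into windows of at most 30 seconds before feeding each one to the (possibly frozen) model. It assumes 16 kHz audio and Whisper's 30-second input window; the function name `chunk_spans` and the `overlap_s` parameter are hypothetical, for illustration only, and are not part of the pipeline or any library.

```python
def chunk_spans(num_samples, sample_rate=16000, chunk_s=30.0, overlap_s=0.0):
    """Return (start, end) sample indices that cover the audio in
    windows of at most chunk_s seconds (hypothetical helper)."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk_s must be positive and larger than overlap_s")

    spans = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        spans.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += step
    return spans


# Example: a 70-second recording at 16 kHz, no overlap.
spans = chunk_spans(16000 * 70)
print(spans)
```

Each span would still need to be padded to 30 seconds and paired with the matching part of the transcript, which this sketch does not address; that alignment is part of what I am unsure how to handle in training.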