I’m trying to find a way to process longer audio files using SpeechEncoderDecoder. I know CTC models can do this by splitting the audio into chunks and then joining the logits, but I don’t know of any examples where a similar approach is used with transformer LM decoders.
My initial thought is to somehow join the encoder embeddings of the audio chunks and then pass the result as the encoder_outputs value for decoder generation. Initial testing hasn’t been successful, however.
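
To make the idea concrete, here is a rough sketch of the chunk-and-join step in plain Python. It is only an illustration under stated assumptions: `toy_encoder` is a stand-in for the real speech encoder, and `FRAME`, `chunk_spans` and `encode_long` are hypothetical names, not part of any library. The chunks overlap by `stride` samples on each side, and the frames that fall in the overlap are dropped before joining, in the same spirit as the CTC chunking approach.

```python
from typing import List, Tuple

# Assumed encoder downsampling factor: one output frame per FRAME input samples.
FRAME = 320


def toy_encoder(samples: List[float]) -> List[List[float]]:
    """Stand-in for a real speech encoder: one 1-dim 'embedding' per FRAME samples."""
    return [
        [sum(samples[i:i + FRAME]) / FRAME]
        for i in range(0, len(samples) - FRAME + 1, FRAME)
    ]


def chunk_spans(num_samples: int, chunk_len: int, stride: int) -> List[Tuple[int, int, int, int]]:
    """Return (start, end, left, right) per chunk.

    left/right are the overlap sizes (in samples) whose frames are discarded
    after encoding, so the kept regions tile the audio without gaps.
    """
    step = chunk_len - 2 * stride
    assert step > 0, "chunk_len must be larger than 2 * stride"
    spans = []
    start = 0
    while True:
        end = min(start + chunk_len, num_samples)
        is_last = end == num_samples
        left = 0 if start == 0 else stride
        right = 0 if is_last else stride
        spans.append((start, end, left, right))
        if is_last:
            break
        start += step
    return spans


def encode_long(samples: List[float], chunk_len: int, stride: int) -> List[List[float]]:
    """Encode each chunk separately, trim the overlapped frames, and join along time."""
    frames: List[List[float]] = []
    for start, end, left, right in chunk_spans(len(samples), chunk_len, stride):
        out = toy_encoder(samples[start:end])
        lo = left // FRAME
        hi = len(out) - right // FRAME
        frames.extend(out[lo:hi])
    return frames
```

With the toy encoder the joined frames match a single pass over the whole audio exactly, because each frame only depends on its own window. A real transformer encoder attends across the whole chunk, so the joined embeddings will differ from a full-length pass, which may be part of why my testing hasn’t worked. The sketch assumes `chunk_len` and `stride` are multiples of `FRAME`.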