Joining SpeechEncoderDecoder embedding chunks for processing longer audio

I’m trying to find a way to process longer audio files with SpeechEncoderDecoder. I know the CTC approach can achieve this by splitting the audio into chunks and then joining the logits, but I don’t know of any examples of a similar approach for transformer LM decoders.
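For context, the CTC chunk-and-join trick I mean looks roughly like this (a sketch only; the helper names and the `ratio` argument are my own, not a real API):

```python
import torch

def chunked_ctc_logits(logits_fn, audio, chunk_len, stride, ratio):
    """Run `logits_fn` over overlapping chunks of `audio` and join the
    logits, trimming the `stride` overlap on each side.  `ratio` is the
    number of input samples per logit frame (model-dependent; a wav2vec2
    encoder emits roughly one frame per 320 samples)."""
    step = chunk_len - 2 * stride               # fresh samples per chunk
    pieces, start, n = [], 0, audio.shape[-1]
    while start < n:
        begin = max(0, start - stride)          # left context
        end = min(n, start + step + stride)     # right context
        logits = logits_fn(audio[..., begin:end])      # (batch, frames, vocab)
        left = (start - begin) // ratio                # frames to drop on the left
        right = (end - min(n, start + step)) // ratio  # ...and on the right
        pieces.append(logits[:, left:logits.shape[1] - right])
        start += step
    return torch.cat(pieces, dim=1)

# toy check: a fake "model" that maps each sample to one logit frame (ratio=1)
fake = lambda x: x.unsqueeze(-1)
audio = torch.arange(10.).unsqueeze(0)
joined = chunked_ctc_logits(fake, audio, chunk_len=6, stride=1, ratio=1)
print(joined.shape)  # torch.Size([1, 10, 1])
```

This works for CTC precisely because each logit frame is independent, which is why I don’t see how to carry it over to an autoregressive decoder.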

My initial thought was to somehow join the encoder embeddings of the audio chunks and then pass them as the `encoder_outputs` value for decoder generation. Initial testing hasn’t been successful, however.
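Concretely, here is a minimal sketch of what I was attempting (the pure tensor part runs as-is; the `model`/`audio_chunks` wiring in the comment is assumed, not tested):

```python
import torch

def join_encoder_chunks(hidden_chunks):
    """Concatenate per-chunk encoder hidden states along the time
    (frame) dimension so the decoder sees one long encoder output."""
    return torch.cat(hidden_chunks, dim=1)  # (batch, total_frames, hidden)

# Assumed wiring with a loaded SpeechEncoderDecoderModel:
#   from transformers.modeling_outputs import BaseModelOutput
#   hidden_chunks = [model.encoder(input_values=c).last_hidden_state
#                    for c in audio_chunks]
#   joined = join_encoder_chunks(hidden_chunks)
#   ids = model.generate(encoder_outputs=BaseModelOutput(last_hidden_state=joined))

# stand-in tensors (three 50-frame chunks, hidden size 768):
chunks = [torch.randn(1, 50, 768) for _ in range(3)]
print(join_encoder_chunks(chunks).shape)  # torch.Size([1, 150, 768])
```

The shapes line up, but since each chunk is encoded without seeing its neighbours, I suspect the positional/contextual information at the chunk boundaries is what breaks generation.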

@anton-l you mentioned feeding in hidden states in that issue; could that work for the SpeechEncoderDecoder decoder? Also, is the streaming inference discussed in that issue CTC-only?


I’m also interested in how to do chunking for the speech encoder decoder models. Has anyone figured out how to do this?