I’m trying to find a way of processing longer audio files using SpeechEncoderDecoder. I know the CTC models can handle this by splitting the audio into chunks and then joining the logits, but I don’t know of any examples where a similar approach is taken for transformer LM decoders.
My initial thought is to somehow join the encoder embeddings of the audio chunks and then pass those as the encoder_outputs value for decoder generation. Initial testing hasn’t been successful, however.
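For reference, this is roughly the shape of what I’m attempting. This is only a sketch with a dummy stand-in for the encoder (the chunk length, stride, and hidden size below are assumptions, not values from any specific checkpoint); in the real pipeline the per-chunk hidden states would come from model.encoder(...) and the joined sequence would be wrapped in a BaseModelOutput before calling model.generate(encoder_outputs=...):

```python
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_S = 20       # chunk length in seconds (assumed)
DOWNSAMPLE = 320   # wav2vec2-style conv feature-extractor stride (assumed)
HIDDEN = 768       # encoder hidden size (assumed)


def dummy_encode(chunk: np.ndarray) -> np.ndarray:
    """Stand-in for model.encoder(input_values).last_hidden_state
    on a single chunk; returns (n_frames, hidden)."""
    n_frames = len(chunk) // DOWNSAMPLE
    return np.zeros((n_frames, HIDDEN))


# 90 s of (silent) audio split into fixed-length chunks
audio = np.zeros(90 * SAMPLE_RATE)
step = CHUNK_S * SAMPLE_RATE
chunks = [audio[i:i + step] for i in range(0, len(audio), step)]

# Join the per-chunk encoder states along the time axis; this joined
# sequence is what I'd then feed to the decoder as encoder_outputs.
joined = np.concatenate([dummy_encode(c) for c in chunks], axis=0)
print(joined.shape)  # one long hidden-state sequence for the decoder
```

The chunk boundaries are hard cuts here; I suspect overlapping chunks (as in the CTC chunking examples) would be needed to avoid boundary artifacts, but I haven’t verified that.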
@anton-l, you mentioned feeding in hidden states in that issue. Could this work for the SpeechEncoderDecoder decoder? Also, is the streaming inference in that issue CTC-only?
Thanks!