Joining SpeechEncoderDecoder embedding chunks for processing longer audio

I’m trying to find a way to process longer audio files with SpeechEncoderDecoder. I know the CTC approach can achieve this by splitting the audio into chunks and then joining the logits, but I don’t know of any examples of a similar approach for transformer LM decoders.
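For context, the CTC chunk-and-join trick I mean looks roughly like this (a sketch only; the helper names and the `ratio` argument are my own, not a real API):

```python
import torch

def chunked_ctc_logits(logits_fn, audio, chunk_len, stride, ratio):
    """Run `logits_fn` over overlapping chunks of `audio` and join the
    logits, trimming the `stride` overlap on each side.  `ratio` is the
    number of input samples per logit frame (model-dependent; a wav2vec2
    encoder emits roughly one frame per 320 samples)."""
    step = chunk_len - 2 * stride               # fresh samples per chunk
    pieces, start, n = [], 0, audio.shape[-1]
    while start < n:
        begin = max(0, start - stride)          # left context
        end = min(n, start + step + stride)     # right context
        logits = logits_fn(audio[..., begin:end])      # (batch, frames, vocab)
        left = (start - begin) // ratio                # frames to drop on the left
        right = (end - min(n, start + step)) // ratio  # ...and on the right
        pieces.append(logits[:, left:logits.shape[1] - right])
        start += step
    return torch.cat(pieces, dim=1)

# toy check: a fake "model" that maps each sample to one logit frame (ratio=1)
fake = lambda x: x.unsqueeze(-1)
audio = torch.arange(10.).unsqueeze(0)
joined = chunked_ctc_logits(fake, audio, chunk_len=6, stride=1, ratio=1)
print(joined.shape)  # torch.Size([1, 10, 1])
```

This works for CTC precisely because each logit frame is independent, which is why I don’t see how to carry it over to an autoregressive decoder.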

My initial thought was to somehow join the encoder embeddings of the audio chunks and then pass them as the `encoder_outputs` value for decoder generation. Initial testing hasn’t been successful, however.
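Concretely, here is a minimal sketch of what I was attempting (the pure tensor part runs as-is; the `model`/`audio_chunks` wiring in the comment is assumed, not tested):

```python
import torch

def join_encoder_chunks(hidden_chunks):
    """Concatenate per-chunk encoder hidden states along the time
    (frame) dimension so the decoder sees one long encoder output."""
    return torch.cat(hidden_chunks, dim=1)  # (batch, total_frames, hidden)

# Assumed wiring with a loaded SpeechEncoderDecoderModel:
#   from transformers.modeling_outputs import BaseModelOutput
#   hidden_chunks = [model.encoder(input_values=c).last_hidden_state
#                    for c in audio_chunks]
#   joined = join_encoder_chunks(hidden_chunks)
#   ids = model.generate(encoder_outputs=BaseModelOutput(last_hidden_state=joined))

# stand-in tensors (three 50-frame chunks, hidden size 768):
chunks = [torch.randn(1, 50, 768) for _ in range(3)]
print(join_encoder_chunks(chunks).shape)  # torch.Size([1, 150, 768])
```

The shapes line up, but since each chunk is encoded without seeing its neighbours, I suspect the positional/contextual information at the chunk boundaries is what breaks generation.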

@anton-l you mentioned feeding in hidden states in that issue; could that work for the SpeechEncoderDecoder decoder? Also, is the streaming inference discussed in that issue CTC-only?


I’m also interested in how to do chunking for the speech encoder decoder models. Has anyone figured out how to do this?