I am training a custom speech encoder-decoder model with a wav2vec2 encoder and a decoder taken from a language model (e.g., BART). The architecture can be trained as an `AutoModelForSpeechSeq2Seq` model and works well so far. I noticed that my setup resembles Whisper in some ways. Whisper describes itself as a multilingual and multitask model, which got me curious: how can I implement multitasking in my model?
For multilinguality, I prepend a language identifier token to the target text during training (e.g., `[en] it's a really nice day outside`). During generation, I can either force the decoder input IDs to include a specific language token (e.g., `[en]`) or let the model infer the language from the input audio. This approach works well for language-specific tasks.
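For reference, this is roughly what the multilingual part looks like in my code. The checkpoint path and the `[en]`/`[de]` token names are placeholders from my own setup, and `forced_decoder_ids` is what I use on my transformers version:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSpeechSeq2Seq.from_pretrained("./wav2vec2-bart-checkpoint")  # placeholder path

# Register the language identifiers as special tokens and grow the decoder's embedding matrix.
tokenizer.add_special_tokens({"additional_special_tokens": ["[en]", "[de]"]})
model.decoder.resize_token_embeddings(len(tokenizer))

# Training targets simply start with the language token:
labels = tokenizer("[en] it's a really nice day outside", return_tensors="pt").input_ids

# At generation time I can force the first real decoder token to be a given language ID ...
input_values = torch.randn(1, 16000)  # stand-in for one second of 16 kHz audio
lang_id = tokenizer.convert_tokens_to_ids("[en]")
generated = model.generate(input_values, forced_decoder_ids=[[1, lang_id]])

# ... or drop forced_decoder_ids and let the model predict the language from the audio.
```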
For multitasking, however, the challenge is that the model cannot infer the task type (e.g., translation vs. summarization) directly from the audio. I haven't fully explored this yet, but I suspect task-specific identifiers (like `[summarization]`) could be handled with a custom PyTorch training script, roughly as sketched below. Unfortunately, writing such a script would take significantly more time than I currently have.
My questions are:
- Is it possible to incorporate task-specific identifiers (similar to the language identifiers) into the training process using `transformers.Seq2SeqTrainer`?
- If yes, how can I modify or extend `Seq2SeqTrainer` to support multitasking?
Any guidance or pointers would be greatly appreciated!