Hi,
EncoderDecoderModel
is meant to combine any bidirectional text encoder (e.g. BERT) with any autoregressive text decoder (e.g. GPT2). We’re planning to add a VisionEncoderDecoderModel
(recently we’ve added SpeechEncoderDecoderModel
, which allows you to combine any speech autoencoding model such as Wav2Vec2 with any autoregressive text decoder).
Feel free to contribute this if you are interested!