Tied weights for encoder and decoder vocab matrices hard-coded in T5?

Hi! I’m looking for some clarification on tied weights in encoder-decoder models in the transformers library.

In encoder-decoder models, when it comes to vocab matrices and the decoder LM head, you have the option to train with:

  1. All tied weights (encoder/decoder vocab embeddings and LM head share weights).
  2. Vocab embeddings shared (encoder/decoder share embeddings, but LM head does not).
  3. Tied decoder (decoder vocab embeddings and decoder LM head share weights).
  4. Untied weights (none of the above share weights).

In PretrainedConfig there are two arguments that control whether weights are tied: tie_encoder_decoder (default False) and tie_word_embeddings (default True).
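
For concreteness, here is how I understand these flags are meant to be used (a minimal sketch; google/t5-v1_1-base is just an example checkpoint):

```python
from transformers import T5Config, T5ForConditionalGeneration

# Example checkpoint; any T5 config would do.
config = T5Config.from_pretrained("google/t5-v1_1-base")

# Whether the input vocab embedding and the decoder LM head share weights.
config.tie_word_embeddings = False

# Whether encoder and decoder weights are tied (default is already False).
config.tie_encoder_decoder = False

model = T5ForConditionalGeneration(config)
```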

However, when looking at several different T5 models on the Hub, I noticed that they all have tied encoder and decoder vocab embeddings, despite their configs not setting tie_encoder_decoder. Looking at the T5 implementation, I find logic for decoupling the decoder vocab embeddings from the decoder LM head, but none for decoupling the encoder and decoder vocab embeddings.
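
A quick way to see this (a sketch; t5-small is just an example model id) is to check which parameters actually share storage after loading:

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

enc_emb = model.encoder.get_input_embeddings().weight
dec_emb = model.decoder.get_input_embeddings().weight
lm_head = model.get_output_embeddings().weight

# Encoder and decoder both point at the shared vocab embedding.
print(enc_emb is dec_emb)  # True, with no config flag controlling it

# Whether the LM head is tied as well follows tie_word_embeddings
# (True for the original T5 checkpoints, False for t5-v1_1).
print(lm_head is dec_emb)
```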

The reason I’m asking is that we’re looking at pretraining a UL2 model with the NeMo library. In this paper they perform ablations on different T5 pretraining configurations and arrive at the result:

Interestingly, untying the encoder/decoder embeddings improves performance with only a modest increase in parameter count.

In their experiments, the best configuration is tying the decoder embeddings and LM head, while untying the encoder/decoder embeddings.
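
To make concrete what that configuration would mean in transformers terms, here is a rough sketch of a manual workaround rather than anything the config can currently express (t5-small is just an example checkpoint with tie_word_embeddings=True; the config would still describe tied embeddings, so a save/load round trip would presumably re-tie them, and this is illustrative only):

```python
import torch.nn as nn
from transformers import T5ForConditionalGeneration

# Start from a checkpoint where the shared embedding and the LM head are tied.
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Give the encoder its own copy of the vocab embedding, so encoder and decoder
# inputs are no longer tied, while the decoder embedding stays tied to the LM head.
encoder_embedding = nn.Embedding.from_pretrained(
    model.shared.weight.detach().clone(), freeze=False
)
model.encoder.set_input_embeddings(encoder_embedding)
```

That kind of hack is exactly what we would like to avoid, hence the questions below.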

This goes against T5 v1.1, which instead decoupled the decoder input embeddings from the LM head.

It’s important to us that our trained model can be converted to the Hugging Face format and be accessible through the transformers library. Thus, I wanted to check:

  1. Did I understand the transformers T5 implementation correctly, i.e. it is currently possible to untie the decoder embeddings from the LM head (tie_word_embeddings=False), but not possible to untie the encoder/decoder input embeddings?
  2. If we were to train with untied encoder/decoder input embeddings, would you be open to a pull request (or similar) adding support for untied encoder/decoder input embeddings in the T5 implementation?