Fine-Tuning a Text2Text Model Using a Different Tokenizer

Hello everyone,

I’m just starting to explore the Hugging Face library and have a question related to Text2Text models.

Suppose I have model1, a Text2Text model pre-trained on a masked language modeling task, where it has learned syntactic structure based on the tokenization strategy of tokenizer1.

Now I want to fine-tune model1 on input text in a similar style to the masked language modeling task, but decode the outputs into a different format using a separate tokenizer (tokenizer2).

Is this possible? The approach I had in mind involves sequential text generation:

  1. The original model1 generates text.

  2. A fine-tuned model2 continues the generation based on the output of model1.
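To make this concrete, here is a rough sketch of what I have in mind (the checkpoint names are just placeholders, both models are assumed to be ordinary Seq2Seq checkpoints, and only plain text is passed between them):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint names -- substitute real models
tok1 = AutoTokenizer.from_pretrained("model1-checkpoint")
model1 = AutoModelForSeq2SeqLM.from_pretrained("model1-checkpoint")
tok2 = AutoTokenizer.from_pretrained("model2-checkpoint")
model2 = AutoModelForSeq2SeqLM.from_pretrained("model2-checkpoint")

text = "Some input in the style of the masked LM task."

# Step 1: model1 generates text with its own tokenizer
ids1 = tok1(text, return_tensors="pt").input_ids
out1 = model1.generate(ids1, max_new_tokens=64)
intermediate = tok1.batch_decode(out1, skip_special_tokens=True)[0]

# Step 2: model2 continues from model1's plain-text output,
# re-tokenizing it with tokenizer2
ids2 = tok2(intermediate, return_tensors="pt").input_ids
out2 = model2.generate(ids2, max_new_tokens=64)
final = tok2.batch_decode(out2, skip_special_tokens=True)[0]
print(final)
```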

Apologies if this is something trivial. Any comment or suggestion on specific tutorials is really appreciated!


Hello. If the output of Model 1 is just normal text, then there's no problem at all. Many people connect AI models and other ordinary programs in series, and not just for text-to-text; the same goes for images and audio.
The performance improvements that can be achieved by combining small AI models with large ones are often highlighted.
If the output of Model 1 is not plain text but rather tokens or tensors, then I think you'll have quite a bit of trouble… :sweat_smile:

Thanks John, I think I’m just a bit confused. Does it make sense to train an Encoder-Decoder model with two different tokenizers/vocabularies? I guess that might be standard, thinking about the task of translation between two languages with different character sets.


This is just an amateur’s opinion, but if you train a single model with two or more different tokenizers, performance will almost certainly improve when you infer with the same tokenizer you trained with; beyond that, it is hard to tell what overall effect it has on the model.
The safest way to get predictable training results seems to be one model and one tokenizer…

It is possible to use separate tokenizers for the encoder and decoder (input and output) layers. However, this is uncommon for text2text models. I can see how this could be useful. I’m not sure whether HF supports this natively; all indications suggest that text2text models assume the input and output use the same tokenizer.

Another complication is that most pretrained models these days share weights between the input and output linear layers (token to embedding, and embedding to token probability). That weight sharing is not compatible with separate tokenizers for the encoder and decoder.
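For example, you can inspect a checkpoint to see whether its embedding and output projection are tied (I’m using `t5-small` purely as an illustration; other checkpoints may differ):

```python
from transformers import T5ForConditionalGeneration

# t5-small is used here only as an illustration
model = T5ForConditionalGeneration.from_pretrained("t5-small")

print(model.config.tie_word_embeddings)  # typically True for this checkpoint
# When tied, the input embedding matrix and the LM head share the same
# weight tensor, so they necessarily index the same vocabulary/tokenizer
print(model.get_input_embeddings().weight is model.get_output_embeddings().weight)
```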


Thanks for your reply. However, I don’t really understand why it would be uncommon to have different tokenizers for the input and output languages. Say I want to build a Text2Text model for translation from Italian to Chinese. Obviously I would have to construct two different vocabularies in order to train the model to map the input ids of the source language (Italian) to the output ids of the target language (Chinese).
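Concretely, I was imagining something along these lines, building an EncoderDecoderModel from two separate checkpoints so that each side keeps its own vocabulary (the Italian and Chinese BERT checkpoints below are only meant as an illustration):

```python
from transformers import AutoTokenizer, EncoderDecoderModel

# Example checkpoints for illustration -- any encoder/decoder pair would do
enc_name = "dbmdz/bert-base-italian-cased"
dec_name = "bert-base-chinese"

src_tok = AutoTokenizer.from_pretrained(enc_name)  # Italian vocabulary
tgt_tok = AutoTokenizer.from_pretrained(dec_name)  # Chinese vocabulary

model = EncoderDecoderModel.from_encoder_decoder_pretrained(enc_name, dec_name)
model.config.decoder_start_token_id = tgt_tok.cls_token_id
model.config.pad_token_id = tgt_tok.pad_token_id

# Inputs use the source tokenizer, labels use the target tokenizer
inputs = src_tok("Il gatto dorme sul divano.", return_tensors="pt")
labels = tgt_tok("猫在沙发上睡觉。", return_tensors="pt").input_ids

outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)
print(outputs.loss)
```

As far as I understand, the encoder and decoder each keep their own embedding matrix in this setup, so any weight sharing would only happen within the decoder, but please correct me if I’m wrong.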
