Fine-Tuning a Text2Text Model Using a Different Tokenizer

Hello everyone,

I’m just starting to explore the Hugging Face library and have a question related to Text2Text models.

Suppose I have model1, a Text2Text model pre-trained on a masked language modeling task, where it has learned syntactic structure based on the tokenization strategy of tokenizer1.

Now I want to fine-tune model1 on input text in a similar style to the masked language modeling task, but decode the outputs into a different format using a separate tokenizer (tokenizer2).

Is this possible? The approach I had in mind involves sequential text generation:

  1. The original model1 generates text.

  2. A fine-tuned model2 continues the generation based on the output of model1.
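To make this concrete, here is a rough sketch of what I have in mind (the checkpoint names are just placeholders, both models are assumed to be ordinary Seq2Seq checkpoints, and only plain text is passed between them):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint names -- substitute real models
tok1 = AutoTokenizer.from_pretrained("model1-checkpoint")
model1 = AutoModelForSeq2SeqLM.from_pretrained("model1-checkpoint")
tok2 = AutoTokenizer.from_pretrained("model2-checkpoint")
model2 = AutoModelForSeq2SeqLM.from_pretrained("model2-checkpoint")

text = "Some input in the style of the masked LM task."

# Step 1: model1 generates text with its own tokenizer
ids1 = tok1(text, return_tensors="pt").input_ids
out1 = model1.generate(ids1, max_new_tokens=64)
intermediate = tok1.batch_decode(out1, skip_special_tokens=True)[0]

# Step 2: model2 continues from model1's plain-text output,
# re-tokenizing it with tokenizer2
ids2 = tok2(intermediate, return_tensors="pt").input_ids
out2 = model2.generate(ids2, max_new_tokens=64)
final = tok2.batch_decode(out2, skip_special_tokens=True)[0]
print(final)
```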

Apologies if this is something trivial. Any comment or suggestion on specific tutorials is really appreciated!


Hello. If the output of Model 1 is just normal text, then there's no problem at all. Many people connect AI models and other ordinary programs in series, and not just for text-to-text; the same goes for images and audio.
The performance improvements that can be achieved by combining small AI models with large ones are often highlighted.
If the output of Model 1 is not plain text but rather tokens or tensors, then I think you'll have quite a bit of trouble… :sweat_smile:

Thanks John, I think I’m just a bit confused. Does it make sense to train an Encoder-Decoder model with two different tokenizers/vocabularies? I guess that might be standard, thinking about the task of translation between two languages with different character sets.


This is just an amateur’s opinion, but if you train a single model with two or more different tokenizers, performance will almost certainly improve when you infer with the same tokenizer you trained with; beyond that, it is hard to tell what overall effect it has on the model.
The safest way to get predictable training results seems to be one model and one tokenizer…

It is possible to use separate tokenizers for the encoder and decoder (input and output) layers. However, this is uncommon for text2text models. I can see how this could be useful. I’m not sure whether HF supports this natively; all indications suggest that text2text models assume the input and output use the same tokenizer.

Another complication is that most pretrained models these days share weights between the input and output linear layers (token to embedding, and embedding to token probability). That weight sharing is not compatible with separate tokenizers for the encoder and decoder.
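For example, you can inspect a checkpoint to see whether its embedding and output projection are tied (I’m using `t5-small` purely as an illustration; other checkpoints may differ):

```python
from transformers import T5ForConditionalGeneration

# t5-small is used here only as an illustration
model = T5ForConditionalGeneration.from_pretrained("t5-small")

print(model.config.tie_word_embeddings)  # typically True for this checkpoint
# When tied, the input embedding matrix and the LM head share the same
# weight tensor, so they necessarily index the same vocabulary/tokenizer
print(model.get_input_embeddings().weight is model.get_output_embeddings().weight)
```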


Thanks for your reply. However, I don’t really understand why it would be uncommon to have different tokenizers for the input and output languages. Say I want to build a Text2Text model for translation from Italian to Chinese. Obviously I would have to construct two different vocabularies in order to train the model to map the input ids of the source language (Italian) to the output ids of the target language (Chinese).
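Concretely, I was imagining something along these lines, building an EncoderDecoderModel from two separate checkpoints so that each side keeps its own vocabulary (the Italian and Chinese BERT checkpoints below are only meant as an illustration):

```python
from transformers import AutoTokenizer, EncoderDecoderModel

# Example checkpoints for illustration -- any encoder/decoder pair would do
enc_name = "dbmdz/bert-base-italian-cased"
dec_name = "bert-base-chinese"

src_tok = AutoTokenizer.from_pretrained(enc_name)  # Italian vocabulary
tgt_tok = AutoTokenizer.from_pretrained(dec_name)  # Chinese vocabulary

model = EncoderDecoderModel.from_encoder_decoder_pretrained(enc_name, dec_name)
model.config.decoder_start_token_id = tgt_tok.cls_token_id
model.config.pad_token_id = tgt_tok.pad_token_id

# Inputs use the source tokenizer, labels use the target tokenizer
inputs = src_tok("Il gatto dorme sul divano.", return_tensors="pt")
labels = tgt_tok("猫在沙发上睡觉。", return_tensors="pt").input_ids

outputs = model(input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=labels)
print(outputs.loss)
```

As far as I understand, the encoder and decoder each keep their own embedding matrix in this setup, so any weight sharing would only happen within the decoder, but please correct me if I’m wrong.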
