Fine-tuning on a new corpus for conditional generation. Should I train from scratch?

Hello guys,

I’m trying to figure out the best strategy to pursue. The scenario is the following:

  1. I’d like to use FLAN-T5 as the base model
  2. I have a corpus in Brazilian Portuguese, so I want to fine-tune the original model (for general conditional generation, not for a specific downstream task; that will come later).

What is the best way to proceed? Should I train a custom tokenizer and start from scratch, or should I leverage everything already in place (model and tokenizer) and just run a few epochs to adapt the model to my language?
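One rough signal for the tokenizer question is "fertility": how many subword pieces the existing tokenizer produces per Portuguese word. If the pretrained vocabulary covers the language reasonably well, fertility stays low and keeping the original tokenizer is attractive. This is only a sketch; `toy_tokenize` below is a hypothetical stand-in so it runs offline, and in practice you would load the real tokenizer with `AutoTokenizer.from_pretrained("google/flan-t5-base")` from the `transformers` library and pass its `tokenize` method instead.

```python
# Sketch: estimate tokenizer "fertility" (subword pieces per word) on a
# Portuguese sample. Very high fertility suggests the vocabulary covers
# the language poorly, which is an argument for a custom tokenizer.
#
# Hypothetical stand-in tokenizer so the sketch runs offline; with
# transformers installed you would instead do:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
#   print(fertility(sample, tok.tokenize))

def toy_tokenize(text: str) -> list:
    # Stand-in: splits each whitespace word into 2-character pieces.
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return pieces

def fertility(text: str, tokenize) -> float:
    # Average number of tokenizer pieces per whitespace-separated word.
    words = text.split()
    return len(tokenize(text)) / len(words)

sample = "O tokenizador original pode fragmentar demais palavras em português."
print(f"pieces per word: {fertility(sample, toy_tokenize):.2f}")
```

Comparing this number between the original FLAN-T5 tokenizer and a SentencePiece model trained on your corpus would make the trade-off concrete before committing to either path.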

I understand that if I start from scratch I lose all the knowledge stored in the pretrained model, but if that is the right way to specialize a model for a new language, that's fine. Your thoughts will be very much appreciated.

Thank you so much for this.