Does training a tokenizer and adding new tokens to the model improve performance when training BART on a custom dataset?

I have a dataset composed of 300,000 articles. Is it wise to train a tokenizer on my own dataset, add the new tokens to the model, and then train BART?
Is my dataset large enough to pre-train the embeddings of the newly added tokens?
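
For context, this is roughly the setup I had in mind (just a sketch using the transformers API; the checkpoint and the example tokens are placeholders):

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Placeholder checkpoint; I would start from a pre-trained BART
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Hypothetical domain-specific tokens mined from my articles
new_tokens = ["domain_term_1", "domain_term_2"]
num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} tokens")

# Grow the embedding matrix; the new rows are randomly initialized,
# which is why I'm asking whether 300k articles are enough to train them
model.resize_token_embeddings(len(tokenizer))
```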

If the domain of your dataset isn’t very specialized compared to the pre-training dataset, then training a new tokenizer won’t help much. Also note that if you train a new tokenizer, you’ll need to do the pre-training again, since the existing embeddings no longer match the new vocabulary. And if you train your tokenizer from scratch, you don’t need to add new tokens separately any more, because the vocabulary already comes from your own data. Hope this makes sense.
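
For reference, if you do want to go the from-scratch route, you can train a new tokenizer from an existing one with train_new_from_iterator (just a sketch, assuming a fast tokenizer and your articles held in memory as a list of strings):

```python
from transformers import AutoTokenizer

# Start from the existing BART tokenizer (a "fast" tokenizer is required)
old_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

# Replace this with your 300k-article corpus
articles = ["first article text", "second article text"]

def batch_iterator(texts, batch_size=1000):
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]

# Learns a new BPE vocabulary from your corpus; the resulting tokenizer
# no longer matches the pre-trained embeddings, hence the need to pre-train again
new_tokenizer = old_tokenizer.train_new_from_iterator(
    batch_iterator(articles), vocab_size=old_tokenizer.vocab_size
)
new_tokenizer.save_pretrained("bart-domain-tokenizer")
```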

Is my dataset large enough

IMO this is a highly subjective question and depends on the quality of the data, the task, etc.
@sshleifer might have some advice for this

Interested to see if this improves performance, but I suspect it will not.

Hi,

I’m just curious, on this thread: is it possible to make a model tokenizer-agnostic? It seems that for every new domain, one would have to create a new tokenizer and then retrain the model, which seems highly inefficient.