Does training a tokenizer and adding new tokens to the model improve performance when training BART on a custom dataset?

I have a dataset composed of 300,000 articles. Is it wise to train a tokenizer on my own dataset, add the new tokens to the model, and then train BART?
Is my dataset large enough to pre-train the embeddings of the newly added tokens?
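
For context, this is roughly the setup I had in mind (just a sketch using the transformers API; the checkpoint and the example tokens are placeholders):

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Placeholder checkpoint; I would start from a pre-trained BART
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Hypothetical domain-specific tokens mined from my articles
new_tokens = ["domain_term_1", "domain_term_2"]
num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} tokens")

# Grow the embedding matrix; the new rows are randomly initialized,
# which is why I'm asking whether 300k articles are enough to train them
model.resize_token_embeddings(len(tokenizer))
```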

If the domain of your dataset isn’t very specialized compared to the pre-training dataset, then training a new tokenizer won’t help much. Also note that if you train a new tokenizer, you’ll need to do the pre-training again, since the existing embeddings no longer match the new vocabulary. And if you train your tokenizer from scratch, you don’t need to add new tokens separately any more, because the vocabulary already comes from your own data. Hope this makes sense.
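
For reference, if you do want to go the from-scratch route, you can train a new tokenizer from an existing one with train_new_from_iterator (just a sketch, assuming a fast tokenizer and your articles held in memory as a list of strings):

```python
from transformers import AutoTokenizer

# Start from the existing BART tokenizer (a "fast" tokenizer is required)
old_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

# Replace this with your 300k-article corpus
articles = ["first article text", "second article text"]

def batch_iterator(texts, batch_size=1000):
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]

# Learns a new BPE vocabulary from your corpus; the resulting tokenizer
# no longer matches the pre-trained embeddings, hence the need to pre-train again
new_tokenizer = old_tokenizer.train_new_from_iterator(
    batch_iterator(articles), vocab_size=old_tokenizer.vocab_size
)
new_tokenizer.save_pretrained("bart-domain-tokenizer")
```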

Is my dataset large enough

IMO this is a highly subjective question and depends on the quality of the data, the task, etc.
@sshleifer might have some advice for this

Interested to see if this improves performance, but I suspect it will not.

Hi,

I’m just curious, on this thread: is it possible to make a model tokenizer-agnostic? It seems that for every new domain, one would have to create a new tokenizer and then retrain the model, which seems highly inefficient.