I am using a RoBERTa-based model for pre-training and fine-tuning.
To pre-train, I use RobertaForMaskedLM with a customized tokenizer. That is, I used my tokenizer in LineByLineTextDataset() and pre-trained the model for masked language modeling, roughly as sketched below.
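(A simplified version of my pre-training setup; the paths, corpus file, and hyperparameters are placeholders.)

```python
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    LineByLineTextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load the custom tokenizer that was saved earlier with save_pretrained()
tokenizer = RobertaTokenizerFast.from_pretrained("./my_tokenizer")

# Fresh RoBERTa model sized to the custom vocabulary
config = RobertaConfig(vocab_size=tokenizer.vocab_size)
model = RobertaForMaskedLM(config=config)

# Dataset built with the same custom tokenizer
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./corpus.txt",
    block_size=128,
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./pretrained-roberta", num_train_epochs=1),
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("./pretrained-roberta")
```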
However, for fine-tuning on a labelled dataset for a classification task, I think I must apply my customized tokenizer to the data before feeding it to the model.
My question is: how can I use my tokenizer to prepare the data and fine-tune my pre-trained model?
You could save your custom tokenizer using the save_pretrained method and then load it again with from_pretrained. For classification fine-tuning you can simply keep using that custom tokenizer. And if you are using the official transformers example scripts, all you need to do is pass the tokenizer via the --tokenizer_name_or_path argument.
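Something like this should work (a minimal sketch; the paths, number of labels, and toy texts are placeholders):

```python
import torch
from transformers import (
    RobertaTokenizerFast,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Re-load the custom tokenizer and the pre-trained checkpoint
tokenizer = RobertaTokenizerFast.from_pretrained("./my_tokenizer")
model = RobertaForSequenceClassification.from_pretrained(
    "./pretrained-roberta", num_labels=2
)

# Toy labelled data, just to show the expected input format
texts = ["first example text", "second example text"]
labels = [0, 1]
encodings = tokenizer(texts, truncation=True, padding=True, max_length=128)

class ClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./finetuned-roberta", num_train_epochs=3),
    train_dataset=ClassificationDataset(encodings, labels),
)
trainer.train()
```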
Thank you @valhalla. I used my tokenizer to tokenize my input texts before feeding them to my RobertaForSequenceClassification model. However, the documentation mentions that RoBERTa uses a BPE tokenizer:
What that means is that RobertaTokenizer uses BPE for tokenization. BPE is one of the algorithms with which a tokenizer can be trained. Hope this makes it clear.
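For example, training a byte-level BPE tokenizer (the kind the original RoBERTa uses) with the tokenizers library looks roughly like this; the corpus path, vocabulary size, and output directory are placeholders:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on a raw text corpus
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["./corpus.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt, which RobertaTokenizerFast.from_pretrained can load
tokenizer.save_model("./my_tokenizer")
```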