Fine-tune a saved model with a custom tokenizer

I am using a RoBERTa-based model for pre-training and fine-tuning.

For pre-training, I use RobertaForMaskedLM with a customized tokenizer. That is, I passed my tokenizer to LineByLineTextDataset() and pre-trained the model on masked language modeling.
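
A minimal sketch of that pre-training setup, for reference. The file paths, hyperparameters, and output directory are placeholders, not details from the original post:

```python
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    LineByLineTextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load the custom tokenizer that was trained beforehand (assumed path).
tokenizer = RobertaTokenizerFast.from_pretrained("./my-custom-tokenizer")

# Fresh RoBERTa model sized to the custom vocabulary.
config = RobertaConfig(vocab_size=len(tokenizer))
model = RobertaForMaskedLM(config)

# Build the line-by-line dataset with the custom tokenizer.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./pretrain_corpus.txt",
    block_size=128,
)

# Mask 15% of tokens for the masked language modeling objective.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./pretrained-roberta", num_train_epochs=1),
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("./pretrained-roberta")
```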

However, for fine-tuning, when I want to use my labeled dataset for a classification task, I think I must apply my customized tokenizer to the data before feeding it to the model.

My question is: how can I use my tokenizer to prepare the data and fine-tune my pre-trained model?

Hi @Adel,

You could save your custom tokenizer using the save_pretrained method and then load it again with from_pretrained. For classification fine-tuning you can then just use that custom tokenizer. And if you are using the official transformers examples script, all you need to do is pass the tokenizer via the --tokenizer_name_or_path argument.
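
A hedged sketch of that suggestion, with placeholder paths and a toy two-example dataset that are not part of the original answer:

```python
import torch
from transformers import (
    RobertaTokenizerFast,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# 1) Save the custom tokenizer once after training it:
#    tokenizer.save_pretrained("./my-custom-tokenizer")
# 2) Load it again for fine-tuning.
tokenizer = RobertaTokenizerFast.from_pretrained("./my-custom-tokenizer")

# 3) Load the MLM-pre-trained weights with a classification head on top.
model = RobertaForSequenceClassification.from_pretrained(
    "./pretrained-roberta", num_labels=2
)

# 4) Tokenize the labeled data with the same custom tokenizer.
texts = ["first example", "second example"]
labels = [0, 1]
encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

class ClassificationDataset(torch.utils.data.Dataset):
    """Wraps tokenized inputs and labels for the Trainer."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_dataset = ClassificationDataset(encodings, labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./finetuned-roberta", num_train_epochs=1),
    train_dataset=train_dataset,
)
trainer.train()
```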

Thank you @valhalla. I used my tokenizer to tokenize my input texts before feeding them to my RobertaForSequenceClassification model. However, the documentation mentions that RoBERTa uses a BPE tokenizer:

RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.

so I wonder whether, even if I use my custom tokenizer, RobertaForSequenceClassification still uses BPE rather than my custom tokenizer.

What that means is that RobertaTokenizer uses BPE for tokenization; BPE is one of the algorithms with which a tokenizer is trained. The model itself does not tokenize anything: it only sees the token IDs produced by whichever tokenizer you call, so your custom tokenizer is the one doing the tokenization. Hope this makes it clear.
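
A minimal illustration of that point (the paths are placeholders): the model never sees raw text, only the input IDs your tokenizer produces.

```python
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

tokenizer = RobertaTokenizerFast.from_pretrained("./my-custom-tokenizer")
model = RobertaForSequenceClassification.from_pretrained("./pretrained-roberta")

inputs = tokenizer("some labeled example", return_tensors="pt")
print(inputs["input_ids"])   # IDs from *your* tokenizer's vocabulary
outputs = model(**inputs)    # the model just embeds those IDs
print(outputs.logits)
```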