I am using a RoBERTa-based model for pre-training and fine-tuning.
To pre-train, I use RobertaForMaskedLM with a customized tokenizer. That is, I used my tokenizer in LineByLineTextDataset() and pre-trained the model for masked language modeling, roughly as sketched below.
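(A simplified version of my pre-training setup; the paths, corpus file, and hyperparameters are placeholders.)

```python
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    LineByLineTextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load the custom tokenizer that was saved earlier with save_pretrained()
tokenizer = RobertaTokenizerFast.from_pretrained("./my_tokenizer")

# Fresh RoBERTa model sized to the custom vocabulary
config = RobertaConfig(vocab_size=tokenizer.vocab_size)
model = RobertaForMaskedLM(config=config)

# Dataset built with the same custom tokenizer
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./corpus.txt",
    block_size=128,
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./pretrained-roberta", num_train_epochs=1),
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("./pretrained-roberta")
```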
However, for fine-tuning on a labelled dataset for a classification task, I think I must apply my customized tokenizer to the data before feeding it to the model.
My question is: how can I use my tokenizer to prepare the data and fine-tune my pre-trained model?
You could save your custom tokenizer using the save_pretrained method and then load it again with from_pretrained. For classification fine-tuning you can simply keep using that custom tokenizer. And if you are using the official transformers example scripts, all you need to do is pass the tokenizer via the --tokenizer_name_or_path argument.
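Something like this should work (a minimal sketch; the paths, number of labels, and toy texts are placeholders):

```python
import torch
from transformers import (
    RobertaTokenizerFast,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Re-load the custom tokenizer and the pre-trained checkpoint
tokenizer = RobertaTokenizerFast.from_pretrained("./my_tokenizer")
model = RobertaForSequenceClassification.from_pretrained(
    "./pretrained-roberta", num_labels=2
)

# Toy labelled data, just to show the expected input format
texts = ["first example text", "second example text"]
labels = [0, 1]
encodings = tokenizer(texts, truncation=True, padding=True, max_length=128)

class ClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./finetuned-roberta", num_train_epochs=3),
    train_dataset=ClassificationDataset(encodings, labels),
)
trainer.train()
```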
Thank you @valhalla. I used my tokenizer to tokenize my input texts before feeding them to my RobertaForSequenceClassification model. However, the documentation mentions that RoBERTa uses a BPE tokenizer:
What that means is that RobertaTokenizer uses BPE for tokenization. BPE is one of the algorithms with which a tokenizer can be trained. Hope this makes it clear.
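For example, training a byte-level BPE tokenizer (the kind the original RoBERTa uses) with the tokenizers library looks roughly like this; the corpus path, vocabulary size, and output directory are placeholders:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on a raw text corpus
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["./corpus.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt, which RobertaTokenizerFast.from_pretrained can load
tokenizer.save_model("./my_tokenizer")
```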