I am working on a corpus of text in which I have injected instances of a special string: “[CHARACTER]”.
I now want to train a BERT classification model on this pre-processed corpus, so I instantiate a BertTokenizer and pass it my special token:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', additional_special_tokens=['[CHARACTER]'])
This outputs the message:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
So my question is: how do I then go ahead and fine-tune the word embedding for this added special token?
I assume this is a different process altogether than training the classifier itself? Could someone please enlighten me as to what needs to be done? Thanks.