The text I want to run my downstream tasks on contains a lot of domain-specific references. By references I mean citations, i.e. character sequences that identify entities such as other documents.
Since those entities are domain-specific, none of the pre-trained models will know them, and BERT’s tokenizer will probably split them into sub-word tokens.
Those citations are extremely valuable in terms of the meaning of the content: which citations a document includes says a lot about its topic.
Naturally, I’d like to preserve these citations and ideally also train meaningful embeddings for them.
How do I best go about doing that?
From what I can gather I could
- add the tokens to the tokenizer (via add_tokens())
- then use TFBertForPreTraining to continue pre-training BERT on domain-specific content (which includes those citations).
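In code, I imagine the first step would look roughly like this (a sketch based on my reading of the transformers API; the citation string is a made-up placeholder, and I build the tokenizer from a tiny local vocab file just so the snippet is self-contained, whereas in practice I’d use from_pretrained()):

```python
# Sketch: registering a citation as a single token, as I understand
# the transformers API. The citation string below is a placeholder.
from transformers import BertTokenizer

# Tiny local vocab so the example runs without downloading a checkpoint;
# with a real model I'd load the tokenizer via from_pretrained().
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "see", "for", "details"]
with open("tiny_vocab.txt", "w") as f:
    f.write("\n".join(vocab))

tokenizer = BertTokenizer("tiny_vocab.txt", do_lower_case=True)

# Register a domain-specific citation as a regular (non-special) token.
# add_tokens() returns the number of tokens actually added.
num_added = tokenizer.add_tokens(["[cit_smith_2019]"])
print(num_added)  # -> 1

# The citation now survives tokenization as one token instead of
# being split into sub-word pieces:
toks = tokenizer.tokenize("see [cit_smith_2019] for details")
print(toks)
```

If I understand correctly, after add_tokens() I would also need to call model.resize_token_embeddings(len(tokenizer)) on the model before continuing pre-training, so the new token ids actually get rows in the embedding matrix — is that right?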
Is that the right way to handle this?
I’m not sure if add_tokens() is actually meant to expand the vocab or is just for additional special tokens like [CLS].
As always, any pointers would be much appreciated.
So far this forum has been invaluable.