How to handle "entities" during tokenization?

Hi everyone,

The text I want to run my downstream tasks on contains a lot of domain-specific references. By references I mean citations, i.e. character sequences that identify entities such as other documents.
Since those entities are domain-specific, none of the pre-trained models will understand them, and BERT’s tokenizer will probably split them into sub-tokens.
Those citations are extremely valuable for the meaning of the content: which citations a document includes can say a lot about its topics.
Naturally, I’d like to preserve these citations as single tokens and ideally also train meaningful embeddings for them.

How do I best go about doing that?

From what I can gather, I could:

  1. add the citation tokens to the tokenizer (via add_tokens(); see the sketch below), and
  2. then use TFBertForPreTraining() to continue pre-training BERT on domain-specific content (which will include those citations).
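
For step 1, something like this is what I have in mind (the citation strings here are just made-up placeholders, not my real data):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Hypothetical citation strings; in practice they would be collected from my corpus.
citations = ["[Smith2019]", "[DOC-1234]"]

# With the stock vocab, a citation gets broken into word pieces.
print(tokenizer.tokenize("[Smith2019]"))

# add_tokens() returns how many tokens were actually added (already-known tokens are skipped).
num_added = tokenizer.add_tokens(citations)
print(num_added, len(tokenizer))

# After adding, the citation survives tokenization as a single token.
print(tokenizer.tokenize("[Smith2019]"))
```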

Is that the right way to handle this?

I’m not sure whether add_tokens() is actually meant to expand the vocabulary, or whether it is only for additional special tokens like [CLS].

As always, any pointers would be much appreciated.

So far this forum has been invaluable :slight_smile:

I think what you suggested is a reasonable approach.
Just don’t forget to call model.resize_token_embeddings(len(tokenizer)) after tokenizer.add_tokens(), so the embedding matrix grows to match the new vocabulary.
You can then, as you suggested, train your BERT model with the MLM loss on your domain-specific corpus so it learns embeddings for the new tokens, and afterwards fine-tune it on your downstream tasks.
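
To make that concrete, here is a minimal PyTorch sketch (you mentioned TFBertForPreTraining, but the same idea applies on the TF side). I’ve used BertForMaskedLM since MLM alone is usually enough for continued pre-training; the corpus path, citation strings, and training arguments are placeholders you’d adapt:

```python
from datasets import load_dataset
from transformers import (
    BertTokenizer,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(["[Smith2019]", "[DOC-1234]"])  # hypothetical citation tokens

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
# Grow the embedding matrix so the new tokens get (randomly initialised) vectors.
model.resize_token_embeddings(len(tokenizer))

# "domain_corpus.txt" is a placeholder: one document or paragraph per line.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Standard MLM collator: randomly masks tokens so the model learns to predict them,
# which is how the new citation embeddings pick up meaning from context.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-domain-mlm", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()

# Save the adapted model and tokenizer, then reload them for downstream fine-tuning.
trainer.save_model("bert-domain-mlm")
tokenizer.save_pretrained("bert-domain-mlm")
```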
