The text I want to run my downstream tasks on contains a lot of domain-specific references. By references I mean citations, i.e. character sequences that identify entities such as other documents.
Since those entities are domain-specific, none of the pre-trained models will know them, and BERT’s tokenizer will probably split them into sub-word tokens.
Those citations are extremely valuable in terms of the meaning of the content: which citations a document includes says a lot about its topic.
Naturally, I’d like to preserve these citations and ideally also train meaningful embeddings for them.
How do I best go about doing that?
From what I can gather I could
- add the tokens to the tokenizer (via add_tokens())
- then use TFBertForPreTraining to continue pre-training BERT on domain-specific content (which includes those citations).
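In code, I imagine the first step would look roughly like this (a sketch based on my reading of the transformers API; the citation string is a made-up placeholder, and I build the tokenizer from a tiny local vocab file just so the snippet is self-contained, whereas in practice I’d use from_pretrained()):

```python
# Sketch: registering a citation as a single token, as I understand
# the transformers API. The citation string below is a placeholder.
from transformers import BertTokenizer

# Tiny local vocab so the example runs without downloading a checkpoint;
# with a real model I'd load the tokenizer via from_pretrained().
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "see", "for", "details"]
with open("tiny_vocab.txt", "w") as f:
    f.write("\n".join(vocab))

tokenizer = BertTokenizer("tiny_vocab.txt", do_lower_case=True)

# Register a domain-specific citation as a regular (non-special) token.
# add_tokens() returns the number of tokens actually added.
num_added = tokenizer.add_tokens(["[cit_smith_2019]"])
print(num_added)  # -> 1

# The citation now survives tokenization as one token instead of
# being split into sub-word pieces:
toks = tokenizer.tokenize("see [cit_smith_2019] for details")
print(toks)
```

If I understand correctly, after add_tokens() I would also need to call model.resize_token_embeddings(len(tokenizer)) on the model before continuing pre-training, so the new token ids actually get rows in the embedding matrix — is that right?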
Is that the right way to handle this?
I’m not sure if add_tokens() is actually meant to expand the vocab or is just for additional special tokens like [CLS].
As always, any pointers would be much appreciated.
So far this forum has been invaluable.