What is the best way to introduce generic tokens to the pre-processing step of pretrained NLP models

I am new to transformers. If I were to create a model based on bag-of-words or TF-IDF, I could substitute words/n-grams with generic tokens such as [PERSON], [NUMBER], and [COMPANY]. But with pretrained models such as BERT, these placeholders are interpreted literally and split into subword pieces.
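For concreteness, here is a minimal sketch of the behaviour I mean, assuming the Hugging Face `transformers` library (`AutoTokenizer`, `add_special_tokens`, and `resize_token_embeddings` are real calls from that library; treating them as the fix is my assumption):

```python
# Sketch: without registering them, placeholders like [PERSON] are split
# into subword pieces by a pretrained tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Treated literally, the placeholder breaks apart into several pieces:
print(tokenizer.tokenize("[PERSON] bought a car"))

# Registering the placeholders as special tokens keeps each one atomic:
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[PERSON]", "[NUMBER]", "[COMPANY]"]}
)
print(tokenizer.tokenize("[PERSON] bought a car"))

# If fine-tuning, the model's embedding matrix must then be resized to
# cover the new vocabulary entries:
# model.resize_token_embeddings(len(tokenizer))
```

Adding the tokens stops the subword splitting, but their embeddings start untrained, which is part of why I am unsure this is the right approach.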

What is the best way to replace text with a token prior to the tokenization step? Most NLP models rely on context, so I don't think I can just use "person"/"number"/"company". To keep the sentence grammatical I should probably use "a person"/"a number"/"a company", and to make it harder still, "the person"/"the number"/"the company" when referring to the same entity again. It should also account for possessives: "Andrew's house" should become "the person's house".
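To illustrate the kind of pre-tokenization replacement I have in mind, here is a rough sketch (the entity dictionary is hypothetical; in practice the spans would presumably come from an NER model rather than a hand-written map):

```python
import re

# Hypothetical map from surface form to placeholder token.
ENTITY_MAP = {"Andrew": "[PERSON]", "Acme Corp": "[COMPANY]"}

def mask_entities(text: str) -> str:
    """Replace known entity mentions with placeholder tokens.

    Word boundaries (\b) keep the possessive 's attached to the
    placeholder, so "Andrew's" becomes "[PERSON]'s".
    """
    for surface, token in ENTITY_MAP.items():
        pattern = re.compile(rf"\b{re.escape(surface)}\b")
        text = pattern.sub(token, text)
    return text

print(mask_entities("Andrew's house is near Acme Corp."))
# [PERSON]'s house is near [COMPANY].
```

This handles the possessive case mechanically, but it does not address the "a person" vs. "the person" distinction for repeated mentions, which is the part I am unsure how to approach.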

What are the recommended ways to approach this?