I am new to transformers. If I were to create a model based on bag of words or TF-IDF, I could substitute words/n-grams with generic tokens such as [PERSON], [NUMBER], and [COMPANY]. But with pre-trained models such as BERT, these placeholders would be interpreted literally, as ordinary text.
What is the best way to replace text with a token prior to the tokenization step? Most NLP models use context, so I don’t think I can just use “person”/“number”/“company”. To keep the text grammatically correct, I should probably use “a person”/“a number”/“a company”. To complicate things further, I should probably use “the person”/“the number”/“the company” when referring to something that was already mentioned. It would also have to account for possessives: “Andrew’s house” should become “the person’s house”.
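To make this concrete, here is a minimal sketch of the kind of substitution I have in mind. It assumes the entities are already known: the `ENTITIES` mapping and the `mask_entities` function are names I made up for illustration (in practice the spans would presumably come from an NER step). It uses "a" on first mention and "the" afterwards, keeps the possessive suffix, and ignores numbers and sentence-initial capitalization.

```python
import re

# Hypothetical lookup from known entity strings to generic nouns.
# This mapping is invented for illustration; it is not from any library.
ENTITIES = {
    "Andrew": "person",
    "Acme Corp": "company",
}


def mask_entities(text, entities=ENTITIES):
    """Replace known entities with 'a <noun>' on first mention and
    'the <noun>' on later mentions, preserving a trailing possessive."""
    seen = set()

    # Try longer names first so a short name cannot shadow a longer one.
    names = sorted(entities, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b"
        r"(['\u2019]s)?"  # optional possessive with a straight or curly apostrophe
    )

    def repl(match):
        name = match.group(1)
        possessive = match.group(2) or ""
        article = "the" if name in seen else "a"
        seen.add(name)
        return f"{article} {entities[name]}{possessive}"

    return pattern.sub(repl, text)


print(mask_entities("Andrew works at Acme Corp. Andrew's house is nearby."))
```

The output still starts sentences with a lowercase article, which is one of the details I am unsure how to handle cleanly.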
What are the recommended ways to approach this?