I am using a transformer model (e.g. T5, …) to make predictions on some text. The text happens to contain some "headline" tags (`</h1>`) that I would like to ignore. I could of course strip them from the text, run the predictions, and then restore them, but I would rather make them "ignorable", for example via the attention mask.
My problem: a "word" like `</h1>` is not in the model's vocabulary. Ideally I want some way to make the tokenizer recognize these "words" but set the attention mask to `0` for them. How do I do this? Or is there some other technique (without attention masks)?
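To make the intent concrete, here is a toy sketch of the mask I would like to end up with. The vocabulary is faked with a plain dict standing in for the real tokenizer (which, I assume, would first need the tag added to it, e.g. via something like `tokenizer.add_tokens(["</h1>"])`):

```python
# Fake vocabulary standing in for the real tokenizer's vocab.
# Assumes "</h1>" has already been added so it maps to a single id.
vocab = {"Big": 11, "news": 12, "</h1>": 13, "today": 14}
ignorable_ids = {vocab["</h1>"]}

def encode_with_mask(words):
    """Map words to ids, zeroing the attention mask for ignorable ids."""
    input_ids = [vocab[w] for w in words]
    attention_mask = [0 if i in ignorable_ids else 1 for i in input_ids]
    return input_ids, attention_mask

ids, mask = encode_with_mask(["Big", "news", "</h1>", "today"])
# ids  -> [11, 12, 13, 14]
# mask -> [1, 1, 0, 1]   (the </h1> position is masked out)
```

In other words: the tag should still occupy a position in `input_ids`, but the model should not attend to it.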
P.S. Before posting I did search and could not find an existing post that covers this topic. Thanks in advance.