How can I make a model ignore a word?

Hi all

I am using a transformer model (e.g. T5, …) to make predictions on some text. The text contains some “headline” tokens (<h1> or </h1>) that I would like the model to ignore. I could of course strip them from the text, run the predictions, and then restore them afterwards. But I would rather make them “ignorable”, for example by using the attention mask.
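For reference, the strip-and-restore fallback could look roughly like the sketch below. The names strip_tags and restore_tags are just illustrative helpers of mine, and it assumes the only tags are <h1> and </h1> and that the text itself is not changed between stripping and restoring.

```python
import re

# Matches the opening or closing headline tag (assumed to be the only tags present).
TAG_RE = re.compile(r"</?h1>")


def strip_tags(text):
    """Return (clean_text, tags); tags holds (offset_in_clean_text, tag) pairs."""
    parts, tags = [], []
    clean_len = 0  # length of the tag-free text built so far
    last = 0       # end of the previous match in the original text
    for m in TAG_RE.finditer(text):
        chunk = text[last:m.start()]
        parts.append(chunk)
        clean_len += len(chunk)
        tags.append((clean_len, m.group()))
        last = m.end()
    parts.append(text[last:])
    return "".join(parts), tags


def restore_tags(clean, tags):
    """Re-insert each tag at the offset recorded by strip_tags."""
    out, last = [], 0
    for pos, tag in tags:
        out.append(clean[last:pos])
        out.append(tag)
        last = pos
    out.append(clean[last:])
    return "".join(out)


clean, tags = strip_tags("<h1>Title</h1> Body text")
restored = restore_tags(clean, tags)
```

This only round-trips when the predictions refer to the unchanged text (e.g. per-character or per-token labels). If the model rewrites the text, the recorded offsets no longer line up.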

My problem: a “word” like <h1> or </h1> is not in the model vocabulary, so the tokenizer does not treat it as a single token. Ideally I want some way to make the tokenizer recognize these “words” but set the attention mask to 0 at their positions. How do I do this? Or is there some other technique (without attention masks)?
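To make the idea concrete, here is a rough sketch of what I have in mind, assuming the Hugging Face transformers library. The checkpoint name "t5-small" and the helper names mask_ignored and encode_ignoring_tags are placeholders of mine, and I have not checked how well a model copes with this.

```python
def mask_ignored(input_ids, attention_mask, ignored_ids):
    """Return a copy of attention_mask with 0 wherever input_ids holds an ignored id."""
    ignored = set(ignored_ids)
    return [0 if tok in ignored else m for tok, m in zip(input_ids, attention_mask)]


def encode_ignoring_tags(text, model_name="t5-small", tags=("<h1>", "</h1>")):
    """Hypothetical helper: tokenize `text` and zero the mask at the tag positions."""
    # Imported lazily so the pure-Python part above works without the library.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.add_tokens(list(tags))  # each tag now maps to a single new token id
    tag_ids = tokenizer.convert_tokens_to_ids(list(tags))
    enc = tokenizer(text)  # plain Python lists, no tensors
    enc["attention_mask"] = mask_ignored(
        enc["input_ids"], enc["attention_mask"], tag_ids
    )
    return tokenizer, enc


# Toy ids: 100 and 101 stand in for the two tag tokens.
example_mask = mask_ignored([5, 100, 7, 101, 1], [1, 1, 1, 1, 1], [100, 101])
```

As far as I understand, the model would presumably also need model.resize_token_embeddings(len(tokenizer)) after adding tokens, and the masked tags would still occupy positions in the sequence, so I am not sure this is equivalent to removing them.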

Thank you

P.S. Before posting, I searched and could not find a post that covers this topic. Thanks in advance.