I am using a transformer model (e.g. T5, …) to make predictions on some text. The text happens to contain some "headline" tags (`</h1>`) that I would like to ignore. I could of course strip them from the text, run the predictions, and then restore them, but I would rather make them "ignorable", for example via the attention mask.
My problem: a "word" like `</h1>` is not in the model's vocabulary. Ideally I want some way to make the tokenizer recognize these "words" but set the attention mask to `0` for them. How do I do this? Or is there some other technique (without attention masks)?
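To make the intent concrete, here is a toy sketch of the mask I would like to end up with. The vocabulary is faked with a plain dict standing in for the real tokenizer (which, I assume, would first need the tag added to it, e.g. via something like `tokenizer.add_tokens(["</h1>"])`):

```python
# Fake vocabulary standing in for the real tokenizer's vocab.
# Assumes "</h1>" has already been added so it maps to a single id.
vocab = {"Big": 11, "news": 12, "</h1>": 13, "today": 14}
ignorable_ids = {vocab["</h1>"]}

def encode_with_mask(words):
    """Map words to ids, zeroing the attention mask for ignorable ids."""
    input_ids = [vocab[w] for w in words]
    attention_mask = [0 if i in ignorable_ids else 1 for i in input_ids]
    return input_ids, attention_mask

ids, mask = encode_with_mask(["Big", "news", "</h1>", "today"])
# ids  -> [11, 12, 13, 14]
# mask -> [1, 1, 0, 1]   (the </h1> position is masked out)
```

In other words: the tag should still occupy a position in `input_ids`, but the model should not attend to it.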
P.S. Before posting I did search and could not find an existing post that covers this topic. Thanks in advance.