How to make a word ignorable to a model?

GPN · November 4, 2021, 4:06pm

Hi all

I am using a transformer model (e.g. T5, …) to make predictions on some text. The text contains some heading tags (<h1> and </h1>) that I would like the model to ignore. I could of course strip them from the text, run the predictions, and then restore them, but I would rather make them “ignorable”, for example via the attention mask.
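To make the strip-and-restore fallback concrete, here is a rough sketch of what I mean. The helper names `strip_tags` and `restore_tags` are made up for illustration, and it assumes the prediction is aligned with the unchanged input text (e.g. per-character or per-span labels). It would not work as-is for free-form generated output.

```python
import re

# Matches the heading tags mentioned above: <h1> and </h1>.
TAG_RE = re.compile(r"</?h1>")

def strip_tags(text):
    """Return (clean_text, tags); tags is a list of (offset, tag) pairs.

    Each offset is the position in clean_text where the tag originally sat.
    """
    tags, parts = [], []
    clean_len = 0
    last = 0
    for m in TAG_RE.finditer(text):
        chunk = text[last:m.start()]
        parts.append(chunk)
        clean_len += len(chunk)
        tags.append((clean_len, m.group()))
        last = m.end()
    parts.append(text[last:])
    return "".join(parts), tags

def restore_tags(clean_text, tags):
    """Re-insert the tags at their recorded offsets (text must be unchanged)."""
    out = []
    last = 0
    for offset, tag in tags:
        out.append(clean_text[last:offset])
        out.append(tag)
        last = offset
    out.append(clean_text[last:])
    return "".join(out)

text = "<h1>Quarterly report</h1> Revenue grew in Q3."
clean, tags = strip_tags(text)      # clean is what the model would see
restored = restore_tags(clean, tags)
```

This works, but it means carrying the offsets around next to every input, which is the bookkeeping I would like to avoid.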

My problem: a “word” like <h1> or </h1> is not in the model's vocabulary. Ideally, I want the tokenizer to recognize these “words” as tokens and to set the attention mask to 0 for them. How can I do this? Or is there another technique that does not rely on attention masks?
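For the attention-mask idea, this is roughly what I have in mind, written on plain Python lists so it does not depend on any particular model. The function name `mask_ignorable` and the token ids are hypothetical. With a Hugging Face tokenizer I assume the ids would come from `tokenizer.add_tokens(["<h1>", "</h1>"])` and `tokenizer.convert_tokens_to_ids(...)`, with `model.resize_token_embeddings(len(tokenizer))` called afterwards.

```python
def mask_ignorable(input_ids, attention_mask, ignorable_ids):
    """Return a copy of attention_mask with 0 wherever the token is ignorable."""
    return [
        0 if tok in ignorable_ids else m
        for tok, m in zip(input_ids, attention_mask)
    ]

# Hypothetical ids for the two added tokens; real values depend on the tokenizer.
H1_OPEN, H1_CLOSE = 32100, 32101

# Made-up encoding of "<h1> some title </h1> body </s>".
input_ids = [H1_OPEN, 4740, 934, H1_CLOSE, 19764, 1]
attention_mask = [1, 1, 1, 1, 1, 1]

new_mask = mask_ignorable(input_ids, attention_mask, {H1_OPEN, H1_CLOSE})
```

As far as I understand, the masked tokens would still occupy positions in the sequence, so I am not sure this is fully equivalent to removing them.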

Thank you

P.S. Before posting, I searched and could not find an existing post that covers this topic.
