Whole-word masking for T5

OfekGlick · November 28, 2023, 2:52pm

Hi!
I want to write a custom Data Collator using the T5 tokenizer that masks whole words.
The thing is, it is difficult to know which tokens make up a word. Some words are a single token, while others are split into more than one token.

My initial approach is, for every word, to find the start and end indices of the tokens that make up that word, but I don’t know whether that is a good way to do it.

How should I go about this task?

Thanks!

harlowh · November 28, 2023, 3:21pm

When you’re dealing with tokenization and need to mask whole words using the T5 tokenizer, one approach is to use the tokenized output and identify the boundaries of each word in the tokenized sequence. The T5 tokenizer in the transformers library provides a method called tokenize_plus that can be helpful for this task.

Here’s a step-by-step guide on how you can create a custom data collator to mask whole words using the T5 tokenizer:
from transformers import T5Tokenizer
from transformers import DataCollatorForLanguageModeling
import torch

class CustomDataCollator(DataCollatorForLanguageModeling):
    def __init__(self, tokenizer, mlm=True, mlm_probability=0.15):
        super().__init__(tokenizer=tokenizer, mlm=mlm, mlm_probability=mlm_probability)

    def mask_words(self, input_ids, labels):
        masked_input_ids = input_ids.clone()

        for i in range(len(labels)):
            # Identify the start and end index of each word in the tokenized sequence
            start_idx = (input_ids[i] == self.tokenizer.pad_token_id).nonzero().item() + 1
            end_idx = len(input_ids[i]) - (input_ids[i][::-1] == self.tokenizer.pad_token_id).nonzero().item() - 1

            # Mask the entire word
            masked_input_ids[i, start_idx:end_idx] = self.tokenizer.mask_token_id
            labels[i, start_idx:end_idx] = input_ids[i, start_idx:end_idx].clone()

        return masked_input_ids, labels

    def __call__(self, examples):
        batch = self._tensorize_batch(examples)
        input_ids, labels = self.mask_tokens(batch["input_ids"], batch["labels"])

        # Additional step to mask whole words
        masked_input_ids, masked_labels = self.mask_words(input_ids, labels)

        return {"input_ids": masked_input_ids, "labels": masked_labels}

# Example usage
tokenizer = T5Tokenizer.from_pretrained("t5-small")
data_collator = CustomDataCollator(tokenizer)

# Dummy data for demonstration
dummy_data = [{"text": "This is an example sentence."}, {"text": "Another example for testing."}]
encoded_data = tokenizer(dummy_data, return_tensors="pt", padding=True)

# Apply the custom data collator
masked_batch = data_collator(encoded_data["input_ids"])

print("Input IDs:", masked_batch["input_ids"])
print("Labels:", masked_batch["labels"])
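
The word-boundary idea from the original question can be sketched in plain Python, assuming SentencePiece-style pieces in which a leading "▁" (U+2581) marks the start of a new word, as in T5's vocabulary. The function name `word_spans` and the hand-written piece list are hypothetical; in practice the pieces would come from something like `tokenizer.tokenize(text)`.

```python
WORD_START = "\u2581"  # SentencePiece marker for "this piece begins a word"


def word_spans(pieces):
    """Group subword pieces into words.

    Returns one (start, end) index pair per word, with `end` exclusive.
    A piece that begins with the word-start marker opens a new word;
    any other piece extends the word currently being built.
    """
    spans = []
    for i, piece in enumerate(pieces):
        if piece.startswith(WORD_START) or not spans:
            spans.append([i, i + 1])
        else:
            spans[-1][1] = i + 1
    return [tuple(span) for span in spans]


# Illustrative pieces only; not the actual output of any tokenizer.
pieces = ["\u2581Whole", "\u2581word", "\u2581mask", "ing", "\u2581for", "\u2581T", "5"]
spans = word_spans(pieces)
print(spans)
```

Each span can then be masked or kept as a unit. Special tokens such as `</s>` do not carry the marker, so under this sketch they would be attached to the preceding word and would need separate handling.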

OfekGlick · November 28, 2023, 4:17pm

Hi, thanks for the reply!
I might be missing something, but where is the use of tokenize_plus?
Also, since we are dealing with a causal model, should mlm be set to False?
In the case where I want to pre-train a T5 and simply want to mask words, what are the labels supposed to be?
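
On the labels question: in T5's span-corruption objective, as described in the T5 documentation, each masked span in the input is replaced by a sentinel token (`<extra_id_0>`, `<extra_id_1>`, and so on), and the labels consist of the same sentinels, each followed by the tokens it hides, closed by one final sentinel. Below is a minimal piece-level sketch of that layout, assuming the word spans are already known. The function name `span_corrupt` and the sample data are hypothetical, and a real collator would still have to convert pieces to ids, append `</s>`, and pad the batch.

```python
def span_corrupt(pieces, spans, masked):
    """Build T5-style (input, target) piece lists for one example.

    pieces: subword pieces of the example
    spans:  (start, end) index pairs, end exclusive, one per word
    masked: set of indices into `spans` naming the words to mask
    """
    inputs, targets = [], []
    sentinel = 0
    prev_masked = False
    for w, (start, end) in enumerate(spans):
        if w in masked:
            if not prev_masked:
                # A new run of masked words begins: one sentinel stands in
                # for the run in the input and introduces it in the target.
                token = f"<extra_id_{sentinel}>"
                inputs.append(token)
                targets.append(token)
                sentinel += 1
            targets.extend(pieces[start:end])
            prev_masked = True
        else:
            inputs.extend(pieces[start:end])
            prev_masked = False
    # The target is closed by one more sentinel.
    targets.append(f"<extra_id_{sentinel}>")
    return inputs, targets


# Illustrative pieces and word spans; not actual tokenizer output.
pieces = ["\u2581Whole", "\u2581word", "\u2581mask", "ing", "\u2581for", "\u2581T", "5"]
spans = [(0, 1), (1, 2), (2, 4), (4, 5), (5, 7)]
inputs, targets = span_corrupt(pieces, spans, masked={2, 4})
print(inputs)
print(targets)
```

Because whole spans are masked, a word split into several pieces is always hidden or kept in full, which is the whole-word behaviour asked about. Adjacent masked words share a single sentinel in this sketch.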

