What would be the best strategy to mask only specific words during LM training?
My aim is to mask only the words of interest, which I have previously collected in a list.
The issue arises because the tokenizer not only splits a single word into multiple tokens, but also adds special characters if the word does not occur at the beginning of a sentence.
E.g.:
The word “Valkyria”:
- at the beginning of a sentence it gets split as ['V', 'alky', 'ria'], with corresponding IDs [846, 44068, 6374];
- while in the middle of a sentence it gets split as ['ĠV', 'alky', 'ria'], with corresponding IDs [468, 44068, 6374].
This is just one of the issues forcing me to have multiple entries in my list of to-be-filtered IDs.
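To illustrate, something like the following sketch produces the two ID variants per word (assuming a byte-level BPE tokenizer; "roberta-base" and the variable names are only placeholders for illustration):

```python
from transformers import AutoTokenizer

# Example only: any byte-level BPE tokenizer (RoBERTa/GPT-2 style) shows the
# same behaviour; "roberta-base" is just a placeholder model name.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

words_of_interest = ["Valkyria"]  # my previously collected list of words

id_variants = {}
for word in words_of_interest:
    # Variant 1: word at the beginning of a sentence (no leading space).
    start_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    # Variant 2: word in the middle of a sentence (leading space -> "Ġ" prefix).
    middle_ids = tokenizer(" " + word, add_special_tokens=False)["input_ids"]
    id_variants[word] = [start_ids, middle_ids]

print(id_variants)  # two different ID sequences per word
```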
I have already had a look at the mask_tokens() function in the DataCollatorForLanguageModeling class, which is the function that actually masks the tokens during each batch, but I cannot find an efficient and clean way to mask only specific words and their corresponding IDs.
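Conceptually, what I would like the masking step to do is something like the sketch below: find occurrences of the target ID sequences in each batch and mask only those spans, producing labels only for those positions. The helper below is my own rough sketch, not how the collator is actually implemented; it assumes the usual -100 ignore index for the loss:

```python
import torch

def mask_target_spans(input_ids, target_id_sequences, mask_token_id):
    """Rough sketch (my own helper, not part of transformers): mask only the
    spans that match one of the target subword-ID sequences.

    input_ids: LongTensor of shape (batch_size, seq_len)
    target_id_sequences: list of lists of token IDs (all variants of each word)
    """
    inputs = input_ids.clone()
    labels = torch.full_like(inputs, -100)  # -100 = ignored by the loss

    for row in range(inputs.size(0)):
        original = inputs[row].tolist()
        for seq in target_id_sequences:
            n = len(seq)
            for start in range(len(original) - n + 1):
                if original[start:start + n] == seq:
                    # Predict the original word tokens at these positions ...
                    labels[row, start:start + n] = torch.tensor(seq)
                    # ... and replace them with the mask token in the input.
                    inputs[row, start:start + n] = mask_token_id
    return inputs, labels
```

Scanning every batch for every ID variant like this feels clumsy and slow, which is why I am asking whether there is a cleaner hook in the data collator (or the tokenizer) for masking only a predefined set of words.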