Masked language modelling on specific words


I would like to fine-tune a masked language model (based on CamemBERT) in order to predict some words in a text or a sentence.

During the training procedure, I want to mask specific words in order to force the model to focus on them. Indeed, at test time the model will only have to predict these specific words and nothing else.

My concern is that most of these specific words are not in the vocabulary and are therefore tokenized into sub-tokens. For instance, take the sentence “je rentre bredouille” (“I’m coming back empty-handed”), where the word to mask is “bredouille”. When I tokenize it, it becomes:
[‘▁je’, ‘▁rentre’, ‘▁bre’, ‘d’, ‘ouille’]. How should I handle this? Should I mask it like this: [‘▁je’, ‘▁rentre’, ‘MASK’, ‘MASK’, ‘MASK’]? If so, how will the model be able to predict ‘bredouille’ as a single token?
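(For context, one common approach, the one used by whole-word masking, is to mask every sub-token of the target word; the model then predicts each piece at its own position rather than the whole word at once. A minimal sketch, with a hypothetical helper `mask_word_pieces` and CamemBERT’s `<mask>` string assumed as the mask token:)

```python
def mask_word_pieces(tokens, start, end, mask_token="<mask>"):
    """Replace every sub-token in positions [start, end) with the mask token,
    so the model must predict each piece of the masked word."""
    return tokens[:start] + [mask_token] * (end - start) + tokens[end:]

tokens = ["▁je", "▁rentre", "▁bre", "d", "ouille"]
# "bredouille" occupies positions 2..5 after tokenization.
masked = mask_word_pieces(tokens, 2, 5)
# masked == ["▁je", "▁rentre", "<mask>", "<mask>", "<mask>"]
```

At prediction time you would read off the top prediction at each masked position and re-join the pieces, so the word is never predicted as a single token.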

I have a subsidiary question: if my issue can be solved, how can I use the final trained model to produce word embeddings?
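(A common recipe for word embeddings from a sub-word model is to mean-pool the hidden states of the word’s sub-tokens. A minimal sketch with a toy array standing in for the model output; with a real model the `hidden` array would come from something like `outputs.last_hidden_state` of a `CamembertModel`:)

```python
import numpy as np

def word_embedding(hidden_states, start, end):
    """Mean-pool the hidden states of a word's sub-tokens into one vector.

    hidden_states: (seq_len, hidden_dim) array for one sentence.
    [start, end): positions of the word's sub-tokens.
    """
    return hidden_states[start:end].mean(axis=0)

# Toy stand-in: 5 tokens, hidden size 4; "bredouille" spans positions 2..5.
hidden = np.arange(20, dtype=float).reshape(5, 4)
vec = word_embedding(hidden, 2, 5)
# vec is one hidden_dim-sized vector for the whole word.
```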

Thank you very much,

To deal with the vocabulary change, I had to (1) get the vocab from the current model tokenizer with tokenizer.get_vocab(), (2) compare my custom vocab with the model tokenizer's vocab, (3) add my tokens to the tokenizer vocab with tokenizer.add_tokens(add_vocab), and (4) resize the model for the updated vocab with model.resize_token_embeddings(len(tokenizer)), and cross my fingers that Trainer still works :slight_smile: (it would be very nice if Trainer could auto-resize the model for an updated vocab instead of crashing).
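The steps above can be sketched as follows (the `vocab` dict here is a toy stand-in for the real `tokenizer.get_vocab()`, and the actual transformers calls are left as comments since they need a downloaded model):

```python
def tokens_to_add(custom_vocab, tokenizer_vocab):
    """Step (2): keep only the custom tokens missing from the model's
    vocab, preserving their order."""
    return [tok for tok in custom_vocab if tok not in tokenizer_vocab]

# Step (1), toy stand-in for tokenizer.get_vocab():
vocab = {"▁je": 0, "▁rentre": 1, "▁bre": 2, "d": 3, "ouille": 4}
add_vocab = tokens_to_add(["▁je", "▁bredouille"], vocab)
# add_vocab == ["▁bredouille"]

# With a real tokenizer and model, the remaining steps are:
# tokenizer.add_tokens(add_vocab)                  # step (3)
# model.resize_token_embeddings(len(tokenizer))    # step (4)
```

The newly added rows of the embedding matrix are randomly initialised, so they only become useful after fine-tuning.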
