Masked language modelling on specific words


I would like to fine-tune a masked language model (based on CamemBERT) in order to predict some words in a text or a sentence.

During the training procedure, I want to mask specific words in order to force the model to focus on them. Indeed, at test time the model will only have to predict these specific words and nothing else.

My concern is that most of these specific words are not in the vocabulary and are therefore tokenized into sub-tokens. For instance, take the sentence “je rentre bredouille” (“I’m coming back empty-handed”), where the word to mask is “bredouille”. When I tokenize it, it becomes:
[‘▁je’, ‘▁rentre’, ‘▁bre’, ‘d’, ‘ouille’]. How should I handle this? Should I mask it like this: [‘▁je’, ‘▁rentre’, ‘MASK’, ‘MASK’, ‘MASK’]? If so, how will the model be able to predict ‘bredouille’ as a single token?
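(For context, one common approach, the one used by whole-word masking, is to mask every sub-token of the target word; the model then predicts each piece at its own position rather than the whole word at once. A minimal sketch, with a hypothetical helper `mask_word_pieces` and CamemBERT’s `<mask>` string assumed as the mask token:)

```python
def mask_word_pieces(tokens, start, end, mask_token="<mask>"):
    """Replace every sub-token in positions [start, end) with the mask token,
    so the model must predict each piece of the masked word."""
    return tokens[:start] + [mask_token] * (end - start) + tokens[end:]

tokens = ["▁je", "▁rentre", "▁bre", "d", "ouille"]
# "bredouille" occupies positions 2..5 after tokenization.
masked = mask_word_pieces(tokens, 2, 5)
# masked == ["▁je", "▁rentre", "<mask>", "<mask>", "<mask>"]
```

At prediction time you would read off the top prediction at each masked position and re-join the pieces, so the word is never predicted as a single token.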

I have a subsidiary question: if my issue can be solved, how can I use the final trained model to produce word embeddings?
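(A common recipe for word embeddings from a sub-word model is to mean-pool the hidden states of the word’s sub-tokens. A minimal sketch with a toy array standing in for the model output; with a real model the `hidden` array would come from something like `outputs.last_hidden_state` of a `CamembertModel`:)

```python
import numpy as np

def word_embedding(hidden_states, start, end):
    """Mean-pool the hidden states of a word's sub-tokens into one vector.

    hidden_states: (seq_len, hidden_dim) array for one sentence.
    [start, end): positions of the word's sub-tokens.
    """
    return hidden_states[start:end].mean(axis=0)

# Toy stand-in: 5 tokens, hidden size 4; "bredouille" spans positions 2..5.
hidden = np.arange(20, dtype=float).reshape(5, 4)
vec = word_embedding(hidden, 2, 5)
# vec is one hidden_dim-sized vector for the whole word.
```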

Thank you very much,

To deal with the vocabulary change, I had to (1) get the vocab from the current model tokenizer with tokenizer.get_vocab(), (2) compare my custom vocab with the model tokenizer's vocab, (3) add my tokens to the tokenizer vocab with tokenizer.add_tokens(add_vocab), and (4) resize the model for the updated vocab with model.resize_token_embeddings(len(tokenizer)), and cross my fingers that Trainer still works :slight_smile: (it would be very nice if Trainer could auto-resize the model for an updated vocab instead of crashing).
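The steps above can be sketched as follows (the `vocab` dict here is a toy stand-in for the real `tokenizer.get_vocab()`, and the actual transformers calls are left as comments since they need a downloaded model):

```python
def tokens_to_add(custom_vocab, tokenizer_vocab):
    """Step (2): keep only the custom tokens missing from the model's
    vocab, preserving their order."""
    return [tok for tok in custom_vocab if tok not in tokenizer_vocab]

# Step (1), toy stand-in for tokenizer.get_vocab():
vocab = {"▁je": 0, "▁rentre": 1, "▁bre": 2, "d": 3, "ouille": 4}
add_vocab = tokens_to_add(["▁je", "▁bredouille"], vocab)
# add_vocab == ["▁bredouille"]

# With a real tokenizer and model, the remaining steps are:
# tokenizer.add_tokens(add_vocab)                  # step (3)
# model.resize_token_embeddings(len(tokenizer))    # step (4)
```

The newly added rows of the embedding matrix are randomly initialised, so they only become useful after fine-tuning.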
