I would like to fine-tune a masked language model (based on CamemBERT) to predict certain words in a text or sentence.
During training, I want to mask those specific words in order to force the model to focus on them. Indeed, on the test data the model will only have to predict these specific words and nothing else.
My concern is that most of these specific words are not in the vocabulary and are therefore split into sub-tokens. For instance, take the sentence "je rentre bredouille", where the word to mask is "bredouille". When I tokenize it, it becomes:
['▁je', '▁rentre', '▁bre', 'd', 'ouille']. How should I handle this? Should I mask it like this: ['▁je', '▁rentre', '<mask>', '<mask>', '<mask>']? If so, how will the model be able to predict 'bredouille', given that each mask position only yields a single token?
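To make the idea concrete, here is a minimal sketch of the masking I have in mind (the token ids and the mask id are made up, not the real CamemBERT vocabulary): every sub-token of the target word is replaced by the mask id, and the original ids are kept as labels so that, with the standard MLM loss, the model predicts each sub-token ('▁bre', 'd', 'ouille') at its own position rather than the whole word in one slot:

```python
MASK_ID = 32004   # hypothetical id of the <mask> token
IGNORE = -100     # positions labelled -100 are skipped by the MLM loss

def mask_word_span(input_ids, span_start, span_end):
    """Mask input_ids[span_start:span_end] and build per-position labels."""
    masked = list(input_ids)
    labels = [IGNORE] * len(input_ids)
    for i in range(span_start, span_end):
        labels[i] = input_ids[i]  # model must recover this sub-token
        masked[i] = MASK_ID
    return masked, labels

# toy ids standing in for ['▁je', '▁rentre', '▁bre', 'd', 'ouille']
ids = [101, 202, 303, 404, 505]
masked, labels = mask_word_span(ids, 2, 5)
# masked -> [101, 202, 32004, 32004, 32004]
# labels -> [-100, -100, 303, 404, 505]
```

So the question is whether this per-sub-token prediction is the right way to train for whole unknown words.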
I have a follow-up question: if my issue can be solved, how can I use the final trained model to produce word embeddings?
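For the embedding part, the approach I was considering is to mean-pool the hidden states of a word's sub-tokens. Here is a sketch with dummy vectors standing in for the model's last hidden states (the actual model call is omitted; only the pooling step is shown):

```python
import numpy as np

# Dummy 2-d hidden states, one row per sub-token of
# ['▁je', '▁rentre', '▁bre', 'd', 'ouille'].
hidden = np.array([
    [1.0, 0.0],   # '▁je'
    [0.0, 1.0],   # '▁rentre'
    [2.0, 2.0],   # '▁bre'
    [4.0, 0.0],   # 'd'
    [0.0, 4.0],   # 'ouille'
])

def word_embedding(hidden_states, span_start, span_end):
    """Mean-pool the sub-token vectors that make up one word."""
    return hidden_states[span_start:span_end].mean(axis=0)

emb = word_embedding(hidden, 2, 5)  # embedding for 'bredouille'
# emb -> array([2., 2.])
```

Is mean-pooling over the sub-tokens a reasonable way to get a single vector per word, or is there a better practice?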
Thank you very much,