I want to fine-tune BERT on a specific dataset. My problem is that I do not want to mask tokens of my training dataset randomly: I have already chosen which tokens I want to mask (for certain reasons).
To do so, I created a dataset with two columns: `text`, in which some tokens have been replaced with `[MASK]` (I am aware that some words are tokenised into more than one token, and I took care of that), and `label`, which contains the whole original text.
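For concreteness, here is a toy version of the dataset with made-up sentences (the second row shows a word that tokenises into more than one token, so it gets more than one `[MASK]`):

```python
# Toy version of my two-column dataset (sentences are made up):
rows = [
    {"text": "The capital of France is [MASK].",
     "label": "The capital of France is Paris."},
    # A word split into two wordpieces gets two [MASK] tokens:
    {"text": "She plays the [MASK] [MASK] every day.",
     "label": "She plays the accordion every day."},
]
```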
Now I want to fine-tune a BERT model (say, `bert-base-uncased`) using Hugging Face's `transformers` library, but I do not want to use `DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.2)`, where the masking is done randomly and I can only control the probability. What can I do?
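To clarify what I am after, this is roughly the labelling logic I have in mind, as a pure-Python sketch (no `transformers`): given the token ids of my masked `text` and of the full `label`, keep the target id only at `[MASK]` positions and use `-100` everywhere else, since `-100` is the index that PyTorch's cross-entropy loss ignores. Here `103` is `[MASK]`'s id in the `bert-base-uncased` vocabulary, and the other ids in the example are made up:

```python
MASK_ID = 103  # id of [MASK] in bert-base-uncased's vocabulary

def build_labels(input_ids, target_ids, mask_id=MASK_ID):
    """Compute loss only on my chosen masked positions.

    input_ids:  token ids of the pre-masked text.
    target_ids: token ids of the full original text (same length).
    Returns a label list with the target id at [MASK] positions
    and -100 (ignored by cross-entropy) everywhere else.
    """
    assert len(input_ids) == len(target_ids)
    return [t if i == mask_id else -100
            for i, t in zip(input_ids, target_ids)]

# Example with made-up token ids: only position 2 is masked.
inputs  = [101, 1996, 103, 1012, 102]
targets = [101, 1996, 3000, 1012, 102]
print(build_labels(inputs, targets))  # [-100, -100, 3000, -100, -100]
```

Is precomputing labels like this (and then skipping the MLM collator entirely) the right way to go, or is there a built-in mechanism for deterministic masking?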