How to customize BERT MLM task

I want to use my own domain data to train a BERT model, but my data has some special characteristics:

  • length distribution: over 70% of my sequences are shorter than 5 tokens, and the longest is 14;

  • format: each example is a list of numbers representing the AS-PATH of a BGP announcement.

So if I use this dataset for the BERT MLM task, then with the default masking method a sentence may end up with only one unmasked token while all the others are masked, leaving the model nothing to infer from.
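
To illustrate, here is a minimal sketch of what the default masking does to one short sequence with DataCollatorForLanguageModeling; the checkpoint name and the AS numbers are just placeholders, not my real setup:

```python
# Minimal sketch: default 15% masking applied to one short sequence.
# "bert-base-uncased" and the AS numbers below are placeholders.
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# One "sentence": an AS-PATH with four AS numbers, written as space-separated text.
encoding = tokenizer("3356 1299 6453 4134", return_special_tokens_mask=True)
batch = collator([encoding])

print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0].tolist()))  # some tokens become [MASK]
print(batch["labels"][0])  # -100 everywhere except the masked positions
```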

I want to know how to deal with this problem. THANKS!!

You can take a look at fine-tuning BERT here: Fine-tuning a masked language model - Hugging Face NLP Course.
There, instead of masking examples individually, all examples are concatenated together, chunked into equal-length pieces, and then the mask is created. This can help with training on your shorter sequences (a rough sketch of the preprocessing is below).
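
Roughly, the preprocessing looks like this; chunk_size is an arbitrary value and the column names are whatever your tokenized dataset contains, so treat it as a sketch rather than a drop-in recipe:

```python
# Sketch of concatenate-and-chunk preprocessing (adapted from the NLP course).
# chunk_size and the dataset column names are assumptions to adapt.
chunk_size = 128

def group_texts(examples):
    # Concatenate every field (input_ids, attention_mask, ...) across examples.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    # Drop the last partial chunk so every chunk is exactly chunk_size tokens long.
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # For MLM, labels start as a copy of input_ids; the data collator masks them later.
    result["labels"] = result["input_ids"].copy()
    return result

# lm_dataset = tokenized_dataset.map(group_texts, batched=True)
```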

Hi, @saurav. Thanks for your advice. I’ve read that approach, and here is my question: if I concatenate these data into one long text, the position embedding loses its meaning. Is that right?

Not necessarily. Even after concatenating and chunking, the relative positions of the words remain the same. We should not think of the position embedding as an absolute position when we are trying to learn relationships between different parts of a sentence, as is the case in BERT.

@saurav Thanks for your patience :pleading_face:. But what about the data semantics? If you concatenate and chunk, a single sentence can be split into several parts, and those parts get merged with pieces of other sentences into a new one. During this process the semantics of the original sentence are destroyed. I don’t know if I understand this correctly.

We generally use special tokens like [SEP] and [CLS] to mark the separation between different sentences.
Taking an example from the article above, we can see how these special tokens are used to separate different sentences:

i expect that many adults of my age think that [MASK]mwell [MASK] is [MASK] fetched. what a pity that it isn't! [SEP] [CLS] [MASK]ness ( or [MASK]lessness as george 宇in stated )公 been an issue for years but never [MASK] plan to help those on the street that were once considered human [MASK] did everything from going to school
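
You can reproduce this yourself by tokenizing two short sequences separately and concatenating them; the numbers below are made up:

```python
# Sketch: two made-up short "sentences" tokenized separately, then concatenated.
# A [SEP] [CLS] pair marks the boundary between them inside the chunk.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

ids_a = tokenizer("3356 1299 6453")["input_ids"]
ids_b = tokenizer("7018 3320")["input_ids"]

print(tokenizer.convert_ids_to_tokens(ids_a + ids_b))
# The printed tokens start with [CLS], and a [SEP] followed by [CLS] shows
# where the first sequence ends and the second begins.
```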

We will lose some semantics for a few sentences, but if the dataset is large enough this should not be a concern for the overall learning. We can also add some overlap between chunks if semantic loss is a major concern (see the sketch below).
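
One simple way to get that overlap is to step through the concatenated tokens by less than the chunk size; this is only a sketch, and the chunk_size and overlap values are arbitrary and would need tuning:

```python
# Sketch of chunking with overlap, as a variation on the group_texts idea above;
# chunk_size and overlap are arbitrary values to tune.
chunk_size, overlap = 128, 32
step = chunk_size - overlap

def group_texts_with_overlap(examples):
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    result = {
        # Adjacent chunks share `overlap` tokens because the start index
        # advances by `step` instead of `chunk_size`.
        k: [t[i : i + chunk_size] for i in range(0, total_length - chunk_size + 1, step)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# lm_dataset = tokenized_dataset.map(group_texts_with_overlap, batched=True)
```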

@saurav I totally understand what you mean, but in my situation the average length in my dataset is only around 4.5 tokens. If I apply the same concatenate-and-chunk approach, two sequences that should not be adjacent are forced to be merged together. Is this a problem? Thanks again for your clear explanation!