How to customize BERT MLM task

I want to use my own domain data to train a BERT model, but my data has some special characteristics:

  • length distribution: over 70% of my sequences are shorter than 5 tokens, and the longest is 14;

  • format: each example is a list of numbers representing the AS-PATH of a BGP announcement.

So if I use this dataset for the BERT MLM task, then with the default masking method a sentence may end up with only one unmasked token while all the others are masked, leaving the model nothing to infer from.
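
To illustrate, here is a minimal sketch of what the default masking does to one short sequence with DataCollatorForLanguageModeling; the checkpoint name and the AS numbers are just placeholders, not my real setup:

```python
# Minimal sketch: default 15% masking applied to one short sequence.
# "bert-base-uncased" and the AS numbers below are placeholders.
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# One "sentence": an AS-PATH with four AS numbers, written as space-separated text.
encoding = tokenizer("3356 1299 6453 4134", return_special_tokens_mask=True)
batch = collator([encoding])

print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0].tolist()))  # some tokens become [MASK]
print(batch["labels"][0])  # -100 everywhere except the masked positions
```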

I want to know how to deal with this problem. THANKS!!

You can take a look at fine-tuning BERT here: Fine-tuning a masked language model - Hugging Face NLP Course.
There, instead of masking examples individually, all examples are concatenated together, chunked into equal-length pieces, and then the mask is created. This can help with training on your shorter sequences (a rough sketch of the preprocessing is below).
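
Roughly, the preprocessing looks like this; chunk_size is an arbitrary value and the column names are whatever your tokenized dataset contains, so treat it as a sketch rather than a drop-in recipe:

```python
# Sketch of concatenate-and-chunk preprocessing (adapted from the NLP course).
# chunk_size and the dataset column names are assumptions to adapt.
chunk_size = 128

def group_texts(examples):
    # Concatenate every field (input_ids, attention_mask, ...) across examples.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    # Drop the last partial chunk so every chunk is exactly chunk_size tokens long.
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # For MLM, labels start as a copy of input_ids; the data collator masks them later.
    result["labels"] = result["input_ids"].copy()
    return result

# lm_dataset = tokenized_dataset.map(group_texts, batched=True)
```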

Hi, @saurav. Thanks for your advice. I’ve read that approach, and here is my question: if I concatenate these data into one long text, the position embedding loses its meaning. Is that right?

Not necessarily. Even after concatenating and chunking, the relative positions of the words remain the same. We should not think of the position embedding as an absolute position when we are trying to learn relationships between different parts of a sentence, as is the case in BERT.

@saurav Thanks for your patience :pleading_face:. But what about the data semantics? If you concatenate and chunk, a single sentence can be split into several parts, and those parts get merged with pieces of other sentences into a new one. During this process the semantics of the original sentence are destroyed. I don’t know if I understand this correctly.

We generally use special tokens like [SEP] and [CLS] to mark the separation between different sentences.
Taking an example from the article above, we can see how these special tokens are used to separate different sentences:

i expect that many adults of my age think that [MASK]mwell [MASK] is [MASK] fetched. what a pity that it isn't! [SEP] [CLS] [MASK]ness ( or [MASK]lessness as george 宇in stated )公 been an issue for years but never [MASK] plan to help those on the street that were once considered human [MASK] did everything from going to school
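
You can reproduce this yourself by tokenizing two short sequences separately and concatenating them; the numbers below are made up:

```python
# Sketch: two made-up short "sentences" tokenized separately, then concatenated.
# A [SEP] [CLS] pair marks the boundary between them inside the chunk.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

ids_a = tokenizer("3356 1299 6453")["input_ids"]
ids_b = tokenizer("7018 3320")["input_ids"]

print(tokenizer.convert_ids_to_tokens(ids_a + ids_b))
# The printed tokens start with [CLS], and a [SEP] followed by [CLS] shows
# where the first sequence ends and the second begins.
```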

We will lose some semantics for a few sentences, but if the dataset is large enough this should not be a concern for the overall learning. We can also add some overlap between chunks if semantic loss is a major concern (see the sketch below).
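
One simple way to get that overlap is to step through the concatenated tokens by less than the chunk size; this is only a sketch, and the chunk_size and overlap values are arbitrary and would need tuning:

```python
# Sketch of chunking with overlap, as a variation on the group_texts idea above;
# chunk_size and overlap are arbitrary values to tune.
chunk_size, overlap = 128, 32
step = chunk_size - overlap

def group_texts_with_overlap(examples):
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    result = {
        # Adjacent chunks share `overlap` tokens because the start index
        # advances by `step` instead of `chunk_size`.
        k: [t[i : i + chunk_size] for i in range(0, total_length - chunk_size + 1, step)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# lm_dataset = tokenized_dataset.map(group_texts_with_overlap, batched=True)
```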

@saurav I totally understand what you mean, but in my situation the average length in my dataset is only around 4.5 tokens. If I apply the same concatenate-and-chunk approach, two sequences that should not be adjacent are forced to be merged together. Is this a problem? Thanks again for your clear explanation!