Pre-Train BERT (from scratch)

BERT was trained on the MLM and NSP objectives. I want to train BERT both with and without the NSP objective (with NSP in case the suggested approach is different). I haven’t done full pre-training before. Can you please share how to obtain the data BERT was trained on (the crawl and tokenization details that were used)? Since it takes a lot of time, I am looking for well-tested code that can yield BERT with/without NSP in one go. Any suggestions will be helpful.
I know about some projects like these, but I guess they won’t integrate well with transformers, which is a must-have condition in my case.

3 Likes

BERT was trained on BookCorpus and English Wikipedia, both of which are available in the datasets library.

Transformers has recently included a dataset for next sentence prediction which you could use,

and there’s also an NSP head for BERT:
https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_bert.py#L560

EDIT:
The BertForPreTraining class can be used for both MLM and NSP.

With the current examples/language-modeling scripts I guess it’s only possible to use either MLM or NSP; you might need to write your own script to combine them.
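
For reference, here’s a minimal sketch (toy inputs, randomly initialised weights) of how BertForPreTraining accepts both sets of labels and returns the summed MLM + NSP loss; the label values here are just placeholders:

import torch
from transformers import BertConfig, BertForPreTraining, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForPreTraining(BertConfig())  # fresh weights, i.e. training from scratch

enc = tokenizer("The cat sat on the mat.", "It was warm there.", return_tensors="pt")
# Toy labels: in real MLM training only the ~15% masked positions keep their
# token id and all other positions are set to -100 so they are ignored.
labels = enc["input_ids"].clone()
out = model(**enc, labels=labels, next_sentence_label=torch.tensor([0]))
print(out.loss)  # sum of the MLM and NSP cross-entropy losses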

2 Likes

For training on the MLM objective, is it recommended to use the collate_fn from here? I didn’t see a TextDataset for the MLM objective.

Masking is done by DataCollatorForLanguageModeling, so you can use any dataset and just pass the collator to the DataLoader.
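
For example, something along these lines (assuming tokenized_dataset is your own dataset whose items contain only token fields such as "input_ids"):

from torch.utils.data import DataLoader
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# The collator pads the batch, masks ~15% of the tokens and builds the "labels" tensor.
loader = DataLoader(tokenized_dataset, batch_size=16, collate_fn=collator)
batch = next(iter(loader))
print(batch["input_ids"].shape, batch["labels"].shape)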

One thing to note:
DataCollatorForLanguageModeling does dynamic masking, but BERT was trained using static masking.
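
If you want to stay closer to the original setup (which generated several statically masked copies of each sequence up front), one option is to apply the collator once at preprocessing time and freeze the result. A rough sketch, assuming tokenized_dataset already holds fixed-length "input_ids":

from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

def freeze_masks(batch):
    # Run the (dynamic) collator once during preprocessing so every epoch sees
    # the same masks; at training time use default_data_collator instead.
    features = [{"input_ids": ids} for ids in batch["input_ids"]]
    masked = collator(features)
    return {"input_ids": masked["input_ids"].tolist(),
            "labels": masked["labels"].tolist()}

static_dataset = tokenized_dataset.map(freeze_masks, batched=True)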

1 Like

It seems that using BertForPreTraining with TextDatasetForNextSentencePrediction and DataCollatorForLanguageModeling would be equivalent to the BERT objective (except for the static masking part). And for the dataset, we can use the datasets.concatenate_datasets() method on BookCorpus and Wikipedia. This should be close, right? Any additional details?
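
A rough sketch of the wiring I have in mind (the file path is a placeholder, and depending on the transformers version the collator may drop next_sentence_label from the batch, so that needs checking):

from transformers import (BertConfig, BertForPreTraining, BertTokenizerFast,
                          DataCollatorForLanguageModeling, TextDatasetForNextSentencePrediction,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForPreTraining(BertConfig())  # randomly initialised, i.e. from scratch

# The file is expected to contain one sentence per line, with blank lines
# separating documents (the format the original BERT pre-training data used).
dataset = TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer,
    file_path="corpus.txt",  # placeholder path
    block_size=128,
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-from-scratch", per_device_train_batch_size=16),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()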

2 Likes

datasets.concatenate_datasets() does not seem to work for this since the features do not match. Also, TextDatasetForNextSentencePrediction expects a file_path; initially I thought it was a wrapper that could take datasets objects.

It shouldn’t be hard to convert TextDatasetForNextSentencePrediction to use datasets. I played with the English wikipedia dataset just now. Each entry is an article/document, and it needs to be sentence-tokenized for TextDatasetForNextSentencePrediction. The bookcorpus entries already seem to be sentences. Let me know about your progress.

What metric are you using to measure progress?

I’m not measuring anything yet; I am still setting up the training pipelines. I asked about metrics in Evaluation metrics for BERT-like LMs but have had no response yet. I read at https://huggingface.co/transformers/perplexity.html and elsewhere that perplexity is not appropriate for BERT and other MLMs. Can’t we use the fill-mask pipeline and some version of masked-token accuracy?
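
Something like this toy check is what I had in mind (the checkpoint path and examples are made up, and pipeline argument names may differ slightly between versions):

from transformers import pipeline

fill = pipeline("fill-mask", model="path/to/your-checkpoint")  # placeholder checkpoint

# Tiny held-out set with one masked token per sentence; "gold" is the expected fill.
eval_examples = [
    {"text": "The capital of France is [MASK].", "gold": "paris"},
    {"text": "Water freezes at zero degrees [MASK].", "gold": "celsius"},
]

correct = 0
for ex in eval_examples:
    top = fill(ex["text"], top_k=1)[0]  # best prediction for the masked position
    correct += int(top["token_str"].strip().lower() == ex["gold"])
print("masked-token accuracy:", correct / len(eval_examples))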

OTOH, I’ve already set up GLUE benchmarks with https://jiant.info/ v2 Alpha. It has excellent integration with transformers, and you can easily plug in any model and run benchmarks in parallel. See https://github.com/jiant-dev/jiant/tree/master/examples for more details.

Did you try using cross-entropy for pre-training? We usually use that for MLM, and I guess it can easily be used for NSP as well.
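
Roughly, I mean something like this, which mirrors how BertForPreTraining sums the two cross-entropy terms (the tensor shapes in the comments are assumptions about your setup):

import torch.nn as nn

loss_fct = nn.CrossEntropyLoss()  # ignore_index defaults to -100, matching the MLM label convention

# prediction_scores: (batch, seq_len, vocab_size) logits from the MLM head
# seq_relationship_score: (batch, 2) logits from the NSP head
# mlm_labels: (batch, seq_len), -100 everywhere except the masked positions
# nsp_labels: (batch,), 0 = "B follows A", 1 = "B is a random sentence"
def pretraining_loss(prediction_scores, seq_relationship_score, mlm_labels, nsp_labels):
    vocab_size = prediction_scores.size(-1)
    mlm_loss = loss_fct(prediction_scores.view(-1, vocab_size), mlm_labels.view(-1))
    nsp_loss = loss_fct(seq_relationship_score.view(-1, 2), nsp_labels.view(-1))
    return mlm_loss + nsp_loss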

Indeed, wikipedia has the columns “text” and “title” while bookcorpus only has “text”.
You can concatenate them after removing the “title” column from wikipedia:

from datasets import load_dataset, concatenate_datasets

wiki = load_dataset("wikipedia", "20200501.en", split="train")
bookcorpus = load_dataset("bookcorpus", split="train")
print(wiki.column_names, bookcorpus.column_names)
# ['title', 'text'] ['text']

wiki.remove_columns_("title")
bert_dataset = concatenate_datasets([wiki, bookcorpus])
4 Likes

Let me know if you find an appropriate way to cut the wikipedia articles into sentences!
Also, don’t hesitate to ask if you have any questions about dataset processing, I’d be happy to help :slight_smile:

You can use spaCy or stanza for sentence segmentation. spaCy is quite a bit faster but might be less accurate. If you want, I can post a segmentation function here.
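
For example, a minimal sketch using spaCy’s rule-based sentencizer (spaCy v3 syntax; it adds a "sentences" column next to the wikipedia "text" column):

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based, much faster than a full pipeline

def segment(batch):
    # batch["text"] is a list of articles; nlp.pipe streams them efficiently
    docs = nlp.pipe(batch["text"], batch_size=64)
    return {"sentences": [[sent.text.strip() for sent in doc.sents] for doc in docs]}

wiki = wiki.map(segment, batched=True, batch_size=1000)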

1 Like

So after concatenating wikipedia and bookcorpus, the next thing to do is NSP. Can you suggest how that should be done on the concatenated object?
I don’t want to diverge from the actual method that was used to pre-train BERT.

You can have a look here:

Has anyone replicated BERT pre-training from scratch? It would be good to hear what exactly they did.

I already saw it. I tried using it, but got stuck on other things such as metrics, preprocessing, etc. Given that training will last a week, there is not much room for error.

Also, is there any study, or has anyone experimented with, what happens if we rely solely on MLM and drop NSP? How much difference does that make? RoBERTa showed that NSP didn’t prove to be useful. In that case, does including NSP help with MLM?

Well, as you found, RoBERTa showed that leaving out NSP yields better results on downstream tasks. ALBERT then re-added a similar (yet very different) task, namely sentence order prediction, which improved performance on downstream tasks.
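
For intuition, ALBERT builds pairs from consecutive segments of the same document and swaps them half of the time; a toy sketch (not ALBERT’s actual preprocessing code, and the label convention here is arbitrary):

import random

def make_sop_pairs(sentences, swap_prob=0.5):
    # Build (segment_a, segment_b, label) triples from consecutive sentences:
    # label 0 = original order, label 1 = swapped order.
    pairs = []
    for a, b in zip(sentences, sentences[1:]):
        if random.random() < swap_prob:
            pairs.append((b, a, 1))
        else:
            pairs.append((a, b, 0))
    return pairs

print(make_sop_pairs(["First sentence.", "Second sentence.", "Third sentence."]))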

PS: please don’t post multiple consecutive posts but rather edit your posts to add more information. It’s a bit annoying with the notifications. :slight_smile:

3 Likes

Quentin, I am not sure the dataset itself should cut articles into sentences (unless there is an option for both articles and sentences). Other models might need entire articles as input. If needed, users can sentence-tokenize the articles using nltk/spaCy and such. I’ll play with the wikipedia dataset in the coming days and report back on my experience. Also, while looking at the dataset I found references to Categories and such; perhaps an equally important objective for the wikipedia dataset is to keep it as clean as possible.
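
For the Categories leftovers, a simple filter like this might be a starting point (purely illustrative, the regex would need tuning against the real artefacts):

import re

CATEGORY_RE = re.compile(r"^category\s*:", flags=re.IGNORECASE)

def clean_article(example):
    # Drop lines that are just "Category: ..." leftovers and collapse blank lines.
    lines = [l for l in example["text"].splitlines() if not CATEGORY_RE.match(l.strip())]
    example["text"] = "\n".join(l for l in lines if l.strip())
    return example

wiki = wiki.map(clean_article)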