Pre-Train BERT (from scratch)

I agree we should keep wikipedia as is, i.e. with full articles.
The idea is to give users full power to do whatever they want with the datasets.

Let me know how it goes !

2 Likes

Update on BERT training from scratch. I have figured out how to load the wikipedia dataset and it’s 17 GB of data :slight_smile: I can shard this large dataset and create sharded wikipedia datasets to feed into the BERT model. The issue I have now is not knowing how to properly and continually feed these sharded datasets into the Trainer.

I tried:

for shard_id in range(num_shards):
    # Select the shard_id-th of num_shards shards of the full dataset
    train_sub_dataset = dataset.shard(num_shards=num_shards, index=shard_id)
    print(f"Creating new shard of length {len(train_sub_dataset)}")
    # WikipediaDatasetForNextSentencePrediction is my custom dataset wrapper
    train_dataset = WikipediaDatasetForNextSentencePrediction(tokenizer=tokenizer, dataset=train_sub_dataset, block_size=128)
    # Swap the train dataset on the existing Trainer and train on the new shard
    trainer.train_dataset = train_dataset
    trainer.train()

For some reason this approach messes up the learning rate schedule (and possibly other Trainer state). However, if I create a new Trainer on each loop iteration, that initializes a fresh trainer. How can I continue training by just continually feeding these sharded datasets while keeping all the trainer state intact?

[Edit]: Aha, I see, if I just pass the model_path to the train method, it should work.

[More Edit]: Although I got checkpoint loading to work, training will not work as is because I change the training dataset in the loop above. Each new training dataset is therefore immediately skipped, as the global state already records that the current dataset/dataloader index has been traversed. @lhoestq

Hi @vblagoje
Why do you need to shard wikipedia?

Quentin, because I don’t think I can load all these training examples in memory. DataCollatorForLanguageModeling cannot accept an additional dimension for articles; the fill-mask operation fails because it expects only two dimensions (batch, sentence). I can build a dataset with article and sentence dimensions directly from the Wikipedia dataset without eager loading, but it needs an appropriate collator. Perhaps I can try one of the other available collators? Thoughts?
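
To illustrate the two-dimension expectation, here is a rough sketch (not code from this thread, and assuming a recent transformers version): the collator consumes a flat list of per-sentence encodings and produces (batch, sequence) tensors, so an extra article dimension has nowhere to go.

from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# One encoding per sentence - there is no room for an extra "article" dimension
examples = [tokenizer(s) for s in ["The first sentence.", "A second, slightly longer sentence."]]
batch = collator(examples)
print(batch["input_ids"].shape)  # (2, longest_sequence_in_batch), ~15% of positions masked
print(batch["labels"].shape)     # same shape; original ids at masked positions, -100 elsewhere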

Since the wikipedia dataset lives on disk, you should be able to run everything without RAM issues.

To use the DataCollatorForLanguageModeling I think you’ll need to preprocess the dataset by cutting the articles into sentences and then tokenizing them. The processed dataset will also live on disk, so you shouldn’t run into memory issues.

To chunk the articles you can check https://huggingface.co/docs/datasets/processing.html#augmenting-the-dataset
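
As a sketch of what that chunking could look like (the Wikipedia config name and the naive sentence splitter below are illustrative assumptions, not something prescribed here), a batched map lets one article row expand into many sentence rows, with the result written to disk as Arrow data:

from datasets import load_dataset

wiki = load_dataset("wikipedia", "20200501.en", split="train")  # assumed dump/config name

def split_into_sentences(batch):
    # Very naive splitting, for illustration only; a real pipeline would use e.g. nltk or spacy
    sentences = []
    for article in batch["text"]:
        sentences.extend(s.strip() for s in article.split(". ") if s.strip())
    return {"sentence": sentences}

# batched=True + remove_columns allows the number of rows to grow (one article -> many sentences)
sentence_dataset = wiki.map(split_into_sentences, batched=True, remove_columns=wiki.column_names)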

3 Likes

Ok, didn’t know about the disk! Processing a 3-billion-word corpus is going to take a while no matter how you cut it. Ideally we could use a custom Dataset variant that does the pre-processing (in this case tokenizing articles/sentences) in the background while feeding the trainer already processed data.

Indeed, preprocessing can be done on the fly during training. That’s one of the purposes of data collators :slight_smile: I guess you could write your own data collator, taking inspiration from DataCollatorForLanguageModeling.

However if you plan to do several experiments with the same dataset and same processing, I think it’s worth doing the preprocessing before training.
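
For what it's worth, a minimal sketch of that on-the-fly idea (the class name and the "text" key below are made up, and a recent transformers version is assumed): a custom collator that tokenizes raw text at batch time and then delegates the masking to DataCollatorForLanguageModeling.

from dataclasses import dataclass
from typing import Dict, List

import torch
from transformers import DataCollatorForLanguageModeling, PreTrainedTokenizerBase


@dataclass
class OnTheFlyMLMCollator:
    """Tokenizes raw sentences at batch time, then hands off masking to DataCollatorForLanguageModeling."""
    tokenizer: PreTrainedTokenizerBase
    max_length: int = 128

    def __post_init__(self):
        self.mlm_collator = DataCollatorForLanguageModeling(tokenizer=self.tokenizer, mlm=True)

    def __call__(self, examples: List[Dict[str, str]]) -> Dict[str, torch.Tensor]:
        # Each incoming example is assumed to be a raw-text row, e.g. {"text": "Some sentence."}
        encodings = self.tokenizer([example["text"] for example in examples],
                                   truncation=True, max_length=self.max_length)
        # Re-pack into one dict per example, the shape DataCollatorForLanguageModeling expects
        features = [{"input_ids": ids} for ids in encodings["input_ids"]]
        return self.mlm_collator(features)

Passed as data_collator to the Trainer, something like this keeps the dataset itself as raw text while still producing masked (batch, sequence) inputs.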

1 Like

So is it better to write a data collator for NSP, to be able to perform NSP on a datasets object? And if one were to do pre-training, how are we supposed to use data collators for LM and NSP simultaneously?

Do we create a wrapper dataset class over the dataset for NSP? What’s the efficient way to go about it? There seem to be so many variables, which might make replication hard.

I believe you have to use BertForPreTraining on top of your BERT model. Its output loss is the sum of the NSP loss and the MLM loss.
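
For reference, a rough sketch of that combined objective with dummy tensors (assuming a recent transformers version; -100 marks positions the MLM loss ignores):

import torch
from transformers import BertConfig, BertForPreTraining

model = BertForPreTraining(BertConfig())  # from-scratch model, randomly initialized

input_ids = torch.randint(0, model.config.vocab_size, (2, 16))  # dummy batch: 2 sequences, 16 tokens each
mlm_labels = torch.full((2, 16), -100)                          # -100 = ignore this position in the MLM loss
mlm_labels[:, 5] = input_ids[:, 5]                              # pretend position 5 was masked in each sequence
next_sentence_label = torch.tensor([0, 1])                      # 0 = true next sentence, 1 = random sentence

outputs = model(input_ids=input_ids, labels=mlm_labels, next_sentence_label=next_sentence_label)
print(outputs.loss)  # a single scalar: masked-LM loss + next-sentence loss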

The Wikipedia dataset Quentin and I talked about lets you read cleaned wikipedia articles easily, but it’s up to you (i.e. your custom dataset) to extract sentences from the articles and feed them to the above model.

[Edit] The task of the collator is to do additional transformation and preparation of the articles and encoded sentences before they are fed into the model. If you look at DataCollatorForLanguageModeling, it does the token masking for you. DataCollatorForNextSentencePrediction, on top of the masking, aligns the model inputs for the next-sentence task and creates the next-sentence labels. Amazing work.

Have you seen this colab notebook?

This is helpful. Thanks.

How exactly are we supposed to use DataCollatorForNextSentencePrediction? I am performing tokenization on the fly as __getitem__ is called. How exactly should the splitting be done for this? The docstring says “The input should contain negative examples”, which implies that the splitting criteria should add negative samples.

EDIT: Found this dataset class, but it won’t work with the datasets library.
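
For what it's worth, the "negative examples" requirement usually just means building the sentence pairs yourself, with roughly half of the second sentences drawn at random. A generic sketch (not the transformers implementation):

import random
from typing import Dict, List


def build_nsp_examples(documents: List[List[str]], nsp_probability: float = 0.5) -> List[Dict]:
    """Builds (sentence_a, sentence_b, label) examples from documents given as lists of sentences.
    Assumes at least two documents, so negative pairs can be drawn from a different document."""
    examples = []
    for doc_index, document in enumerate(documents):
        for i in range(len(document) - 1):
            sentence_a = document[i]
            if random.random() < nsp_probability:
                # Negative example: second sentence drawn from another document
                other = random.choice([d for j, d in enumerate(documents) if j != doc_index])
                sentence_b, label = random.choice(other), 1
            else:
                # Positive example: the actual next sentence in the same document
                sentence_b, label = document[i + 1], 0
            examples.append({"sentence_a": sentence_a, "sentence_b": sentence_b,
                             "next_sentence_label": label})
    return examples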

Hey everyone,

I was able to use HF datasets for BERT pre-training. Converting TextDatasetForNextSentencePrediction from file-based input to the wikipedia HF dataset is not that hard. I also used feature caching, though not on the entire 18 GB dataset but on a 1/100 shard of it. Processing a 1/100 shard of the wikipedia dataset takes ~30 min with really good hardware and a fast tokenizer, and it’s great that the features can be loaded in the next run, but I imagine processing the entire dataset would take prohibitively long, if it’s possible at all. So I played with the single shard just to see how everything works. And the training works, which is great.

However, I still want to somehow create train loader(s) over the entire wikipedia dataset. Nvidia’s BERT PyTorch training loop shards the input files with HDF5 and creates a new train loader every time it has gone through all the input examples of the current train loader. Does anyone know how to do this with the HF Trainer? I’ve seen that @julien-c worked on data collators and language-model training in general. If anyone else has some ideas - please let us know.

If you use nlp datasets and do the tokenization in the dataset (rather than in a collate_fn), you only have to tokenize all the data once. The tokenized datasets are then automatically cached (saved to disk) and loaded on the fly whenever necessary.

Thanks for your response, Bram. That’s exactly what I am doing. The final cached features file will be around 15 GB. How is pickle save/load able to handle that amount of data?

If I understand correctly, the nlp library doesn’t pickle the data in a conventional manner. Rather, it uses the Apache Arrow data format, which is more similar to databases/data tables. I still don’t quite understand how it works so well - it is very fast without loading all examples in memory - but I’m sure that @lhoestq can answer that question a lot better than I can.

Right, I understand the datasets library (formerly known as nlp) uses the Arrow format and is memory-mapped, but cached features are a different beast. All the examples I’ve seen simply load/save cached features using either pickle or torch load/save. How are they handled?

But that’s the thing: the features are the dataset. For instance, with the datasets library you can create a dataset based on features. I’ll give an example below, but the point is that you can just program your feature creation with dataset.map. This only needs to be run once, and the results are cached on disk as Arrow data. In future calls to the dataset, the map call will be automatically skipped and the preprocessed data is loaded directly.

In the example below, an input file dataset_f is processed (tokenized, keeping only the input_ids), split into train/dev/test sets, and the respective dataloaders are returned.

from pathlib import Path
from typing import Dict, List, Union

import torch
from torch import Tensor
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

from datasets import Dataset, DatasetDict
from transformers import PreTrainedTokenizer


def prepare_data(dataset_f: str,
                 tokenizer: PreTrainedTokenizer,
                 batch_size: int = 64,
                 num_workers: int = 0) -> Dict[str, DataLoader]:
    """Given an input file, prepare the train, test, validation dataloaders.
       The created datasets will be preprocessed and save to disk.
    :param dataset_f: input file
    :param tokenizer: pretrained tokenizer that will prepare the data, i.e. convert tokens into IDs
    :param batch_size: batch size for the dataloaders
    :param num_workers: number of CPU workers to use during dataloading. On Windows this must be zero
    :return: a dictionary containing train, test, validation dataloaders
    """

    def collate(batch: List[Dict[str, Tensor]]) -> Dict[str, Tensor]:
        """Collates gathered items to form a batch which is then used in training, evaluation, or testing.
        :param batch: a list of samples from the dataset. Each sample is a dictionary with keys "input_ids".
        :return: the created batch with keys "input_ids"
        """
        # Pad to the longest sequence in the batch; batch_first gives the (batch, sequence)
        # shape the model expects, and we pad with the tokenizer's pad token id
        all_input_ids = pad_sequence([item["input_ids"] for item in batch],
                                     batch_first=True,
                                     padding_value=tokenizer.pad_token_id).to(torch.long)

        return {"input_ids": all_input_ids}

    def preprocess(sentences: List[str]) -> Dict[str, Union[list, Tensor]]:
        """Preprocess the raw input sentences from the text file.
        :param sentences: a list of sentences (strings)
        :return: a dictionary of "input_ids"
        """
        tokens = [s.split() for s in sentences]

        # The sequences are not padded here; we leave that to the dataloader's collate function.
        # That means slightly slower processing, but a smaller saved dataset.
        return tokenizer(tokens,
                         add_special_tokens=False,
                         is_pretokenized=True,  # renamed to is_split_into_words in newer transformers versions
                         return_token_type_ids=False,
                         return_attention_mask=False)
    
    dataset = Dataset.from_dict({"text": Path(dataset_f).read_text(encoding="utf-8").splitlines()})

    # Split the dataset into train, test, dev
    # 90% (train), 10% (test + validation)
    train_testvalid = dataset.train_test_split(test_size=0.1)
    # 10% of total (test), 10% of total (validation)
    test_valid = train_testvalid["test"].train_test_split(test_size=0.5)

    dataset = DatasetDict({"train": train_testvalid["train"],
                           "test": test_valid["test"],
                           "valid": test_valid["train"]})

    dataset = dataset.map(preprocess, input_columns=["text"], batched=True)
    dataset.set_format("torch", columns=["input_ids"])

    return {partition: DataLoader(ds,
                                  batch_size=batch_size,
                                  shuffle=True,
                                  collate_fn=collate,
                                  num_workers=num_workers,
                                  pin_memory=True) for partition, ds in dataset.items()}

2 Likes

Amazing, I get it now! Thanks a bunch @BramVanroy!

1 Like

@prajjwal1 I am doing the same task - were you able to successfully complete the training of BERT?