Memory-Efficient Dataset Creation for NSP Training

We want to fine-tune BERT with the Next Sentence Prediction (NSP) objective, and we have a list of files containing the conversations. To prepare the training dataset, we currently read through all the files, load every conversation sentence into memory, create positive examples from adjacent sentences A and B (formatted as [CLS] A [SEP] B [SEP]), and create negative examples by randomly sampling two sentences A and B from the full set of conversation sentences.

The current logic is similar to the following (a rough sketch; the function, variable, and label-key names are illustrative):
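
import random

# rough in-memory sketch (illustrative names; `tokenizer` is assumed to be a BERT tokenizer)
def build_nsp_examples(files, tokenizer):
    all_sentences = []   # every sentence from every conversation file
    adjacent_pairs = []  # (A, B) pairs of adjacent sentences within a conversation
    for path in files:
        with open(path, encoding="utf-8") as f:
            sentences = [line.strip() for line in f if line.strip()]
        all_sentences.extend(sentences)
        adjacent_pairs.extend(zip(sentences, sentences[1:]))

    examples = []
    for a, b in adjacent_pairs:
        # positive example: [CLS] A [SEP] B [SEP], label 0 ("B follows A")
        examples.append({**tokenizer(a, b), "next_sentence_label": 0})
        # negative example: two randomly sampled sentences, label 1 ("random pair")
        rand_a, rand_b = random.sample(all_sentences, 2)
        examples.append({**tokenizer(rand_a, rand_b), "next_sentence_label": 1})
    return examples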

However, this is not memory-efficient: it loads all sentences into memory, and we now have so many sentences that they no longer fit.

Any suggestions for creating the NSP dataset more memory-efficiently? The load_dataset APIs look promising, but I haven’t figured out how to process the input files so that I can randomly sample sentences for the negative examples.

Thanks

Hi,

Instead of generating a dataset with load_dataset, it should be easier to create dataset chunks with Dataset.from_dict, which we can then save to disk with save_to_disk, reload and concatenate to get a memory-mapped dataset.

The code could look as follows:

# distribute the files across multiple directories ("chunk" dirs) so that no single
# LineByLineWithSOPTextDataset has to load the entire data
from datasets import Dataset, concatenate_datasets
from transformers import LineByLineWithSOPTextDataset

def list_of_dicts_to_dict_of_lists(list_of_dicts):
    # turn [{"a": 1, "b": 2}, {"a": 3, "b": 4}] into {"a": [1, 3], "b": [2, 4]}
    keys = list_of_dicts[0].keys()
    return {k: [d[k] for d in list_of_dicts] for k in keys}

chunks = []
for i, file_dir in enumerate(dirs_with_data_files):
    dset = LineByLineWithSOPTextDataset(tokenizer, file_dir, block_size)  # your tokenizer and max sequence length
    examples = list_of_dicts_to_dict_of_lists(dset.examples)
    chunk = Dataset.from_dict(examples)
    # currently `chunk` is in memory, so we save it to disk and reload it to make it memory-mapped
    chunk.save_to_disk(f"./chunks_dir/{i}")
    chunk = Dataset.load_from_disk(f"./chunks_dir/{i}")
    chunks.append(chunk)

final_dset = concatenate_datasets(chunks)
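
If you want to reuse the concatenated dataset later without rebuilding the chunks, you can also save it to disk and reload it (a small usage sketch; the path is just an example):

final_dset.save_to_disk("./nsp_dataset")

# later, e.g. right before training
from datasets import load_from_disk
final_dset = load_from_disk("./nsp_dataset")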