Memory Efficient Dataset Creation for NSP Training

We want to fine-tune BERT with the Next Sentence Prediction (NSP) objective, and we have a list of files that contain conversations. To prepare the training dataset, we currently read through all the files, load all conversation sentences into memory, create positive examples from adjacent sentences A and B, formatted as [CLS] A [SEP] B [SEP], and create negative examples by randomly sampling two sentences A and B from all conversation sentences.

The current logic is similar to:
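
A minimal sketch of this kind of in-memory pair creation, assuming each conversation is already available as a list of sentence strings. The function name build_nsp_pairs, the one-negative-per-positive ratio, and the label convention (0 for a true next sentence, 1 for a random pair) are illustrative assumptions:

```python
import random


def build_nsp_pairs(conversations, seed=0):
    """Build (sentence_a, sentence_b, label) triples entirely in memory.

    label 0: B directly follows A in the same conversation (positive).
    label 1: A and B are two randomly sampled sentences (negative).
    """
    rng = random.Random(seed)
    # Every sentence of every conversation is held in one list; this is
    # the part that stops scaling once the corpus no longer fits in RAM.
    all_sentences = [s for conv in conversations for s in conv]
    pairs = []
    for conv in conversations:
        for a, b in zip(conv, conv[1:]):
            pairs.append((a, b, 0))
            # Assumption: one negative per positive. The two sampled
            # sentences may occasionally happen to be adjacent.
            neg_a, neg_b = rng.sample(all_sentences, 2)
            pairs.append((neg_a, neg_b, 1))
    return pairs
```

Each triple would then be passed to the tokenizer as a sentence pair to produce the [CLS] A [SEP] B [SEP] input.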

However, this is not memory efficient because it loads all sentences into memory, and we now have so many sentences that they no longer fit into memory.

Any suggestions for creating the NSP dataset more memory-efficiently? The load_dataset APIs look promising, but I couldn't figure out how to process the input files so that sentences can be randomly sampled for the negative examples.



Instead of generating a dataset with load_dataset, it should be easier to create dataset chunks with Dataset.from_dict, save each chunk to disk with save_to_disk, then reload and concatenate the chunks to get a memory-mapped dataset.

The code could look as follows:

# Distribute the files across multiple directories (chunkify the dir) so that no
# single LineByLineWithSOPTextDataset has to load the entire data.
from datasets import Dataset, concatenate_datasets
from transformers import LineByLineWithSOPTextDataset

def list_of_dicts_to_dict_of_lists(d):
    # [{"a": 1, "b": 2}, {"a": 3, "b": 4}] -> {"a": [1, 3], "b": [2, 4]}
    keys = d[0].keys()
    return {k: [dic[k] for dic in d] for k in keys}

chunks = []
for i, file_dir in enumerate(dirs_with_data_files):
    # block_size is a required argument of LineByLineWithSOPTextDataset
    dset = LineByLineWithSOPTextDataset(<tokenizer>, file_dir, <block_size>)
    examples = list_of_dicts_to_dict_of_lists(dset.examples)
    chunk = Dataset.from_dict(examples)
    # At this point `chunk` is in memory, so we save it to disk and reload it to make it memory-mapped.
    # save_to_disk returns None, so the path is kept in a variable for load_from_disk.
    chunk_path = f"./chunks_dir/{i}"
    chunk.save_to_disk(chunk_path)
    chunks.append(Dataset.load_from_disk(chunk_path))

final_dset = concatenate_datasets(chunks)
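
The chunking above does not by itself cover the random sampling of negative examples asked about in the question. One possible way to handle that part, sketched here with only the standard library and not taken from the datasets API, is to stream the conversations and draw negatives from a fixed-size reservoir sample of the sentences seen so far. The names stream_nsp_pairs and read_conversations, the reservoir_size parameter, the one-negative-per-positive ratio, and the file format (one sentence per line, blank line between conversations) are all assumptions made for illustration:

```python
import random
from typing import Iterable, Iterator, List, Tuple


def read_conversations(paths: Iterable[str]) -> Iterator[List[str]]:
    # Assumed format: one sentence per line, blank line between conversations.
    for path in paths:
        with open(path, encoding="utf-8") as f:
            conv: List[str] = []
            for line in f:
                line = line.strip()
                if line:
                    conv.append(line)
                elif conv:
                    yield conv
                    conv = []
            if conv:
                yield conv


def stream_nsp_pairs(
    conversations: Iterable[List[str]],
    reservoir_size: int = 10_000,
    seed: int = 0,
) -> Iterator[Tuple[str, str, int]]:
    """Yield (sentence_a, sentence_b, label) one at a time.

    label 0: B follows A (positive); label 1: random pair (negative).
    Only `reservoir_size` sentences are kept in memory for negatives.
    """
    if reservoir_size < 2:
        raise ValueError("reservoir_size must be at least 2")
    rng = random.Random(seed)
    reservoir: List[str] = []
    seen = 0
    for conv in conversations:
        # Reservoir sampling (Algorithm R): a uniform sample of all
        # sentences seen so far, in at most `reservoir_size` slots.
        for sentence in conv:
            seen += 1
            if len(reservoir) < reservoir_size:
                reservoir.append(sentence)
            else:
                j = rng.randrange(seen)
                if j < reservoir_size:
                    reservoir[j] = sentence
        for a, b in zip(conv, conv[1:]):
            yield a, b, 0
            neg_a, neg_b = rng.sample(reservoir, 2)
            yield neg_a, neg_b, 1
```

The yielded triples could be collected in batches and turned into chunks with Dataset.from_dict as shown above. Note the trade-off: negatives are sampled only from sentences already seen, so early examples draw from a smaller pool than a full in-memory sample would, and shuffling the file order beforehand may help.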