I am training a BERT masked language model (MLM) from scratch.
When I used dataset A, which has about 14M sentences (~2 GB), the training speed per iteration was acceptable. However, when I switched to a larger dataset B, which has about 54M sentences (~10 GB), each iteration became much slower (roughly one third of the previous speed, by my estimate).
I am quite confused, since I only changed one line of the code:
raw_datasets = datasets.load_dataset("text", data_files=data_files_A,)
----->
raw_datasets = datasets.load_dataset("text", data_files=data_files_B,)
The configuration (the model, tokenizer, max_sequence_length and …) is the same as before. The dataset ".txt" files were also produced by the same Python script and have the same format.
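To narrow this down, one thing I could measure is whether the extra time per iteration is spent waiting for the next batch or inside the training step itself. Below is a minimal sketch of such a check. The names `profile_loop`, `batches`, and `step_fn` are hypothetical and not part of my training script; `batches` stands in for whatever iterable yields batches (for example a dataloader) and `step_fn` for the forward/backward/optimizer step.

```python
import time


def profile_loop(batches, step_fn, max_steps=50):
    """Split wall-clock time into 'waiting for data' and 'running the step'.

    batches  : any iterable that yields batches (hypothetical stand-in)
    step_fn  : callable that runs one training step on a batch (hypothetical)
    Returns (data_seconds, step_seconds, steps_run).
    """
    data_seconds = 0.0
    step_seconds = 0.0
    steps_run = 0
    iterator = iter(batches)

    while steps_run < max_steps:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent fetching the next batch
        except StopIteration:
            break
        t1 = time.perf_counter()

        # Caveat: on a GPU, kernels run asynchronously, so step_fn would
        # need to synchronize the device for this timing to be meaningful.
        step_fn(batch)
        t2 = time.perf_counter()

        data_seconds += t1 - t0
        step_seconds += t2 - t1
        steps_run += 1

    return data_seconds, step_seconds, steps_run


if __name__ == "__main__":
    # Toy usage: a generator that is slow to produce items and a fast step.
    def slow_batches(n):
        for i in range(n):
            time.sleep(0.01)  # simulated slow data loading
            yield i

    data_s, step_s, n = profile_loop(slow_batches(5), lambda b: None)
    print(f"steps={n} data={data_s:.3f}s step={step_s:.3f}s")
```

If the data time grows noticeably with dataset B while the step time stays about the same, that would suggest the slowdown is in data loading and not in the model.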
Am I missing something important?