Training speed becoming much slower when using a larger dataset

yijiulanpishu · March 31, 2022, 3:25pm

I am training a BERT masked language model (MLM) from scratch.

When I used dataset A, which has about 14M sentences (~2 GB), the training speed per iteration was acceptable. However, when I switched to the larger dataset B, which has about 54M sentences (~10 GB), each iteration became much slower (roughly 1/3 of the previous speed, by my estimate).
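Since the slowdown above is only an estimate, one way to put a number on it is to time how long it takes to fetch a fixed number of batches from each dataset. The following is a minimal sketch using only the standard library; `time_iterations` and the dummy batch generator are hypothetical names for illustration and are not part of the original training script. It measures only the time spent pulling batches from an iterable, not the model's forward/backward pass.

```python
import time
from typing import Iterable


def time_iterations(batches: Iterable, num_steps: int = 100) -> float:
    """Return the average seconds spent fetching one batch (up to num_steps batches)."""
    iterator = iter(batches)
    start = time.perf_counter()
    steps = 0
    for _ in range(num_steps):
        try:
            next(iterator)  # fetch one batch; the batch itself is discarded
        except StopIteration:
            break  # iterable ran out before num_steps was reached
        steps += 1
    elapsed = time.perf_counter() - start
    # NaN signals that no batch was fetched at all
    return elapsed / steps if steps else float("nan")


if __name__ == "__main__":
    # Hypothetical stand-in for a real dataloader: 1000 dummy "batches"
    dummy_batches = (list(range(128)) for _ in range(1000))
    avg = time_iterations(dummy_batches, num_steps=200)
    print(f"{avg:.6f} s per batch")
```

Running the same helper over the dataloader built from dataset A and the one built from dataset B would show whether the extra time is spent in data fetching or elsewhere in the training step.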

I am quite confused, since I changed only one line of the code:
raw_datasets = datasets.load_dataset("text", data_files=data_files_A,)
----->
raw_datasets = datasets.load_dataset("text", data_files=data_files_B,)

The configuration (model, tokenizer, max_sequence_length, and so on) is the same as before. The dataset ".txt" files were also produced by the same Python script and have the same format.

Am I missing something important?
