Taking a long time to start training

I am pretraining an ALBERT model on 40M examples (max_token_length is 64) stored in a jsonl file. Loading the data with Datasets is fast, but after loading it takes quite a long time for training to start (the training itself is fast once it begins). Does anyone know what happens to the Dataset between data loading and model training?
code:
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling

# read each jsonl file into an Arrow-backed Dataset
train_dataset = load_dataset('json', data_files=train_file_path, split='train')
test_dataset = load_dataset('json', data_files=test_file_path, split='train')

# masks 15% of the tokens for the MLM objective
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15,
    # pad_to_multiple_of=max_num_token
)
…
trainer.train()  # when entering this stage, it takes a long time (~1 hour) to start training
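
For context, the elided part ("…") tokenizes the datasets and builds the Trainer. A minimal sketch of what that step typically looks like — model, training_args, and the 'text' column name are placeholders for this illustration, not the exact code:

from transformers import Trainer

# hypothetical sketch of the elided step: tokenize once with Dataset.map
# ('text' is an assumed column name; model and training_args are assumed
# to be defined elsewhere in the script)
def tokenize_fn(batch):
    return tokenizer(batch['text'], truncation=True, max_length=64)

train_dataset = train_dataset.map(tokenize_fn, batched=True, remove_columns=['text'])
test_dataset = test_dataset.map(tokenize_fn, batched=True, remove_columns=['text'])

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
)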

Hi! The Trainer runs some dataset-preparation code before training starts. Can you interrupt the script and share the stack trace (so we can see where it gets stuck)?
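
If killing the run with Ctrl+C is inconvenient, a non-destructive way to capture the same information (a sketch, assuming a Unix-like OS) is to register a signal handler with the standard-library faulthandler module:

# add near the top of the training script; then, while it hangs,
# run `kill -USR1 <pid>` from another shell to print every thread's
# current stack trace to stderr without stopping the process
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1)

Alternatively, the external py-spy tool can print the stack of an already-running process without modifying the script: `py-spy dump --pid <pid>`.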