Taking a long time to start training

I am pretraining an ALBERT model on 40M examples (max_token_length is 64) stored in a jsonl file. Loading the data with Datasets is fast, but after loading it takes quite a long time for training to start (the training itself is fast once it begins). Does anyone know what happens to the Dataset between data loading and model training?
code:
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling

# read each jsonl file into an Arrow-backed Dataset
train_dataset = load_dataset('json', data_files=train_file_path, split='train')
test_dataset = load_dataset('json', data_files=test_file_path, split='train')

# masks 15% of the tokens for the MLM objective
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15,
    # pad_to_multiple_of=max_num_token
)
…
trainer.train()  # when entering this stage, it takes a long time (~1 hour) to start training
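
For context, the elided part ("…") tokenizes the datasets and builds the Trainer. A minimal sketch of what that step typically looks like — model, training_args, and the 'text' column name are placeholders for this illustration, not the exact code:

from transformers import Trainer

# hypothetical sketch of the elided step: tokenize once with Dataset.map
# ('text' is an assumed column name; model and training_args are assumed
# to be defined elsewhere in the script)
def tokenize_fn(batch):
    return tokenizer(batch['text'], truncation=True, max_length=64)

train_dataset = train_dataset.map(tokenize_fn, batched=True, remove_columns=['text'])
test_dataset = test_dataset.map(tokenize_fn, batched=True, remove_columns=['text'])

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
)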

Hi! The Trainer runs some dataset-preparation code before training starts. Can you interrupt the script and share the stack trace (so we can see where it gets stuck)?
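
If killing the run with Ctrl+C is inconvenient, a non-destructive way to capture the same information (a sketch, assuming a Unix-like OS) is to register a signal handler with the standard-library faulthandler module:

# add near the top of the training script; then, while it hangs,
# run `kill -USR1 <pid>` from another shell to print every thread's
# current stack trace to stderr without stopping the process
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1)

Alternatively, the external py-spy tool can print the stack of an already-running process without modifying the script: `py-spy dump --pid <pid>`.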