How is the dataset loaded?

Hi everyone,

I’m trying to pre-train BERT on a cluster server, using the classic run_mlm script. I have a dataset of 27M sentences divided into 27 files.
When I was testing my script with 2-5 files, everything worked perfectly, but when I try to use the whole dataset, the execution seems to get stuck before training starts, even though the dataset caching has already finished!
The job doesn’t stop until the time limit is reached, and I get this error:

slurmstepd: error: Detected 2 oom-kill event(s) in StepId=5328394.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

I thought the dataset was loaded lazily when using the transformers Trainer, am I wrong? Have you got any suggestions?
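
My understanding is that the Arrow cache is memory-mapped, so a rough check like the one below (the file names and the psutil dependency are just placeholders, not my actual setup) should show almost no RAM growth once the dataset is loaded:

import os
import psutil  # assumed to be available; only used to report resident memory
from datasets import load_dataset

def rss_gb():
    # resident set size of the current process, in gigabytes
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3

# placeholder names for the 27 sentence files
data_files = [f"sentences_{i:02d}.txt" for i in range(27)]

print(f"RSS before load: {rss_gb():.2f} GB")
raw_dataset = load_dataset("text", data_files=data_files, split="train")
print(f"RSS after load:  {rss_gb():.2f} GB")  # should barely move if memory-mapped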

Thanks in advance!

Hi! What do you get when you run

print(dset.cache_files)

on your dataset object?
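
For instance, something like this (the paths are illustrative) shows the difference between a dataset held entirely in RAM and one memory-mapped from Arrow files on disk:

from datasets import Dataset, load_dataset

# a dataset built from in-memory Python objects has no cache files
in_memory = Dataset.from_dict({"text": ["a", "b", "c"]})
print(in_memory.cache_files)  # -> []

# a dataset loaded from files on disk is backed by Arrow cache files
on_disk = load_dataset("text", data_files="some_file.txt", split="train")
print(on_disk.cache_files)  # -> [{'filename': '.../text-train.arrow'}, ...]

If the list is empty, the dataset is sitting entirely in RAM; if it lists Arrow files, the underlying table is memory-mapped and the memory spike is likely coming from somewhere else.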