Hi everyone,
I’m trying to pre-train BERT on a cluster, using the classic run_mlm script. My dataset consists of 27M sentences split across 27 files.
When I tested the script with 2-5 files, everything worked perfectly, but when I try to use the whole dataset, the execution gets stuck after the dataset has been cached but before training starts!
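For context, as far as I understand it, the loading step inside run_mlm boils down to something like this (the file names below are just placeholders for my 27 parts):

```python
from datasets import load_dataset

# Plain-text input goes through the "text" builder; the resulting
# Arrow cache is written to disk the first time this runs.
data_files = {"train": [f"corpus/part_{i:02d}.txt" for i in range(27)]}
raw_datasets = load_dataset("text", data_files=data_files)

# Tokenization/grouping is then applied with .map(), whose output is
# also cached on disk before the Trainer ever starts.
```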
The job keeps running until the time limit is reached, and I get this error:
slurmstepd: error: Detected 2 oom-kill event(s) in StepId=5328394.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
I thought the dataset was loaded lazily by the transformers Trainer, am I wrong? Do you have any suggestions?
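If it helps, this is the minimal check I was planning to run to see whether merely loading the cached dataset blows up the memory (file names are placeholders again, and it assumes psutil is installed):

```python
import os
import psutil
from datasets import load_dataset

def rss_gb():
    # Resident set size of the current process, in GB
    return psutil.Process(os.getpid()).memory_info().rss / 1024**3

print(f"RSS before load: {rss_gb():.2f} GB")
ds = load_dataset("text", data_files={"train": [f"corpus/part_{i:02d}.txt" for i in range(27)]})
print(f"RSS after load:  {rss_gb():.2f} GB")  # should stay small if the Arrow files are memory-mapped
```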
Thanks in advance!