Hi everyone,
I’m trying to pre-train BERT on a cluster, using the classic run_mlm script. My dataset consists of 27M sentences split across 27 files.
When I tested the script with 2-5 files, everything worked perfectly, but when I try to use the whole dataset, the execution gets stuck after the dataset has been cached but before training starts!
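For context, as far as I understand it, the loading step inside run_mlm boils down to something like this (the file names below are just placeholders for my 27 parts):

```python
from datasets import load_dataset

# Plain-text input goes through the "text" builder; the resulting
# Arrow cache is written to disk the first time this runs.
data_files = {"train": [f"corpus/part_{i:02d}.txt" for i in range(27)]}
raw_datasets = load_dataset("text", data_files=data_files)

# Tokenization/grouping is then applied with .map(), whose output is
# also cached on disk before the Trainer ever starts.
```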
The job keeps running until the time limit is reached, and I get this error:
slurmstepd: error: Detected 2 oom-kill event(s) in StepId=5328394.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
I thought the dataset was loaded lazily by the transformers Trainer, am I wrong? Do you have any suggestions?
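If it helps, this is the minimal check I was planning to run to see whether merely loading the cached dataset blows up the memory (file names are placeholders again, and it assumes psutil is installed):

```python
import os
import psutil
from datasets import load_dataset

def rss_gb():
    # Resident set size of the current process, in GB
    return psutil.Process(os.getpid()).memory_info().rss / 1024**3

print(f"RSS before load: {rss_gb():.2f} GB")
ds = load_dataset("text", data_files={"train": [f"corpus/part_{i:02d}.txt" for i in range(27)]})
print(f"RSS after load:  {rss_gb():.2f} GB")  # should stay small if the Arrow files are memory-mapped
```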
Thanks in advance!