Dataset.map hangs on tokenization (relatively small dataset)

Hello!

My Dataset is not huge at all: num_rows: 198596 in the training set and 24825 in test and valid datasets each.
I have 4 CPU cores available in Colab (output of !nproc).

Yet when I run dataset.map, at some point it hangs and never finishes. It usually hangs at the same percentage. Stopping it and re-running doesn’t help (though the cached files are loaded properly).

I run dataset.map with the following arguments:
tokenized_ds = dataset.map(preprocess_function, num_proc=4, batched=True, remove_columns=["name"])

The tokenizer used in the preprocess_function is AutoTokenizer.from_pretrained("distilbert-base-uncased"), but I doubt it matters.
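For context, the preprocess_function looks roughly like this (the "text" column name here is just a placeholder for my actual text column):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(batch):
    # Tokenize a batch of examples; truncation caps sequences at the model's max length
    return tokenizer(batch["text"], truncation=True)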

I would appreciate suggestions!


Hi! Does this only happen when num_proc is greater than 1? If you haven’t checked, set num_proc to 1 (or None) to make debugging easier. Also, feel free to share the entire traceback you get after interrupting the process while waiting for it to finish.
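For example, a single-process run like this (keeping your other arguments unchanged) makes any exception or traceback much easier to read, since errors surface in the main process instead of a worker:

tokenized_ds = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=["name"],
    num_proc=None,  # or 1: disable multiprocessing while debugging
)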

Hi!

Thanks for the reply!

The problem was that some rows contained far too much data (up to 15 MB each!), so the batch didn’t fit in memory and the whole thing was crashing.

I solved it one step earlier, by splitting the data into smaller chunks before tokenization.
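Roughly, the chunking step looked like this (a sketch only; the "text" column name and the chunk size are illustrative, not the exact values I used):

CHUNK_SIZE = 1_000_000  # characters per chunk; illustrative value

def split_into_chunks(batch):
    # Break each long document into fixed-size pieces so no single row is huge
    chunks = []
    for text in batch["text"]:
        chunks.extend(text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE))
    return {"text": chunks}

# batched=True lets one input row map to several output rows;
# the original columns must be removed because the row count changes
chunked = dataset.map(
    split_into_chunks,
    batched=True,
    remove_columns=dataset["train"].column_names,
)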