Dataset.map hangs on tokenization (relatively small dataset)

Hello!

My Dataset is not huge at all: num_rows: 198596 in the training set and 24825 in test and valid datasets each.
I have 4 CPU cores available in Colab (output of !nproc).

Yet when I run dataset.map, at some point it hangs and never finishes. It usually hangs at the same percentage. Stopping it and re-running doesn’t help (though the cached files are loaded properly).

I run dataset.map with the following arguments:
tokenized_ds = dataset.map(preprocess_function, num_proc=4, batched=True, remove_columns=["name"])

The tokenizer used in the preprocess_function is AutoTokenizer.from_pretrained("distilbert-base-uncased"), but I doubt it matters.
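For context, the preprocess_function looks roughly like this (the "text" column name here is just a placeholder for my actual text column):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(batch):
    # Tokenize a batch of examples; truncation caps sequences at the model's max length
    return tokenizer(batch["text"], truncation=True)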

I would appreciate suggestions!


Hi! Does this only happen when num_proc is greater than 1? If you haven’t checked, set num_proc to 1 (or None) to make debugging easier. Also, feel free to share the entire traceback you get after interrupting the process while waiting for it to finish.
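For example, a single-process run like this (keeping your other arguments unchanged) makes any exception or traceback much easier to read, since errors surface in the main process instead of a worker:

tokenized_ds = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=["name"],
    num_proc=None,  # or 1: disable multiprocessing while debugging
)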

Hi!

Thanks for the reply!

The problem was that some rows contained far too much data (up to 15 MB each!), so the batch didn’t fit in memory and the whole thing was crashing.

I solved it one step earlier, by splitting the data into smaller chunks before tokenization.
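Roughly, the chunking step looked like this (a sketch only; the "text" column name and the chunk size are illustrative, not the exact values I used):

CHUNK_SIZE = 1_000_000  # characters per chunk; illustrative value

def split_into_chunks(batch):
    # Break each long document into fixed-size pieces so no single row is huge
    chunks = []
    for text in batch["text"]:
        chunks.extend(text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE))
    return {"text": chunks}

# batched=True lets one input row map to several output rows;
# the original columns must be removed because the row count changes
chunked = dataset.map(
    split_into_chunks,
    batched=True,
    remove_columns=dataset["train"].column_names,
)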