Hello!
My dataset is not huge at all: num_rows is 198596 in the training set and 24825 in each of the test and validation sets.
I have 4 processing units in Colab (output of !nproc).
Yet when I run dataset.map, it hangs at some point and never finishes. It usually hangs at the same percentage. Stopping it and re-running doesn't help (although the cached files are loaded properly).
I run dataset.map with the following arguments:
tokenized_ds = dataset.map(preprocess_function, num_proc=4, batched=True, remove_columns=['name'])
The tokenizer used in the preprocess_function is AutoTokenizer.from_pretrained("distilbert-base-uncased"), but I doubt it matters.
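In case it helps to reproduce this, here is a stripped-down sketch of the setup. It is not my exact code: the "text" column, the toy rows, and passing the tokenizer through fn_kwargs are placeholders I made up for illustration. Only the map arguments and the tokenizer checkpoint match what I described above.

```python
def preprocess_function(examples, tokenizer):
    # "text" is a placeholder column name; the real column may differ.
    return tokenizer(examples["text"], truncation=True)


def main():
    # Imported here so the function above can be read and tested on its own.
    from datasets import Dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    # Toy stand-in for the real dataset (made-up rows, for illustration only).
    dataset = Dataset.from_dict(
        {
            "name": ["a", "b", "c", "d"],
            "text": ["first row", "second row", "third row", "fourth row"],
        }
    )

    tokenized_ds = dataset.map(
        preprocess_function,
        fn_kwargs={"tokenizer": tokenizer},  # assumption: tokenizer passed explicitly
        num_proc=4,
        batched=True,
        remove_columns=["name"],
    )
    print(tokenized_ds)
```

Calling main() runs the whole thing. Outside a notebook it is probably safer to call it under an if __name__ == "__main__": guard, since num_proc > 1 starts worker processes.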
I would appreciate suggestions!