Describe the bug
`map` has been taking extremely long to preprocess my data.
It processes 1000 examples quickly (in about 10 seconds), then hangs for a good 1-2 minutes before moving on to the next batch of 1000 examples.
It also keeps eating up my hard drive space for some reason, creating a file named tmp1335llua that is over 300 GB.
Trying to set `num_proc` to be >1 also gives me the following error: `NameError: name 'processor' is not defined`
Please advise on how I could optimise this?
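For reference, this `NameError` is consistent with `num_proc > 1` spawning fresh worker processes (the default start method on Windows) that never run the module-level code defining `processor`. A minimal sketch of the usual workaround, passing the object in explicitly via `map`'s `fn_kwargs`; the stand-in "processor" below is hypothetical, only there to keep the sketch self-contained:

```python
# Hedged sketch (not a verified fix): pass `processor` into the mapped
# function explicitly instead of relying on a module-level global, e.g.
#   raw_dataset.map(prepare_dataset, fn_kwargs={"processor": processor}, num_proc=4)
# The same pattern in plain Python, with a hypothetical stand-in processor:

def prepare_dataset(batch, processor):
    # `processor` arrives as an argument, so each worker process receives
    # its own pickled copy instead of looking up a global it never defined
    batch["labels"] = processor(batch["sentence"])
    return batch

fake_processor = str.upper  # hypothetical stand-in for the real processor
out = prepare_dataset({"sentence": "hello"}, processor=fake_processor)
print(out["labels"])  # HELLO
```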
Steps to reproduce the bug
In general, I have been using map as per normal. Here is a snippet of my code:
```python
########################### DATASET LOADING AND PREP #########################

def load_custom_dataset(split):
    ds = []
    if split == 'train':
        for dset in args.train_datasets:
            ds.append(load_from_disk(dset))
    if split == 'test':
        for dset in args.test_datasets:
            ds.append(load_from_disk(dset))

    ds_to_return = concatenate_datasets(ds)
    ds_to_return = ds_to_return.shuffle(seed=22)
    return ds_to_return

def prepare_dataset(batch):
    # load and (possibly) resample audio data to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # compute input length of audio sample in seconds
    batch["input_length"] = len(audio["array"]) / audio["sampling_rate"]

    # optional pre-processing steps
    transcription = batch["sentence"]
    if do_lower_case:
        transcription = transcription.lower()
    if do_remove_punctuation:
        transcription = normalizer(transcription).strip()

    # encode target text to label ids
    batch["labels"] = processor.tokenizer(transcription).input_ids
    return batch

print('DATASET PREPARATION IN PROGRESS...')

# case 3: combine_and_shuffle is true, only train provided
# load train datasets
train_set = load_custom_dataset('train')

# split dataset
raw_dataset = DatasetDict()
raw_dataset = train_set.train_test_split(test_size=args.test_size, shuffle=True, seed=42)
raw_dataset = raw_dataset.cast_column("audio", Audio(sampling_rate=args.sampling_rate))

print("Before Map:")
print(raw_dataset)

raw_dataset = raw_dataset.map(prepare_dataset, num_proc=1)

print("After Map:")
print(raw_dataset)
```
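As an aside on the 1000-example rhythm: `Dataset.map` buffers processed rows in memory and flushes them to the on-disk Arrow cache every `writer_batch_size` rows, which defaults to 1000, so the "hang" likely coincides with each flush. A back-of-envelope sketch of the flush size; the per-row figure is an assumption for Whisper-style log-Mel features, not a measurement:

```python
# Assumption: each row stores ~80 x 3000 float32 log-Mel features,
# as a Whisper-style feature extractor would produce.
mel_bins, frames, bytes_per_float = 80, 3000, 4
per_row_bytes = mel_bins * frames * bytes_per_float
print(per_row_bytes)  # 960000 bytes, roughly 1 MB per row

# map's default writer_batch_size is 1000, so each flush to the cache
# writes on the order of a gigabyte in one go:
writer_batch_size = 1000
flush_mb = per_row_bytes * writer_batch_size / 1e6
print(flush_mb)  # 960.0 MB per flush
```

If the flushes are the bottleneck, a smaller `writer_batch_size` passed to `map` would trade more frequent, smaller writes for the long pauses; I have not verified this on the data above.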
Expected behavior
Based on the speed at which map processes examples, I would expect the full mapping to complete in 5-6 hours.
However, because it hangs every 1000 examples, I instead roughly estimate it would take about 40 hours!
Moreover, I can't even finish the map, because it keeps eating up my hard drive space.
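The 40-hour and 5-6-hour figures can be sanity-checked against the observed pace; the implied total row count below is an inference from the numbers given, not something stated in the report:

```python
fast_s_per_1000 = 10       # stated: ~10 s to process 1000 examples
hang_s_per_1000 = 90       # stated: hangs 1-2 minutes, take ~1.5 min
total_hours_observed = 40  # stated rough estimate at the observed pace

# implied dataset size, assuming the 40 h figure reflects the observed pace
rows = total_hours_observed * 3600 // (fast_s_per_1000 + hang_s_per_1000) * 1000
print(rows)  # 1440000 rows implied

# at the fast pace alone, the same rows would take:
hours_fast_only = rows / 1000 * fast_s_per_1000 / 3600
print(hours_fast_only)  # 4.0 hours, the same order as the 5-6 h expectation
```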
Environment info
- `datasets` version: 2.18.0
- Platform: Windows-10-10.0.22631-SP0
- Python version: 3.10.14
- `huggingface_hub` version: 0.22.2
- PyArrow version: 15.0.2
- Pandas version: 2.2.1
- `fsspec` version: 2024.2.0