Datasets map keeps hanging

Describe the bug

Dataset.map has been taking an extremely long time to preprocess my data.

It processes each batch of 1000 examples quickly (in about 10 seconds), then hangs for a good 1-2 minutes before moving on to the next batch of 1000 examples.
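The stall interval lines up with map's default writer_batch_size of 1000, so one thing I plan to try is making the Arrow writer flush smaller chunks more often (just a sketch; I have not confirmed this is actually the bottleneck):

raw_dataset = raw_dataset.map(
    prepare_dataset,
    num_proc=1,
    writer_batch_size=100,  # default is 1000; write smaller chunks to disk more often
)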

It also keeps eating up my hard drive space for some reason: it creates a file named tmp1335llua that has already grown to over 300 GB.
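If that tmp file is the on-disk cache of the processed features (the log-Mel input_features arrays are far larger than the raw audio), I was at least planning to point the cache at a bigger drive; a sketch, assuming the cache_file_names argument of DatasetDict.map takes one explicit .arrow path per split as documented (the paths here are hypothetical):

raw_dataset = raw_dataset.map(
    prepare_dataset,
    num_proc=1,
    cache_file_names={
        "train": "D:/hf_cache/train_prepared.arrow",  # hypothetical path on a larger drive
        "test": "D:/hf_cache/test_prepared.arrow",
    },
)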

Trying to set num_proc to a value > 1 also gives me the following error: NameError: name 'processor' is not defined
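My guess is that the worker processes spawned on Windows do not inherit my module-level processor, so I was going to try handing it to each worker explicitly; a sketch, assuming map's fn_kwargs forwards extra keyword arguments to the mapped function:

def prepare_dataset(batch, processor):
    ...  # same body as in the snippet below, but using the passed-in processor

raw_dataset = raw_dataset.map(
    prepare_dataset,
    num_proc=4,
    fn_kwargs={"processor": processor},  # ship the processor to every worker
)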

Please advise on how I can optimise this.

Steps to reproduce the bug

In general, I have been using map as per normal. Here is a snippet of my code:

###########################        DATASET LOADING AND PREP        #########################

from datasets import Audio, concatenate_datasets, load_from_disk

# args, processor, normalizer, do_lower_case and do_remove_punctuation
# are all defined earlier in the full script; this is just the mapping part.

def load_custom_dataset(split):
    ds = []
    if split == 'train':
        for dset in args.train_datasets:
            ds.append(load_from_disk(dset))
    elif split == 'test':
        for dset in args.test_datasets:
            ds.append(load_from_disk(dset))

    ds_to_return = concatenate_datasets(ds)
    ds_to_return = ds_to_return.shuffle(seed=22)
    return ds_to_return



def prepare_dataset(batch):
    # load and (possibly) resample audio data to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # compute input length of audio sample in seconds
    batch["input_length"] = len(audio["array"]) / audio["sampling_rate"]

    # optional pre-processing steps
    transcription = batch["sentence"]
    if do_lower_case:
        transcription = transcription.lower()
    if do_remove_punctuation:
        transcription = normalizer(transcription).strip()

    # encode target text to label ids
    batch["labels"] = processor.tokenizer(transcription).input_ids
    return batch

print('DATASET PREPARATION IN PROGRESS...')

# case 3: combine_and_shuffle is true, only train provided
# load train datasets
train_set = load_custom_dataset('train')

# split dataset (train_test_split already returns a DatasetDict)
raw_dataset = train_set.train_test_split(test_size=args.test_size, shuffle=True, seed=42)

raw_dataset = raw_dataset.cast_column("audio", Audio(sampling_rate=args.sampling_rate))

print("Before Map:")
print(raw_dataset)

raw_dataset = raw_dataset.map(prepare_dataset, num_proc=1)

print("After Map:")
print(raw_dataset)
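For completeness, here is a batched variant I was also planning to test, on the assumption that batching amortises the per-call overhead of the feature extractor and tokenizer (an untested sketch; the lower-casing/punctuation steps are omitted for brevity):

def prepare_dataset_batched(batch):
    # in batched mode, batch["audio"] is a list of decoded audio dicts
    arrays = [a["array"] for a in batch["audio"]]
    rate = batch["audio"][0]["sampling_rate"]  # uniform after the cast_column above
    batch["input_features"] = processor.feature_extractor(
        arrays, sampling_rate=rate
    ).input_features
    batch["input_length"] = [len(a) / rate for a in arrays]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

raw_dataset = raw_dataset.map(
    prepare_dataset_batched,
    batched=True,
    batch_size=16,  # small batches to keep peak memory modest
)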

Expected behavior

Based on the speed at which map processes each batch of examples, I would expect the whole mapping to finish in 5-6 hours.

However, because it hangs after every 1000 examples, I roughly estimate it will instead take about 40 hours!

Moreover, I can't even finish the map, because the tmp file keeps growing and eating up my hard drive space.

Environment info

  • datasets version: 2.18.0
  • Platform: Windows-10-10.0.22631-SP0
  • Python version: 3.10.14
  • huggingface_hub version: 0.22.2
  • PyArrow version: 15.0.2
  • Pandas version: 2.2.1
  • fsspec version: 2024.2.0