Describe the bug
`map` has been taking extremely long to preprocess my data.
It processes 1000 examples quickly (in about 10 seconds), then hangs for a good 1-2 minutes before moving on to the next batch of 1000 examples.
It also keeps eating up my hard drive space for some reason, creating a file named tmp1335llua that is over 300 GB.
Trying to set `num_proc` to be >1 also gives me the following error: `NameError: name 'processor' is not defined`
Please advise on how I could optimise this?
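For reference, this `NameError` is consistent with `num_proc > 1` spawning fresh worker processes (the default start method on Windows) that never run the module-level code defining `processor`. A minimal sketch of the usual workaround, passing the object in explicitly via `map`'s `fn_kwargs`; the stand-in "processor" below is hypothetical, only there to keep the sketch self-contained:

```python
# Hedged sketch (not a verified fix): pass `processor` into the mapped
# function explicitly instead of relying on a module-level global, e.g.
#   raw_dataset.map(prepare_dataset, fn_kwargs={"processor": processor}, num_proc=4)
# The same pattern in plain Python, with a hypothetical stand-in processor:

def prepare_dataset(batch, processor):
    # `processor` arrives as an argument, so each worker process receives
    # its own pickled copy instead of looking up a global it never defined
    batch["labels"] = processor(batch["sentence"])
    return batch

fake_processor = str.upper  # hypothetical stand-in for the real processor
out = prepare_dataset({"sentence": "hello"}, processor=fake_processor)
print(out["labels"])  # HELLO
```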
Steps to reproduce the bug
In general, I have been using map as per normal. Here is a snippet of my code:
```python
########################### DATASET LOADING AND PREP #########################

def load_custom_dataset(split):
    ds = []
    if split == 'train':
        for dset in args.train_datasets:
            ds.append(load_from_disk(dset))
    if split == 'test':
        for dset in args.test_datasets:
            ds.append(load_from_disk(dset))

    ds_to_return = concatenate_datasets(ds)
    ds_to_return = ds_to_return.shuffle(seed=22)
    return ds_to_return

def prepare_dataset(batch):
    # load and (possibly) resample audio data to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # compute input length of audio sample in seconds
    batch["input_length"] = len(audio["array"]) / audio["sampling_rate"]

    # optional pre-processing steps
    transcription = batch["sentence"]
    if do_lower_case:
        transcription = transcription.lower()
    if do_remove_punctuation:
        transcription = normalizer(transcription).strip()

    # encode target text to label ids
    batch["labels"] = processor.tokenizer(transcription).input_ids
    return batch

print('DATASET PREPARATION IN PROGRESS...')

# case 3: combine_and_shuffle is true, only train provided
# load train datasets
train_set = load_custom_dataset('train')

# split dataset
raw_dataset = DatasetDict()
raw_dataset = train_set.train_test_split(test_size=args.test_size, shuffle=True, seed=42)
raw_dataset = raw_dataset.cast_column("audio", Audio(sampling_rate=args.sampling_rate))

print("Before Map:")
print(raw_dataset)

raw_dataset = raw_dataset.map(prepare_dataset, num_proc=1)

print("After Map:")
print(raw_dataset)
```
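As an aside on the 1000-example rhythm: `Dataset.map` buffers processed rows in memory and flushes them to the on-disk Arrow cache every `writer_batch_size` rows, which defaults to 1000, so the "hang" likely coincides with each flush. A back-of-envelope sketch of the flush size; the per-row figure is an assumption for Whisper-style log-Mel features, not a measurement:

```python
# Assumption: each row stores ~80 x 3000 float32 log-Mel features,
# as a Whisper-style feature extractor would produce.
mel_bins, frames, bytes_per_float = 80, 3000, 4
per_row_bytes = mel_bins * frames * bytes_per_float
print(per_row_bytes)  # 960000 bytes, roughly 1 MB per row

# map's default writer_batch_size is 1000, so each flush to the cache
# writes on the order of a gigabyte in one go:
writer_batch_size = 1000
flush_mb = per_row_bytes * writer_batch_size / 1e6
print(flush_mb)  # 960.0 MB per flush
```

If the flushes are the bottleneck, a smaller `writer_batch_size` passed to `map` would trade more frequent, smaller writes for the long pauses; I have not verified this on the data above.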
Expected behavior
Based on the speed at which map processes examples, I would expect the full mapping to complete in 5-6 hours.
However, because it hangs every 1000 examples, I instead roughly estimate it would take about 40 hours!
Moreover, I can't even finish the map, because it keeps eating up my hard drive space.
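The 40-hour and 5-6-hour figures can be sanity-checked against the observed pace; the implied total row count below is an inference from the numbers given, not something stated in the report:

```python
fast_s_per_1000 = 10       # stated: ~10 s to process 1000 examples
hang_s_per_1000 = 90       # stated: hangs 1-2 minutes, take ~1.5 min
total_hours_observed = 40  # stated rough estimate at the observed pace

# implied dataset size, assuming the 40 h figure reflects the observed pace
rows = total_hours_observed * 3600 // (fast_s_per_1000 + hang_s_per_1000) * 1000
print(rows)  # 1440000 rows implied

# at the fast pace alone, the same rows would take:
hours_fast_only = rows / 1000 * fast_s_per_1000 / 3600
print(hours_fast_only)  # 4.0 hours, the same order as the 5-6 h expectation
```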
Environment info
- `datasets` version: 2.18.0
- Platform: Windows-10-10.0.22631-SP0
- Python version: 3.10.14
- `huggingface_hub` version: 0.22.2
- PyArrow version: 15.0.2
- Pandas version: 2.2.1
- `fsspec` version: 2024.2.0