.map function overloads my cache

Hi, I'm currently trying to map about 5000 audio files in my dataset on AWS SageMaker. When I try to do that, my kernel crashes after a few minutes; I assume it's overloading my cache memory… When I use map with only 1000 files it works fine! Does anybody know what I can do about that? This is my code and the function I use to process my data:

import numpy as np

# `processor` is the speech processor defined earlier in the notebook
def preprocess_function(examples):
    audio_arrays = [list(x["array"]) for x in examples["audio"]]
    max_length_audio = max(len(audio) for audio in audio_arrays)
    # pad (or truncate) every example to the longest audio in this batch
    audio_arrays_padded = [
        np.pad(audio, (0, max_length_audio - len(audio)))
        if len(audio) < max_length_audio
        else audio[:max_length_audio]
        for audio in audio_arrays
    ]
    print(audio_arrays_padded[0][:10])

    text_list = examples['transcription']

    input_data = processor(
        audio=audio_arrays_padded,
        text_target=text_list,
        sampling_rate=16000,
        return_tensors='pt',
        return_attention_mask=True,
        padding='longest'
    )

    print(input_data)
    print(input_data['input_values'].shape)
    print(input_data['attention_mask'].shape)
    print(input_data['labels'].shape)
    print(input_data['decoder_attention_mask'].shape)

    return {"input_values": input_data['input_values'],
            "attention_mask": input_data['attention_mask'],
            "labels": input_data['labels'],
            "decoder_attention_mask": input_data['decoder_attention_mask']}

train_dataset = train_dataset.map(
    preprocess_function,
    batched=True,
    batch_size=None,
    keep_in_memory=True,
    load_from_cache_file=False
)

You must be running out of memory.

You’re using a batched map with batch_size=None, which passes the full dataset to your preprocessing function as a single batch.

Instead, you can specify a finite batch_size so that map doesn’t load everything at once and you don’t run out of memory.
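For instance, something along these lines (the batch sizes here are just illustrative, not values from this thread):

# Process the dataset in chunks instead of one giant batch, and let the
# results be written to the on-disk Arrow cache instead of kept in RAM.
train_dataset = train_dataset.map(
    preprocess_function,
    batched=True,
    batch_size=100,          # 100 examples per call instead of all 5000
    keep_in_memory=False,    # write the processed data to disk, not RAM
    writer_batch_size=100,   # flush to disk every 100 processed examples
)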

Moreover, regarding padding, I’d suggest not padding the examples during map, but doing it in your collate_fn during training instead. You only need to pad to the longest example of a training batch; there’s no need to pad to the longest example of the full dataset.
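A minimal sketch of what that could look like, reusing the same processor call as in the question and assuming the dataset still contains the raw audio and transcription columns (rather than pre-padded arrays):

def collate_fn(features):
    # Pad each training batch to its own longest example instead of
    # padding everything to the longest example of the whole dataset.
    audio_arrays = [f["audio"]["array"] for f in features]
    text_list = [f["transcription"] for f in features]
    batch = processor(
        audio=audio_arrays,
        text_target=text_list,
        sampling_rate=16000,
        return_tensors='pt',
        return_attention_mask=True,
        padding='longest'
    )
    return batch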


Thanks for the reply! By padding in the collate_fn, do you mean that I should use a data collator, for example, where I pad my data, and then set the batch size in the training args? Do you maybe have a code example for this? It would really help me!

Yup, the data collator can take a batch as input and apply the padding 🙂

I don’t have a code example, but you can check DataCollatorWithPadding, which applies padding to tokenized inputs, for example: Data Collator
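As a rough sketch of how that could be wired together (the model, output directory, and batch size below are placeholders, not values from this thread): the batch size lives in TrainingArguments, and the collate function from above is passed to the Trainer as its data_collator.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="speech_model",          # placeholder output directory
    per_device_train_batch_size=8,      # padding then happens per 8-example batch
)

trainer = Trainer(
    model=model,                        # your model, defined elsewhere
    args=training_args,
    train_dataset=train_dataset,
    data_collator=collate_fn,           # pads each batch to its own longest example
)
trainer.train()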
