.map() function extremely slow

zuzannad1 · September 13, 2023, 9:09am

Hi,
I am preprocessing the Wikipedia dataset. It’s extremely slow, with 12it/s, which totals 140h to process the dataset. Have looked online and no trace of anyone having similar issues.

I am running the script on a Slurm cluster with 128 CPUs, no GPU.

#SBATCH --ntasks=1 --cpus-per-task=128 --mem=50000M
#SBATCH --time=200:00:00

Code - should be reproducible:

import datasets 
from random import randint 
from transformers import AutoTokenizer
import gc

dataset = datasets.load_dataset("wikipedia", "20220301.en", split="train")

gc.enable()

def preprocess_examples(batch):
    documents = batch['text']
    data = {'example': [], 'summary': []}

    for document in documents:
        # Generate two random numbers for length of document and summary 
        doc_length = randint(100, 400) # Patches in the encoder
        sum_length = randint(20, 50) # Tokens in the decoder
        
        document = document.replace('\n','')
        document = document.split(' ')

        if doc_length + sum_length <= len(document):
            text = document[:doc_length]
            summary = document[doc_length:(doc_length+sum_length)]
        else:
            text = document[:int(len(document)*0.8)]
            summary = document[int(len(document)*0.8):]
        
        summary = ' '.join(summary)
        text = ' '.join(text)

        gc.collect()
        data['example'].append(text)
        data['summary'].append(summary)

    return data

if __name__ == "__main__":
    train_dataset = dataset.map(preprocess_examples, batched=True, batch_size=1000, remove_columns=["id", "url", "title", "text"])

I have tried to:

set num_proc to number of cpu cores os.cpu_count() - didn’t improve the speed
set batch size to smaller / bigger number - no effect
try to change .map() to Dataset.from_generator() - no improvement

Any ideas or hints would be greatly appreciated

mariosasko · September 13, 2023, 4:17pm

Why do you need gc.collect()? This call is super expensive. Without it, 1500-2000 examples/s is the processing speed I get in Colab.

Topic		Replies	Views
Extremely slow operation on dataset.map 🤗Datasets	0	292	June 27, 2024
Cannot preprocess wikipedia dataset 🤗Datasets	1	501	June 3, 2023
Map function skipping rows (only 8k out of 1.6M rows) 🤗Datasets	1	195	December 25, 2023
Saving dataset in the current state without cache 🤗Datasets	9	5886	March 17, 2022
When using Dataset.map to tokenize a dataset, the speed slows down as the progress approaches 100% 🤗Datasets	3	884	December 23, 2024

.map() function extremely slow

Related topics