When using Dataset.map to tokenize a dataset, the speed slows down as the progress approaches 100%

Like the title says, the speed was initially extremely fast, around 80,000 examples/s, so the estimated time was about 5 minutes. However, it actually took almost 50 minutes, because the speed gradually decreased as the progress approached 100%. Around 30% it dropped to 50,000 examples/s, at 60% it was down to 30,000 examples/s, and so on. The most severe drop occurred between 90% and 100%, where the speed fell to somewhere between 3,000 and as low as 90 examples/s. As a result, about 30 to 40 of those 50 minutes were spent in the 90-100% range.

Additional information:

  • I used LlamaTokenizer (not fast) for tokenization.
  • My dataset consists of approximately 20 million records.
  • The version of datasets used is 2.15.0.
  • Due to this issue, I needed to set multiprocess.set_start_method('spawn') to benefit from multiprocessing (see the sketch after the code block below).

from transformers import PreTrainedTokenizerBase

def _tokenize(batch: dict[str, list[str]], tokenizer: PreTrainedTokenizerBase):
    # Tokenize only the non-empty texts, without special tokens or extra outputs.
    new_batch = tokenizer(
        [x for x in batch['text'] if x],
        add_special_tokens=False,
        return_token_type_ids=False,
        return_attention_mask=False,
        verbose=False
    )

    # Add BOS/EOS manually around every tokenized sequence.
    for x in new_batch['input_ids']:
        x.insert(0, tokenizer.bos_token_id)
        x.append(tokenizer.eos_token_id)

    return new_batch

dataset = dataset.map(
    _tokenize,
    batched=True,
    remove_columns=dataset.column_names,  # remove_columns expects column names, not a bool
    fn_kwargs=dict(tokenizer=tokenizer),
    num_proc=96,
    desc='Tokenize'
)
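
A minimal sketch of where the set_start_method('spawn') call mentioned above sits, for context; the main() wrapper and the __main__ guard are assumptions about the script layout, not details from the original setup:

import multiprocess

def main():
    # 'spawn' gives each of the 96 workers a fresh interpreter instead of a fork
    # of the parent; it has to be set once, before the first dataset.map(..., num_proc=...) call.
    multiprocess.set_start_method('spawn')
    # ... build the tokenizer/dataset and run the map() call shown above ...

if __name__ == '__main__':
    main()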


Hi, have you managed to solve the problem?
I'm encountering the same issue.

Hi! First, note that if the dataset is not heterogeneous / shuffled, there may be regions of the data with shorter texts that are faster to tokenize (and regions with longer texts that are slower), so the throughput naturally varies as map() moves through the dataset.
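
To illustrate this, one option (not something prescribed in this thread, just a sketch using the standard datasets API) is to shuffle the dataset before tokenizing so that short and long texts are spread evenly; flatten_indices() materializes the shuffled order so that map() reads rows contiguously instead of through an indices mapping:

from datasets import Dataset

# Toy dataset standing in for the real 20M-record one (assumes a single 'text' column).
dataset = Dataset.from_dict(
    {'text': ['short'] * 1000 + ['a much longer piece of text ' * 50] * 1000}
)

# Interleave long and short texts, then materialize the new order so map()
# no longer has to go through an indices mapping for every row.
shuffled = dataset.shuffle(seed=42).flatten_indices()

# shuffled.map(_tokenize, batched=True, num_proc=..., ...) now sees a more
# uniform mix of text lengths in every slice of the data.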

Moreover, num_proc works by slicing the dataset into contiguous shards and passing each shard to a process that runs the map() function. So at the very end of map(), some processes may have finished transforming their shard while others are still running, which makes the reported throughput drop.
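
As a rough illustration of that slicing (an approximation of the behavior, not the exact internal code path): with num_proc=N the dataset is split into N contiguous shards, one per process, so a shard that happens to contain the longest texts keeps its worker busy after the others have finished:

from datasets import Dataset

dataset = Dataset.from_dict({'text': [f'example {i}' for i in range(1_000)]})

num_proc = 4  # stand-in for the 96 processes used in the original post
shards = [
    dataset.shard(num_shards=num_proc, index=i, contiguous=True)
    for i in range(num_proc)
]

# Each process transforms one contiguous slice; if one slice holds the longest
# texts, its process is still tokenizing while the others are already done,
# which shows up as very low examples/s near 100%.
for i, shard in enumerate(shards):
    print(i, shard.num_rows)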

Discussion on GitHub: Tokenization slows towards end of dataset · Issue #6734 · huggingface/datasets · GitHub