When using Dataset.map to tokenize a dataset, the speed slows down as the progress approaches 100%

Like the title says, the speed was initially extremely fast, around 80,000 examples/s, so the estimated time was about 5 minutes. However, it actually took almost 50 minutes, because the speed gradually decreased as the progress approached 100%. Around 30% it dropped to 50,000 examples/s, at 60% it was down to 30,000 examples/s, and so on. The most severe drop occurred between 90% and 100%, where the speed fell to somewhere between 3,000 and as low as 90 examples/s. As a result, about 30 to 40 of those 50 minutes were spent in the 90-100% range.

Additional information:

  • I used LlamaTokenizer (not fast) for tokenization.
  • My dataset consists of approximately 20 million records.
  • The version of datasets used is 2.15.0.
  • Due to this issue, I needed to set multiprocess.set_start_method('spawn') to benefit from multiprocessing (see the sketch after the code block below).

from transformers import PreTrainedTokenizerBase

def _tokenize(batch: dict[str, list[str]], tokenizer: PreTrainedTokenizerBase):
    # Tokenize only the non-empty texts, without special tokens or extra outputs.
    new_batch = tokenizer(
        [x for x in batch['text'] if x],
        add_special_tokens=False,
        return_token_type_ids=False,
        return_attention_mask=False,
        verbose=False
    )

    # Add BOS/EOS manually around every tokenized sequence.
    for x in new_batch['input_ids']:
        x.insert(0, tokenizer.bos_token_id)
        x.append(tokenizer.eos_token_id)

    return new_batch

dataset = dataset.map(
    _tokenize,
    batched=True,
    remove_columns=dataset.column_names,  # remove_columns expects column names, not a bool
    fn_kwargs=dict(tokenizer=tokenizer),
    num_proc=96,
    desc='Tokenize'
)
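
A minimal sketch of where the set_start_method('spawn') call mentioned above sits, for context; the main() wrapper and the __main__ guard are assumptions about the script layout, not details from the original setup:

import multiprocess

def main():
    # 'spawn' gives each of the 96 workers a fresh interpreter instead of a fork
    # of the parent; it has to be set once, before the first dataset.map(..., num_proc=...) call.
    multiprocess.set_start_method('spawn')
    # ... build the tokenizer/dataset and run the map() call shown above ...

if __name__ == '__main__':
    main()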


Hi, have you managed to solve the problem?
I'm encountering the same issue.

Hi! First, note that if the dataset is not heterogeneous / shuffled, there may be regions of the data with shorter texts that are faster to tokenize (and regions with longer texts that are slower), so the throughput naturally varies as map() moves through the dataset.
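
To illustrate this, one option (not something prescribed in this thread, just a sketch using the standard datasets API) is to shuffle the dataset before tokenizing so that short and long texts are spread evenly; flatten_indices() materializes the shuffled order so that map() reads rows contiguously instead of through an indices mapping:

from datasets import Dataset

# Toy dataset standing in for the real 20M-record one (assumes a single 'text' column).
dataset = Dataset.from_dict(
    {'text': ['short'] * 1000 + ['a much longer piece of text ' * 50] * 1000}
)

# Interleave long and short texts, then materialize the new order so map()
# no longer has to go through an indices mapping for every row.
shuffled = dataset.shuffle(seed=42).flatten_indices()

# shuffled.map(_tokenize, batched=True, num_proc=..., ...) now sees a more
# uniform mix of text lengths in every slice of the data.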

Moreover, num_proc works by slicing the dataset into contiguous shards and passing each shard to a process that runs the map() function. So at the very end of map(), some processes may have finished transforming their shard while others are still running, which makes the reported throughput drop.
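
As a rough illustration of that slicing (an approximation of the behavior, not the exact internal code path): with num_proc=N the dataset is split into N contiguous shards, one per process, so a shard that happens to contain the longest texts keeps its worker busy after the others have finished:

from datasets import Dataset

dataset = Dataset.from_dict({'text': [f'example {i}' for i in range(1_000)]})

num_proc = 4  # stand-in for the 96 processes used in the original post
shards = [
    dataset.shard(num_shards=num_proc, index=i, contiguous=True)
    for i in range(num_proc)
]

# Each process transforms one contiguous slice; if one slice holds the longest
# texts, its process is still tokenizing while the others are already done,
# which shows up as very low examples/s near 100%.
for i, shard in enumerate(shards):
    print(i, shard.num_rows)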

Discussion on GitHub: Tokenization slows towards end of dataset · Issue #6734 · huggingface/datasets · GitHub