Tokenizer dataset is very slow

This is my tokenization code. I found that no matter what batch_size I set, the speed stays the same, and tokenizing takes even longer than training. What can I do? Thanks very much.

def tokenize_function(example):
    return tokenizer(example["sentence1"], truncation=True, max_length=512)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, batch_size=8)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1"])

Hi! Which tokenizer are you using, and what does tokenizer.is_fast return? If it returns False, you can set num_proc > 1 to leverage multiprocessing in map. Fast tokenizers already use multithreading to process a batch in parallel within a single process, so num_proc brings no benefit there.
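For example, here is a minimal sketch of that check. It assumes a GLUE/MRPC-style dataset with a "sentence1" column and the "bert-base-uncased" checkpoint, both of which are placeholders for your actual setup:

from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholders: swap in your own dataset and checkpoint.
raw_datasets = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.is_fast)  # True = Rust-backed fast tokenizer, False = slow Python tokenizer

def tokenize_function(example):
    return tokenizer(example["sentence1"], truncation=True, max_length=512)

if tokenizer.is_fast:
    # Fast tokenizers parallelize within a batch; a larger batch_size
    # usually helps more than extra processes.
    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, batch_size=1000)
else:
    # Slow tokenizers are single-threaded, so spread the work across processes.
    tokenized_datasets = raw_datasets.map(
        tokenize_function, batched=True, batch_size=1000, num_proc=4
    )

tokenized_datasets = tokenized_datasets.remove_columns(["sentence1"])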