Tokenizer performance is slow after a call to dataset map

Hi,

I have the following code:

from transformers import AutoTokenizer
from datasets import load_dataset
import timeit

t = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train[0:10000]")
ds_with_column = ds.map(lambda x: {"extra_column": 10}, num_proc=2)  # <-- problematic line

start = timeit.default_timer()
out = t.batch_encode_plus(ds["text"])
stop = timeit.default_timer()
print(f"Time to encode: {(stop-start)} s")

The timed portion of the code takes ~4 seconds without the indicated line, vs ~45 seconds with it. Please let me know what I can do to augment the dataset with new columns while also keeping the tokenization fast. Note that the above is just a minimal example to highlight the problem; the slowdown does not occur if num_proc=1 is used in the map call.
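For context, here is a rough sketch of a variant I could use to add a constant column without spawning worker processes (assuming Dataset.add_column is appropriate for this kind of augmentation); I would still like to understand why map with num_proc=2 slows the subsequent tokenization:

# Alternative sketch: add the constant column without multiprocessing
ds_with_column = ds.add_column("extra_column", [10] * len(ds))
out = t.batch_encode_plus(ds_with_column["text"])

This avoids the multiprocess map entirely, but in my real pipeline the new columns are computed per example, so a map-based solution that keeps tokenization fast would be preferable.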