Hi,
I have the following code:
from transformers import AutoTokenizer
from datasets import load_dataset
import timeit

t = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train[0:10000]")

# Problematic line: mapping with num_proc=2 slows down the tokenization below
ds_with_column = ds.map(lambda x: {"extra_column": 10}, num_proc=2)

# Note: the timed tokenization runs on the original ds, not ds_with_column
start = timeit.default_timer()
out = t.batch_encode_plus(ds["text"])
stop = timeit.default_timer()
print(f"Time to encode: {(stop-start)} s")
The timed portion of the code takes ~4 seconds without the indicated line vs ~45 seconds with it. Please let me know what I can do to augment the dataset with new columns while also keeping the tokenization fast. Note that the above is just a minimal example to highlight the problem, and the slowdown does not happen if num_proc=1 is used in the map call.
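For reference, this is the num_proc=1 variant I mean, where the encoding stays at ~4 s. Only the map call changes, everything else is the same:

from transformers import AutoTokenizer
from datasets import load_dataset
import timeit

t = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train[0:10000]")

# Same map call, but single-process: the later tokenization stays fast
ds_with_column = ds.map(lambda x: {"extra_column": 10}, num_proc=1)

start = timeit.default_timer()
out = t.batch_encode_plus(ds["text"])
stop = timeit.default_timer()
print(f"Time to encode: {(stop-start)} s")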