On my Windows machine, running the multiprocessing code in "The map() method's superpowers" fails with `NameError: name 'slow_tokenizer' is not defined`. Binding `slow_tokenizer` as a default parameter of `slow_tokenize_function` makes the code run …
```python
from transformers import AutoTokenizer

slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)

def slow_tokenize_function(examples, slow_tokenizer=slow_tokenizer):
    return slow_tokenizer(examples["review"], truncation=True)

tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)
```
… but it’s much slower than running without `num_proc` at all.
- I think the Python multiprocessing issues on Jupyter and Windows are pretty well known (o:
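For what it's worth, the default-argument trick seems to work because on Windows `multiprocessing` uses the "spawn" start method: each worker re-imports the module and receives the mapped function by pickling, so any object bound into the function's defaults travels with it. A minimal sketch of the same idea using `functools.partial` instead (with a hypothetical `fake_tokenizer` stand-in rather than a real `AutoTokenizer`, just to keep the demo self-contained):

```python
import pickle
from functools import partial

def fake_tokenizer(text):
    # hypothetical stand-in for a real tokenizer, so the demo is self-contained
    return text.split()

def slow_tokenize_function(examples, tokenizer):
    # 'tokenizer' arrives as a bound argument, so the callable carries it
    # along when it is pickled and shipped to a worker process
    return [tokenizer(text) for text in examples]

# bind the tokenizer without touching the function's signature
bound = partial(slow_tokenize_function, tokenizer=fake_tokenizer)

# round-trip through pickle, as "spawn"-based multiprocessing does on Windows
restored = pickle.loads(pickle.dumps(bound))
print(restored(["a review", "another one"]))  # [['a', 'review'], ['another', 'one']]
```

`partial` is handy when you'd rather not edit the function definition itself; the default-argument version from the course snippet achieves the same binding.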