On my Windows machine, running the multiprocessing code in "The map() method's superpowers" fails with `NameError: name 'slow_tokenizer' is not defined`. Binding `slow_tokenizer` as a default parameter of `slow_tokenize_function` makes the code run …
```python
from transformers import AutoTokenizer

slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)

def slow_tokenize_function(examples, slow_tokenizer=slow_tokenizer):
    return slow_tokenizer(examples["review"], truncation=True)

tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)
```
… but it’s much slower than running without `num_proc` at all.
- I think the Python multiprocessing issues on Jupyter and Windows are pretty well known (o:
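For what it's worth, the default-argument trick seems to work because on Windows `multiprocessing` uses the "spawn" start method: each worker re-imports the module and receives the mapped function by pickling, so any object bound into the function's defaults travels with it. A minimal sketch of the same idea using `functools.partial` instead (with a hypothetical `fake_tokenizer` stand-in rather than a real `AutoTokenizer`, just to keep the demo self-contained):

```python
import pickle
from functools import partial

def fake_tokenizer(text):
    # hypothetical stand-in for a real tokenizer, so the demo is self-contained
    return text.split()

def slow_tokenize_function(examples, tokenizer):
    # 'tokenizer' arrives as a bound argument, so the callable carries it
    # along when it is pickled and shipped to a worker process
    return [tokenizer(text) for text in examples]

# bind the tokenizer without touching the function's signature
bound = partial(slow_tokenize_function, tokenizer=fake_tokenizer)

# round-trip through pickle, as "spawn"-based multiprocessing does on Windows
restored = pickle.loads(pickle.dumps(bound))
print(restored(["a review", "another one"]))  # [['a', 'review'], ['another', 'one']]
```

`partial` is handy when you'd rather not edit the function definition itself; the default-argument version from the course snippet achieves the same binding.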