Well, not much different. We need to be careful with IterableDataset, though…
Same issue here. Do you have a solution?
What worked for me was importing AutoTokenizer from transformers and defining the tokenizer inside tokenize_function, but then all the time goes into initializing variables, and in the end it takes basically the same time as the original solution (without num_proc), or more.
I think it's related to parallelization: there's no tokenizer defined in each thread. But there must be another way…
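One idea that might avoid the repeated initialization is to build the tokenizer lazily and cache it in a module-level variable, so each process pays the setup cost only once instead of on every call. This is only a minimal sketch: `WhitespaceTokenizer`, `get_tokenizer`, `LOAD_COUNT`, and the `"text"` column name are hypothetical stand-ins so the snippet runs without downloading a model. In a real setup the marked line would presumably be replaced by the `AutoTokenizer.from_pretrained(...)` call.

```python
# Sketch: cache the tokenizer per process instead of rebuilding it on every call.
_TOKENIZER = None   # filled on first use in each process
LOAD_COUNT = 0      # illustration only: counts how often the expensive load runs


class WhitespaceTokenizer:
    """Hypothetical stand-in for a real tokenizer (no download needed)."""

    def __call__(self, texts):
        # Fake "ids": the length of each whitespace-separated word.
        return {"input_ids": [[len(word) for word in text.split()] for text in texts]}


def get_tokenizer():
    global _TOKENIZER, LOAD_COUNT
    if _TOKENIZER is None:
        LOAD_COUNT += 1
        # Assumption: in real code this would be the AutoTokenizer.from_pretrained(...) call.
        _TOKENIZER = WhitespaceTokenizer()
    return _TOKENIZER


def tokenize_function(examples):
    # "text" is an assumed column name; adjust to the dataset at hand.
    tokenizer = get_tokenizer()
    return tokenizer(examples["text"])


# Simulate two batches handled by the same process.
batches = [{"text": ["hello world"]}, {"text": ["a bc def"]}]
results = [tokenize_function(batch) for batch in batches]
```

Within a single process the expensive load runs once, no matter how many batches go through tokenize_function. I have not checked how this behaves when the function is passed to map with num_proc, so treat it as something to try rather than a confirmed fix.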