Chapter 5 questions

Well, not much different. We need to be careful with IterableDataset, though…

Same issue here. Do you have a solution?

What worked for me was importing AutoTokenizer from transformers and defining the tokenizer inside tokenize_function. The catch is that the tokenizer then gets re-initialized on every call, so in the end it takes about as long as the original solution (without num_proc), or longer.

I think it’s related to parallelization: with num_proc, there’s no tokenizer defined in each worker process. But there must be another way…
