No, the batch size does not need to match the training batch size. The default in the `Dataset.map`
method is 1,000 examples, which is more than enough for this use case. As for why it's faster, it's all explained in the course: fast tokenizers need many texts at once to be able to leverage parallelism in their Rust backend (a bit like a GPU needs a batch of examples to be used efficiently).