No, the batch size does not need to match the training batch size. The default in the `Dataset.map`
method is 1,000 examples, which is more than enough for this use case. As for why it's faster, it's all explained in the course: fast tokenizers need many texts at once to be able to leverage parallelism in their Rust backend (a bit like a GPU needs a batch of examples to be used efficiently).