Why use batched=True in the map function?

I have a dataset and a tokenizer:

dataset = load_dataset(path='/Users/petar/Documents/data', split='train')


def encode(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=128)


# dataset = dataset.shuffle()
dataset = dataset.map(encode, batched=True)  # Use num_proc=N. Investigate why use batched=True?

I see that when I use batched=True, the tokenization happens significantly faster. What is the reason, and is there any difference if I train the model on the batched data vs. the unbatched data? If yes, what should batch_size be? Should it match per_device_train_batch_size in TrainingArguments?
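
For context, my training arguments look roughly like this (the values below are just placeholders):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # placeholder output directory
    per_device_train_batch_size=16,  # the training batch size I am asking about
    num_train_epochs=3,
)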

I also have a question about DataCollatorForLanguageModeling:

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

I pass it to my trainer:

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

But why do I need to specify the tokenizer in the data collator when I already have tokenized my data with the map function?

Regards


No, the batch size should not be the same as the one used for training. The default in the Dataset.map method is 1,000, which is more than enough for this use case. As for why it's faster, it's all explained in the course: fast tokenizers need a lot of texts at once to be able to leverage parallelism in Rust (a bit like a GPU needs a batch of examples to be more efficient).
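
To make the distinction concrete, here is a minimal sketch (batch_size=1000 is simply the default, and num_proc=4 is just an illustrative value):

# With batched=True, examples['text'] inside encode is a list of strings,
# so the fast tokenizer can process the whole batch in parallel in Rust.
# batch_size only controls how many examples are handed to encode at once
# during preprocessing; it is unrelated to the training batch size.
dataset = dataset.map(encode, batched=True, batch_size=1000, num_proc=4)

# The batch size used at training time is set separately, e.g. through
# per_device_train_batch_size in TrainingArguments.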


I will have to go through the course one of these days. Thanks!