Why use batched=True in the map function?

I have a dataset and a tokenizer:

dataset = load_dataset(path='/Users/petar/Documents/data', split='train')


def encode(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=128)


# dataset = dataset.shuffle()
dataset = dataset.map(encode, batched=True)  # Use num_proc=N. Investigate why use batched=True?

I see that when I use batched=True, the tokenization happens significantly faster. What is the reason, and is there any difference if I train the model on the batched data vs. the unbatched data? If yes, what should batch_size be? Should it match per_device_train_batch_size in TrainingArguments?
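
For context, my training arguments look roughly like this (the values below are just placeholders):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # placeholder output directory
    per_device_train_batch_size=16,  # the training batch size I am asking about
    num_train_epochs=3,
)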

I also have a question about DataCollatorForLanguageModeling:

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

I pass it to my trainer:

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

But why do I need to specify the tokenizer in the data collator when I already have tokenized my data with the map function?

Regards


No, the batch size should not be the same as the one used for training. The default in the Dataset.map method is 1,000, which is more than enough for this use case. As for why it's faster, it's all explained in the course: fast tokenizers need a lot of texts at once to be able to leverage parallelism in Rust (a bit like a GPU needs a batch of examples to be more efficient).
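
To make the distinction concrete, here is a minimal sketch (batch_size=1000 is simply the default, and num_proc=4 is just an illustrative value):

# With batched=True, examples['text'] inside encode is a list of strings,
# so the fast tokenizer can process the whole batch in parallel in Rust.
# batch_size only controls how many examples are handed to encode at once
# during preprocessing; it is unrelated to the training batch size.
dataset = dataset.map(encode, batched=True, batch_size=1000, num_proc=4)

# The batch size used at training time is set separately, e.g. through
# per_device_train_batch_size in TrainingArguments.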


I will have to go through the course one of these days. Thanks!