I have a dataset and a tokenizer:
from datasets import load_dataset

dataset = load_dataset(path='/Users/petar/Documents/data', split='train')
def encode(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=128)
# dataset = dataset.shuffle()
dataset = dataset.map(encode, batched=True)  # could also pass num_proc=N; why does batched=True help?
I see that when I use batched=True, the tokenization happens significantly faster. What is the reason for this, and is there any difference when I train the model on data that was mapped with batched=True versus without it? If so, what should batch_size be? Should it match per_device_train_batch_size in TrainingArguments?
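For context, this is roughly how I compared the two mapping modes (my own sketch; batch_size=1000 is just a value I tried, not a recommendation):

import time

t0 = time.time()
ds_per_example = dataset.map(encode)  # encode is called once per row, examples['text'] is a single string
print('per-example map:', time.time() - t0)

t0 = time.time()
ds_batched = dataset.map(encode, batched=True, batch_size=1000)  # encode is called once per chunk, examples['text'] is a list of strings
print('batched map:', time.time() - t0)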
I also have a question about DataCollatorForLanguageModeling:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
I pass it to my trainer:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
But why do I need to specify the tokenizer in the data collator when I have already tokenized my data with the map function?
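To make my question concrete, this is how I currently picture the collator being used at batch time (my own sketch, not code taken from the Trainer itself):

features = [{'input_ids': dataset[i]['input_ids']} for i in range(4)]  # a few already-tokenized examples
batch = data_collator(features)   # pads the batch and applies the random masking
print(batch['input_ids'].shape)   # some positions replaced with tokenizer.mask_token_id
print(batch['labels'].shape)      # original ids at the masked positions, -100 everywhere else

It seems the tokenizer is needed again here for things like the pad and mask token ids, but I would like to confirm that.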
Regards