Hello Everyone,
I have a question about batching. I am using the following to process datasets for fine-tuning:
```python
test_ds = Dataset.from_pandas(test_df[['idx', 'question', 'sentence', 'label']])
tokenizer_robert_base = AutoTokenizer.from_pretrained(robert_base)

def preprocess_function_roberta_base(examples):
    return tokenizer_robert_base(examples[sentence1_key], examples[sentence2_key],
                                 max_length=max_input_length, truncation='only_second')

test_ds_en = test_ds.map(preprocess_function_roberta_base)
```
The .map() method has the parameters batched=True and batch_size=batch_size. When and why would I use these? I ask because I presumed that the batching in TrainingArguments below is where the batching should happen. What happens if I use both?
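To make my mental model concrete, here is a toy sketch (hypothetical helper names, not the library's actual implementation) of what I understand batched=True to change: with batched=False the function receives one example at a time, while with batched=True it receives a dict of column lists with up to batch_size values each.

```python
# Toy sketch of the batched=False vs batched=True calling convention of
# datasets.Dataset.map(); map_unbatched/map_batched are hypothetical stand-ins.

def map_unbatched(rows, fn):
    # batched=False: fn is called once per example, receiving a single dict.
    return [fn(row) for row in rows]

def map_batched(rows, fn, batch_size=2):
    # batched=True: fn is called once per batch, receiving a dict of lists
    # (one list per column) with up to batch_size values each.
    out = []
    for i in range(0, len(rows), batch_size):
        chunk = rows[i:i + batch_size]
        batch = {k: [r[k] for r in chunk] for k in chunk[0]}
        result = fn(batch)  # fn must accept lists, as a fast tokenizer does
        out.extend({k: v[j] for k, v in result.items()} for j in range(len(chunk)))
    return out

rows = [{"sentence": "a"}, {"sentence": "bb"}, {"sentence": "ccc"}]

per_example = lambda ex: {"length": len(ex["sentence"])}          # batched=False style
per_batch   = lambda ex: {"length": [len(s) for s in ex["sentence"]]}  # batched=True style

print(map_unbatched(rows, per_example))          # [{'length': 1}, {'length': 2}, {'length': 3}]
print(map_batched(rows, per_batch, batch_size=2))  # same result, fewer function calls
```

So, if I understand correctly, batched=True is only about how the preprocessing function is invoked, not about training batches.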
```python
args = TrainingArguments(
    'checkpoints/',
    evaluation_strategy="steps",
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=epochs,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    gradient_accumulation_steps=2)
```