Dataset and Training Batching

Hello Everyone,

I have a question about batching. I am using the following to process datasets for fine-tuning:

from datasets import Dataset
from transformers import AutoTokenizer

test_ds = Dataset.from_pandas(test_df[['idx', 'question', 'sentence', 'label']])
tokenizer_robert_base = AutoTokenizer.from_pretrained(robert_base)

def preprocess_function_roberta_base(examples):
    return tokenizer_robert_base(examples[sentence1_key], examples[sentence2_key],
                                 max_length=max_input_length, truncation='only_second')

test_ds_en = test_ds.map(preprocess_function_roberta_base)

The .map() method has the parameters batched=True and batch_size=batch_size. When and why would I use these? I ask because I presumed that the batching in TrainingArguments below is where batching should happen. What happens if I use both?

args = TrainingArguments(
    'checkpoints/',
    evaluation_strategy="steps",
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=epochs,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    gradient_accumulation_steps=2)

Hi!

With the batched flag in map, you control whether your map function receives a single example or a batch of examples (whose size is determined by batch_size, 1000 by default) in a single call. It is advised to set batched=True whenever possible for better performance, e.g. our fast tokenizers can encode a whole batch in parallel.
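
To make that concrete, here is a minimal, self-contained sketch (the column names, checkpoint, and max_length are just illustrative, not taken from your setup) showing that with batched=True the preprocessing function receives lists of values rather than single values:

from datasets import Dataset
from transformers import AutoTokenizer

# toy dataset; column names are illustrative only
ds = Dataset.from_dict({
    "question": ["Is it raining?", "Is it sunny?"],
    "sentence": ["It is raining heavily.", "The sky is clear."],
    "label": [1, 0],
})
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def preprocess(examples):
    # with batched=True, examples["question"] is a list of up to batch_size strings,
    # so the fast tokenizer encodes the whole batch in one call
    return tokenizer(examples["question"], examples["sentence"],
                     max_length=128, truncation="only_second")

# the function is called once per batch of up to 1000 rows (here, once in total)
ds_tok = ds.map(preprocess, batched=True, batch_size=1000)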

And the Trainer’s batch size controls the number of examples the model gets in one iteration of training/evaluation. The two settings are independent: the batching in .map() only affects preprocessing, so using both is perfectly fine.
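
As a rough sketch continuing from the ds_tok above (the checkpoint and hyperparameter values are placeholders, not a recommendation): map's batch_size only changes how many rows the preprocessing function sees per call, while per_device_train_batch_size decides how many tokenized examples end up in each training step.

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

args = TrainingArguments(
    "checkpoints/",
    per_device_train_batch_size=16,   # 16 examples per forward/backward pass per device
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,    # effective train batch of 32 per device
    num_train_epochs=3,
)

# regardless of the batch_size used in .map(), the Trainer re-batches the
# tokenized dataset according to per_device_train_batch_size
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=ds_tok, eval_dataset=ds_tok)
trainer.train()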