Dataset and Training Batching

Hello Everyone,

I have a question about batching. I am using the following to process datasets for fine-tuning:

from datasets import Dataset
from transformers import AutoTokenizer

test_ds = Dataset.from_pandas(test_df[['idx', 'question', 'sentence', 'label']])
tokenizer_robert_base = AutoTokenizer.from_pretrained(robert_base)

def preprocess_function_roberta_base(examples):
    return tokenizer_robert_base(examples[sentence1_key], examples[sentence2_key],
                                 max_length=max_input_length, truncation='only_second')

test_ds_en = test_ds.map(preprocess_function_roberta_base)

The .map() method has the parameters batched=True and batch_size=batch_size. When and why would I use these? I ask because I presumed that the batching in TrainingArguments below is where batching should happen. What happens if I use both?

args = TrainingArguments(
    'checkpoints/',
    evaluation_strategy="steps",
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=epochs,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    gradient_accumulation_steps=2)

Hi!

With the batched flag in map, you control whether your map function receives a single example or a batch of examples (whose size is determined by batch_size, 1000 by default) in a single call. It is advised to set batched=True whenever possible for better performance, e.g. our fast tokenizers can encode a whole batch in parallel.
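
To make that concrete, here is a minimal, self-contained sketch (the column names, checkpoint, and max_length are just illustrative, not taken from your setup) showing that with batched=True the preprocessing function receives lists of values rather than single values:

from datasets import Dataset
from transformers import AutoTokenizer

# toy dataset; column names are illustrative only
ds = Dataset.from_dict({
    "question": ["Is it raining?", "Is it sunny?"],
    "sentence": ["It is raining heavily.", "The sky is clear."],
    "label": [1, 0],
})
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def preprocess(examples):
    # with batched=True, examples["question"] is a list of up to batch_size strings,
    # so the fast tokenizer encodes the whole batch in one call
    return tokenizer(examples["question"], examples["sentence"],
                     max_length=128, truncation="only_second")

# the function is called once per batch of up to 1000 rows (here, once in total)
ds_tok = ds.map(preprocess, batched=True, batch_size=1000)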

And the Trainer’s batch size controls the number of examples the model gets in one iteration of training/evaluation. The two settings are independent: the batching in .map() only affects preprocessing, so using both is perfectly fine.
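
As a rough sketch continuing from the ds_tok above (the checkpoint and hyperparameter values are placeholders, not a recommendation): map's batch_size only changes how many rows the preprocessing function sees per call, while per_device_train_batch_size decides how many tokenized examples end up in each training step.

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

args = TrainingArguments(
    "checkpoints/",
    per_device_train_batch_size=16,   # 16 examples per forward/backward pass per device
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,    # effective train batch of 32 per device
    num_train_epochs=3,
)

# regardless of the batch_size used in .map(), the Trainer re-batches the
# tokenized dataset according to per_device_train_batch_size
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=ds_tok, eval_dataset=ds_tok)
trainer.train()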