TrainingArguments class - max_steps formula when using a streaming dataset

Objective

Need a definite formula for choosing the value of max_steps when using a streaming dataset.

Background

Several questions have been raised about max_steps when using a streaming dataset.

According to the documentation, it should be set to the total number of training steps, i.e. the total number of mini-batches:

If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs.

I am afraid this is not clear enough.

Question

Suppose there is a small dataset of 2048 rows in the train split of a Hugging Face Dataset, and the training arguments are set as below, with max_steps still to be determined.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bloom_finetuned",
    max_steps=MAX_STEPS,  # the value in question
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=2e-5,
    weight_decay=0.01,
    no_cuda=False,
)

Then, for a system with a single GPU:

MAX_STEPS = num_train_epochs * num_rows_in_train / per_device_train_batch_size

Where:

  • num_rows_in_train=2048 is the total number of records in the training dataset
  • per_device_train_batch_size=1 is the batch size sent to each GPU
  • num_train_epochs=3 is the number of epochs to run

Is this correct?
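As a sanity check, plugging in the concrete values from the setup above gives the following (a minimal sketch; note the result matches the "Total optimization steps = 6,144" reported in the training log below):

num_train_epochs = 3
num_rows_in_train = 2048
per_device_train_batch_size = 1

# one optimization step consumes one mini-batch, so divide the total
# number of examples seen across all epochs by the mini-batch size
MAX_STEPS = num_train_epochs * num_rows_in_train // per_device_train_batch_size
print(MAX_STEPS)  # 6144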

If multiple GPU devices are used in parallel, then:

MAX_STEPS = num_train_epochs * num_rows_in_train / per_device_train_batch_size / num_gpu_devices

Is this correct?
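For example, with a hypothetical num_gpu_devices=2 (each parallel device consumes its own mini-batch per step; a gradient_accumulation_steps greater than 1 would divide the step count in the same way):

num_gpu_devices = 2  # hypothetical value, for illustration only

MAX_STEPS = (num_train_epochs * num_rows_in_train
             // (per_device_train_batch_size * num_gpu_devices))
print(MAX_STEPS)  # 3072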

Confirmation

The Trainer log shows a huge number of epochs for the above settings. Is it supposed to be like this?

***** Running training *****
  Num examples = 6,144
  Num Epochs = 9,223,372,036,854,775,807      <-----
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 6,144
  Number of trainable parameters = 559,214,592

The reason for per_device_train_batch_size=1 is that training BLOOM consumes so much GPU memory that the batch size cannot be set > 1.

I can’t answer your questions, but I did see Num Epochs hit this large number. In the code here, setting max_steps overrides num_train_epochs: the Trainer assigns sys.maxsize so that the rest of the code effectively ignores num_train_epochs.
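A quick check confirms that the epoch count in the log is exactly sys.maxsize on a 64-bit build of Python:

import sys

# sys.maxsize is 2**63 - 1 on 64-bit systems, i.e. the sentinel value
# that shows up as "Num Epochs" in the training log above
print(sys.maxsize)  # 9223372036854775807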