CUDA out of memory when using Trainer with compute_metrics

Recently, I have been trying to fine-tune Bart-base with Transformers (version 4.1.1). Fine-tuning runs smoothly with compute_metrics=None in Trainer. However, when I implement a metrics function and pass it to Trainer, I get a CUDA out of memory error during the evaluation stage.

I only want to try this feature for now, so my implementation is deliberately simple:

def compute_metrics(pred):
    # pred is an EvalPrediction holding the model predictions and reference labels
    preds = pred.predictions
    labels = pred.label_ids
    print(preds.shape, labels.shape)
    # Dummy metric for now; real metrics would be computed here
    return {
        'loss': 1
    }

Since training is normal when compute_metrics=None, I don't think batch size is the problem. I still tried smaller batch sizes, but even with a batch size of 1 the situation is the same. I even tried tinier-bart, and I still got the same error in the end.

One thing caught my attention :thinking:. With a tiny batch size the memory does not fill up at once, but the occupancy keeps growing during evaluation until the GPU can't hold any more data. That is, the processed data is not released in time. Is there any magic operation to solve this problem?

When computing metrics inside the Trainer, your predictions are all gathered together on the device (GPU/TPU) and only passed back to the CPU at the end (because that operation can be slow). If your dataset is large (or your model outputs large predictions) you can use eval_accumulation_steps to set a number of steps after which your predictions are sent back to the CPU (slower but uses less device memory). This should avoid your OOM.
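For instance, something along these lines (a minimal sketch; the values are only illustrative and need tuning for your setup):

from transformers import TrainingArguments

# Illustrative values only -- tune eval_accumulation_steps to your memory budget.
args = TrainingArguments(
    output_dir="output",
    evaluation_strategy="steps",
    per_device_eval_batch_size=32,
    # Move accumulated predictions from the device to the CPU every 10 eval steps
    # instead of keeping everything on the device until the end of evaluation.
    eval_accumulation_steps=10,
)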

I have tried eval_accumulation_steps, but it leads to another problem.

Here is part of my fine-tuning code:

args = TrainingArguments(
    output_dir="exp/bart/results",
    do_train=True,
    do_eval=True,
    evaluation_strategy="steps",
    eval_steps=1000,
    logging_dir="exp/bart/logs",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,
    eval_accumulation_steps=1,
)

trainer = Trainer(
    model=bart,
    args=args,
    data_collator=collate_fn,
    train_dataset=train_set,
    eval_dataset=eval_set,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Also, the evaluation set has 22,161 lines. When I set eval_accumulation_steps=1, I receive:

MemoryError: Unable to allocate 149. GiB for an array with shape (22162, 36, 50265) and data type float32

It seems the Trainer allocates all of that space at once. Do I need to set other parameters so that the Trainer only allocates the space it actually needs each time?

This error means the predictions you are trying to gather just don't fit in RAM, so there is nothing the Trainer can do to help. I don't know which Bart model you're using, but those logits are huge: an array of shape (22162, 36, 50265) in float32 is 22,162 × 36 × 50,265 × 4 bytes ≈ 149 GiB. You should either split your evaluation dataset into small parts or use a custom evaluation loop.
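As a minimal sketch of the first option (this assumes eval_set is a map-style dataset and reuses the trainer from above; the chunk size is arbitrary), you could evaluate the dataset chunk by chunk so that only one chunk's logits are held in memory at a time:

from torch.utils.data import Subset

# Pick chunk_size so that (chunk_size, seq_len, vocab_size) float32 logits fit in RAM.
chunk_size = 2000
all_metrics = []
for start in range(0, len(eval_set), chunk_size):
    indices = list(range(start, min(start + chunk_size, len(eval_set))))
    chunk_metrics = trainer.evaluate(eval_dataset=Subset(eval_set, indices))
    all_metrics.append(chunk_metrics)

You would then combine the per-chunk metrics yourself (for example, averaging weighted by chunk size). Alternatively, a custom evaluation loop can reduce the logits (e.g. an argmax over the vocabulary dimension) before storing them, since most metrics do not need the full distribution.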