Trainer evaluates on my entire validation set (60k), which causes a CUDA out-of-memory error. Is there a parameter that allows evaluating on only some batches of the validation set?

Hello, I’m using Trainer to train a BERT model. My training dataset has 600k examples and my validation set has 60k. During training, the model stops at step 2500 and starts evaluating on the validation set. It loops through the whole validation set, which leads to a CUDA out-of-memory error. Is there a way to have the model evaluate on only some batches of the validation set?

I checked all the parameters in training_args but had no luck, so I’m posting here for some help.

training_args = TrainingArguments(
    output_dir='./scm-bert-base-uncased-v12/',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=6,  # batch size per device during training
    per_device_eval_batch_size=6,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./scm-bert-base-uncased-v12/logs',            # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=2500,               # log & save weights each logging_steps
    gradient_accumulation_steps=2,   # accumulate gradients over 2 batches per optimizer step
    evaluation_strategy="steps",   # evaluate each `logging_steps`
    report_to="none"
)

Training (stops at step 2500):
***** Running training *****

Num examples = 631296
Num Epochs = 1
Instantaneous batch size per device = 6
Total train batch size (w. parallel, distributed & accumulation) = 12
Gradient Accumulation steps = 2
Total optimization steps = 52608
5%|▍ | 2500/52608 [03:04<1:00:23, 13.83it/s]***** Running Evaluation *****

Evaluating (needs to loop through all 11304 steps and fails partway through):
RuntimeError: CUDA out of memory. Tried to allocate 1.02 GiB (GPU 0; 31.75 GiB total capacity; 29.14 GiB already allocated; 1.02 GiB free; 29.30 GiB reserved in total by PyTorch)

52%|█████▏ | 5916/11304 [06:19<07:41, 11.67it/s]


You can use the eval_accumulation_steps argument to move the predictions back to the CPU every few evaluation steps (you choose the number). This should help with your OOM error.
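As a minimal sketch of that suggestion, the argument can be added to the TrainingArguments from the question. The value 50 is only an illustrative assumption, not a recommendation from this thread; a smaller value should lower peak GPU memory at the cost of more frequent transfers. The other arguments from the original call are omitted here for brevity.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./scm-bert-base-uncased-v12/",  # output directory from the question
    per_device_eval_batch_size=6,               # batch size for evaluation, as in the question
    eval_accumulation_steps=50,                 # assumed value: move predictions to the CPU every 50 eval steps
    report_to="none",
)
```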

You’re awesome!! Just to confirm: evaluation is still done on the entire validation set, right? I see from the output that the model finished all the required steps. So eval_accumulation_steps only serves to manage memory (accumulate some predictions, then send them to the CPU so that GPU memory is freed for the following steps)?

Yes, the whole evaluation set is processed. It just moves the predictions to the CPU regularly instead of accumulating everything on the GPU. It’s a tad slower but, as you saw, it avoids OOM errors :wink:
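To make the mechanism concrete, here is a toy, pure-Python sketch of the idea. It is not Trainer's actual implementation: the names evaluate_with_accumulation, device_buffer and host_store are hypothetical, plain lists stand in for GPU and CPU memory, and doubling each value stands in for the model's forward pass.

```python
def evaluate_with_accumulation(batches, accumulation_steps=None):
    """Toy model of how predictions are gathered during evaluation.

    `device_buffer` stands in for predictions held on the GPU and
    `host_store` for predictions already moved to the CPU.
    """
    device_buffer = []
    host_store = []
    peak_on_device = 0  # largest number of batches held "on the GPU" at once

    for step, batch in enumerate(batches, start=1):
        predictions = [x * 2 for x in batch]  # stand-in for the model's forward pass
        device_buffer.append(predictions)
        peak_on_device = max(peak_on_device, len(device_buffer))

        # Every `accumulation_steps` steps, move what has accumulated to the host.
        if accumulation_steps is not None and step % accumulation_steps == 0:
            host_store.extend(device_buffer)
            device_buffer.clear()

    # Move whatever is left once the loop ends.
    host_store.extend(device_buffer)
    return host_store, peak_on_device


batches = [[i, i + 1] for i in range(0, 20, 2)]  # 10 small batches
all_at_once, peak_without = evaluate_with_accumulation(batches)
in_chunks, peak_with = evaluate_with_accumulation(batches, accumulation_steps=3)
```

Both calls return predictions for every batch in the same order. Only the peak number of batches held in the device buffer differs, which mirrors why the full validation set is still evaluated while the GPU memory pressure drops.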

Thank you!! :blush:
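On the original question of evaluating on only part of the validation set: one possible approach, not discussed in the replies above, is to build a smaller evaluation dataset yourself and hand that to Trainer. The sketch below only picks the indices. The helper name pick_eval_indices and the sizes 60_000 and 5_000 are assumptions for illustration, and the commented lines assume the validation set is either a Hugging Face datasets.Dataset or a PyTorch dataset.

```python
import random


def pick_eval_indices(num_examples, num_keep, seed=0):
    """Return a sorted, reproducible random sample of example indices."""
    num_keep = min(num_keep, num_examples)  # never ask for more than exists
    rng = random.Random(seed)               # fixed seed so every evaluation uses the same subset
    return sorted(rng.sample(range(num_examples), num_keep))


# Illustrative sizes: roughly 60k validation examples, keep 5k for evaluation.
indices = pick_eval_indices(num_examples=60_000, num_keep=5_000)

# With a Hugging Face `datasets.Dataset`:
#     small_eval = eval_dataset.select(indices)
# With a PyTorch dataset:
#     small_eval = torch.utils.data.Subset(eval_dataset, indices)
# The subset can then be passed as `eval_dataset=` when constructing Trainer,
# or to `trainer.evaluate(eval_dataset=small_eval)`.
```

Keeping the seed fixed means each evaluation run during training scores the same examples, so the metrics stay comparable across steps.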