Trainer evaluates on my entire validation set (60k), which causes a CUDA out-of-memory error. Is there a parameter that allows evaluating on only some batches of the validation set?

Keyu · April 28, 2022, 5:40am

Hello, I’m using Trainer to train a BERT model. My training dataset has about 600k examples and my validation set about 60k. During training, the model pauses at step 2500 and starts evaluating on the validation set. It loops through the whole validation set, which leads to a CUDA out-of-memory error. Is there a way to have the model evaluate on only some batches of the validation set?

I checked all the parameters in training_args but had no luck, so I’m posting here for some help.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./scm-bert-base-uncased-v12/',         # output directory
    num_train_epochs=1,                                # total number of training epochs
    per_device_train_batch_size=6,                     # batch size per device during training
    per_device_eval_batch_size=6,                      # batch size per device during evaluation
    warmup_steps=500,                                  # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,                                 # strength of weight decay
    logging_dir='./scm-bert-base-uncased-v12/logs',    # directory for storing logs
    load_best_model_at_end=True,                       # load the best model when training finishes (default metric is loss);
    # you can set the `metric_for_best_model` argument to use accuracy or another metric instead
    logging_steps=2500,                                # log every `logging_steps` steps
    eval_steps=2500,                                   # evaluate every `eval_steps` steps
    save_steps=2500,                                   # save a checkpoint every `save_steps` steps
    gradient_accumulation_steps=2,                     # accumulate gradients over 2 batches per optimizer step
    evaluation_strategy="steps",                       # run evaluation on a step schedule (see `eval_steps`)
    report_to="none"                                   # disable reporting integrations
)

Training (pauses at step 2500 to run evaluation):
***** Running training *****

Num examples = 631296
Num Epochs = 1
Instantaneous batch size per device = 6
Total train batch size (w. parallel, distributed & accumulation) = 12
Gradient Accumulation steps = 2
Total optimization steps = 52608
5%|▍ | 2500/52608 [03:04<1:00:23, 13.83it/s]
***** Running Evaluation *****

Evaluating (needs to loop through all 11304 steps and fails partway through):

RuntimeError: CUDA out of memory. Tried to allocate 1.02 GiB (GPU 0; 31.75 GiB total capacity; 29.14 GiB already allocated; 1.02 GiB free; 29.30 GiB reserved in total by PyTorch)

52%|█████▏ | 5916/11304 [06:19<07:41, 11.67it/s]
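The step counts in the logs above can be reproduced with a little arithmetic. This is only a sanity-check sketch: it assumes a single GPU (the error message mentions only GPU 0), and the variable names are illustrative, not part of the Trainer API.

```python
import math

# Values taken from the TrainingArguments and the training log above.
num_train_examples = 631_296
per_device_batch = 6
grad_accum = 2
num_devices = 1  # assumption: single GPU, since only "GPU 0" appears in the error

# One optimizer step consumes per_device_batch * grad_accum * num_devices examples.
effective_batch = per_device_batch * grad_accum * num_devices
optimization_steps = math.ceil(num_train_examples / effective_batch)

# The evaluation progress bar shows 11304 steps at an eval batch size of 6,
# so the validation set holds at most this many examples.
eval_steps_shown = 11_304
max_eval_examples = eval_steps_shown * per_device_batch * num_devices

print(effective_batch, optimization_steps, max_eval_examples)
```

Under that assumption, the effective batch size and step count match the "Total train batch size" and "Total optimization steps" lines in the log, and the evaluation loop covers the full validation set rather than a sample of it.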


sgugger · April 28, 2022, 11:54am

You can use the eval_accumulation_steps argument to move the predictions back to the CPU every N evaluation steps. This should help with your OOM error.
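A minimal sketch of what this could look like, reusing the output directory and eval batch size from the question. The value 50 is a hypothetical starting point, not a recommendation from this thread; smaller values move predictions off the GPU more often at some cost in speed.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./scm-bert-base-uncased-v12/",
    per_device_eval_batch_size=6,
    # Hypothetical value: after every 50 evaluation steps, the accumulated
    # predictions are moved from the GPU to the CPU. When this is left unset,
    # all predictions stay on the GPU until evaluation finishes.
    eval_accumulation_steps=50,
)
```

The remaining arguments from the original configuration can be kept unchanged alongside this one.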

Keyu · April 28, 2022, 5:09pm

You’re awesome!! Just to confirm: evaluation is still done on the entire validation set, right? From the output, I can see the model finished all the required steps. So eval_accumulation_steps is only for managing memory (accumulate some predictions, then send them to the CPU so the GPU is free for the next steps)?

sgugger · April 29, 2022, 11:54am

Yes, the whole evaluation set is processed. It just moves the predictions to the CPU regularly instead of accumulating everything on the GPU. It’s a tad slower but, as you saw, it avoids OOM errors.
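If evaluating on only part of the validation set is still wanted, as in the original question, one possible approach (not something suggested in this thread) is to hand Trainer a smaller evaluation dataset. The sketch below only picks a reproducible random set of indices; `sample_eval_indices` is a hypothetical helper, and the 60,000 and 5,000 figures are illustrative.

```python
import random


def sample_eval_indices(num_examples, num_keep, seed=0):
    """Return a sorted, reproducible random sample of example indices."""
    if num_keep >= num_examples:
        # Nothing to drop: keep every example.
        return list(range(num_examples))
    rng = random.Random(seed)  # fixed seed so every evaluation uses the same subset
    return sorted(rng.sample(range(num_examples), num_keep))


indices = sample_eval_indices(num_examples=60_000, num_keep=5_000)

# Assuming `eval_dataset` is the full validation set:
#   Hugging Face `datasets.Dataset`:  small_eval = eval_dataset.select(indices)
#   PyTorch dataset:                  small_eval = torch.utils.data.Subset(eval_dataset, indices)
# `small_eval` would then be passed to Trainer as `eval_dataset`.
```

Keeping the seed fixed means the metrics logged at different steps are computed on the same examples and stay comparable, at the price of a noisier estimate than the full set would give.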

Keyu · April 29, 2022, 6:06pm

Thank you!!
