Hello, I'm using the Trainer to run a BERT model. My training dataset has 600k examples and my validation set has 60k. During training, the model stops at step 2500 and starts evaluating on the validation set. It then loops through the entire validation set, which triggers a CUDA out-of-memory error. Is there a way to have the model evaluate on only some batches of my validation set?
I checked all the parameters in TrainingArguments but had no luck, so I'm posting here for some help.
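To make the question concrete, here is the kind of workaround I have in mind. This is a rough sketch, assuming my validation data is a datasets.Dataset (so .shuffle() and .select() are available) and that model, training_args, and train_dataset are the objects already defined in my script:

from transformers import Trainer

# Keep only a random subset of the 60k validation examples (the size 5000 is arbitrary here).
eval_subset = val_dataset.shuffle(seed=42).select(range(5000))

trainer = Trainer(
    model=model,                  # my BERT model
    args=training_args,           # the TrainingArguments shown below
    train_dataset=train_dataset,  # the 600k-example training set
    eval_dataset=eval_subset,     # evaluate on the subset instead of the full 60k set
)

This would presumably work, but it feels like a hack rather than a proper setting; I'd prefer a TrainingArguments option that caps the number of evaluation batches, if one exists. Here are my current TrainingArguments: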
training_args = TrainingArguments(
    output_dir='./scm-bert-base-uncased-v12/',       # output directory
    num_train_epochs=1,                              # total number of training epochs
    per_device_train_batch_size=6,                   # batch size per device during training
    per_device_eval_batch_size=6,                    # batch size for evaluation
    warmup_steps=500,                                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                               # strength of weight decay
    logging_dir='./scm-bert-base-uncased-v12/logs',  # directory for storing logs
    load_best_model_at_end=True,                     # load the best model when finished training (default metric is loss),
                                                     # but you can specify `metric_for_best_model` to use accuracy or another metric
    logging_steps=2500,                              # log & save weights every `logging_steps`
    eval_steps=2500,
    save_steps=2500,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",                     # evaluate every `eval_steps`
    report_to="none"
)
Training (stops at the 2500th step to evaluate):
***** Running training *****
Num examples = 631296
Num Epochs = 1
Instantaneous batch size per device = 6
Total train batch size (w. parallel, distributed & accumulation) = 12
Gradient Accumulation steps = 2
Total optimization steps = 52608
5%|▍ | 2500/52608 [03:04<1:00:23, 13.83it/s]***** Running Evaluation *****
Evaluation (needs to loop through all 11,304 steps and fails partway through):
RuntimeError: CUDA out of memory. Tried to allocate 1.02 GiB (GPU 0; 31.75 GiB total capacity; 29.14 GiB already allocated; 1.02 GiB free; 29.30 GiB reserved in total by PyTorch)
52%|█████▏ | 5916/11304 [06:19<07:41, 11.67it/s]
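For what it's worth, the closest parameter I could find is eval_accumulation_steps, which, as I understand it, only moves the accumulated predictions to the CPU every N steps rather than limiting how many batches get evaluated. A sketch of how I would set it (same arguments as above, just one extra line), in case that is the right direction:

training_args = TrainingArguments(
    output_dir='./scm-bert-base-uncased-v12/',
    per_device_eval_batch_size=6,
    evaluation_strategy="steps",
    eval_steps=2500,
    eval_accumulation_steps=50,  # offload accumulated predictions to CPU every 50 eval steps (value is a guess)
    # ... remaining arguments as in the block above ...
)

Even if that avoided the OOM, it would still run through all 11,304 evaluation steps, which is what I'm trying to avoid.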