I ran evaluation after resuming from a checkpoint, but I got an OOM error
Here is my command line:
--output_dir=./pretrained-bert-c4-bpe-fp32-workers8 \
--resume_checkpoint="./pretrained-bert-c4-bpe-fp32-workers8/checkpoint-1000000" \
--overwrite_output_dir=True \
--max_steps=1000000 \
--warmup_steps=10000 \
--learning_rate=1e-04 \
--per_device_train_batch_size=8 \
--per_device_eval_batch_size=8 \
--gradient_accumulation_steps=1 \
--logging_steps=100 \
--save_steps=10000 \
--save_total_limit=1 \
--evaluation_strategy steps \
--eval_steps 1000 \
--ignore_data_skip
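For reference, this is roughly the TrainingArguments object those flags parse into; a minimal sketch, assuming the script feeds them through HfArgumentParser (--resume_checkpoint is a custom field on my model args, not a TrainingArguments field; it is passed to Trainer.train() as resume_from_checkpoint, as the traceback below shows):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./pretrained-bert-c4-bpe-fp32-workers8",
    overwrite_output_dir=True,
    max_steps=1_000_000,
    warmup_steps=10_000,
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    logging_steps=100,
    save_steps=10_000,
    save_total_limit=1,
    evaluation_strategy="steps",
    eval_steps=1000,
    ignore_data_skip=True,
)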
Here is the error log:
 4%|▍ | 3826/86776 [40:46<14:39:15, 1.57it/s]
Traceback (most recent call last):
File "pretrain_c4_megatron_bpe_moe.py", line 178, in <module>
train_model()
File "pretrain_c4_megatron_bpe_moe.py", line 174, in train_model
trainer.train(resume_from_checkpoint=model_args.resume_checkpoint)
File "/workspace/transformers/src/transformers/trainer.py", line 1539, in train
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/workspace/transformers/src/transformers/trainer.py", line 2323, in evaluate
output = eval_loop(
File "/workspace/transformers/src/transformers/trainer.py", line 2528, in evaluation_loop
logits = self._nested_gather(logits)
File "/workspace/transformers/src/transformers/trainer.py", line 2661, in _nested_gather
tensors = distributed_concat(tensors)
File "/workspace/transformers/src/transformers/trainer_pt_utils.py", line 171, in distributed_concat
return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
File "/workspace/transformers/src/transformers/trainer_pt_utils.py", line 171, in <genexpr>
return type(tensor)(distributed_concat(t, num_total_examples) for t in tensor)
File "/workspace/transformers/src/transformers/trainer_pt_utils.py", line 174, in distributed_concat
dist.all_gather(output_tensors, tensor)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2083, in all_gather
work = default_pg.allgather([tensor_list], [tensor], opts)
RuntimeError: CUDA out of memory. Tried to allocate 3.73 GiB (GPU 1; 15.78 GiB total capacity; 8.16 GiB already allocated; 797.75 MiB free; 13.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
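If I am reading the traceback right, evaluation_loop holds the full MLM logits on every GPU (batch x seq_len x vocab_size in fp32), and dist.all_gather then allocates one such buffer per rank on top of that, which is what runs out of memory. Below is a sketch of the workaround I want to try next; it assumes my transformers version already accepts the preprocess_logits_for_metrics callable, and that my compute_metrics only needs argmax predictions (both are assumptions):

from transformers import Trainer

def preprocess_logits_for_metrics(logits, labels):
    # Shrink the per-step (batch, seq_len, vocab_size) fp32 logits down to
    # (batch, seq_len) int64 token ids *before* the Trainer accumulates
    # and all_gathers them across GPUs.
    if isinstance(logits, tuple):
        logits = logits[0]  # some models return extra tensors after the logits
    return logits.argmax(dim=-1)

trainer = Trainer(
    model=model,                      # model/datasets/metrics come from my script
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)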
I also tried adding --eval_accumulation_steps=1, but then eval becomes extremely slow (18.31 s/it vs 1.57 it/s):
 9%|▉ | 7937/86776 [8:26:04<400:55:34, 18.31s/it]
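As far as I understand it, --eval_accumulation_steps=1 is this slow because the accumulated predictions are offloaded to the CPU after every single eval step. If only the eval loss matters during pretraining, the cheaper fallback should be to skip gathering the logits altogether (assuming prediction_loss_only behaves the way I think it does) by adding:

--prediction_loss_only=True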