I'm currently getting an insufficient GPU memory error with the config below, training on 8 x V100 GPUs.
It doesn’t appear immediately, though. It shows up non-deterministically, far into training, which points to a memory leak somewhere. Do you have any tips or ideas on how to approach this?
training_args = TrainingArguments(