Questions about Checkpoints within HF

Hello guys,

I have a few (probably) stupid questions about the HF checkpoint system.

  1. While training I set save_total_limit to 3 and forgot to use load_best_model_at_end. Is the last checkpoint the best result? Or do I have to validate all 3 checkpoints on my own to find out which one did best?

  2. My model sometimes crashes because it runs out of CUDA memory (sometimes after 8 hours, sometimes it runs without a problem). If I set overwrite_output_dir, could I just run the same command a few times in a row to make sure the training finishes?

! python fine_tune.py [.. args .. ]
! gsutil -> save results 
! python fine_tune.py [.. args .. ]
! gsutil -> save results 
! python fine_tune.py [.. args .. ]
! gsutil -> save results 
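
To make question 2 more concrete, here is a rough sketch of the retry idea in Python instead of notebook cells. It assumes the Trainer writes its checkpoints as checkpoint-<step> folders inside the output directory. find_latest_checkpoint and run_until_done are helper names I made up for illustration, not part of transformers. As far as I understand, transformers already ships get_last_checkpoint in transformers.trainer_utils and Trainer.train accepts resume_from_checkpoint, so the script itself would probably have to use those for a rerun to continue instead of starting from scratch.

```python
import re
import subprocess
from pathlib import Path

# Trainer checkpoint folders are named "checkpoint-<global_step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir):
    """Return the checkpoint folder with the highest step, or None if there is none."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            # Compare steps as integers so checkpoint-1000 beats checkpoint-500.
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


def run_until_done(command, max_attempts=3):
    """Run `command` (a list of strings) until it exits with code 0.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
    return None
```

run_until_done would get the same fine_tune.py command as above, split into a list of strings. The retries only help if the script picks up the folder returned by find_latest_checkpoint (or the library equivalent) on each new attempt.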

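For question 1, this is a sketch of how I would try to compare the 3 remaining checkpoints without re-running validation. It assumes each checkpoint folder contains a trainer_state.json whose log_history has an eval_loss entry logged at the same step the checkpoint was saved, which is only the case if evaluation ran during training at the save steps. The function names are made up for illustration, and lower eval_loss is assumed to mean "better".

```python
import json
from pathlib import Path


def eval_loss_per_checkpoint(output_dir):
    """Map each checkpoint folder name to the eval_loss logged at its step.

    Checkpoints without a trainer_state.json, or without an eval_loss
    entry for their own step, are skipped.
    """
    results = {}
    for ckpt in Path(output_dir).glob("checkpoint-*"):
        state_file = ckpt / "trainer_state.json"
        if not state_file.is_file():
            continue
        state = json.loads(state_file.read_text())
        step = state.get("global_step")
        for entry in state.get("log_history", []):
            if entry.get("step") == step and "eval_loss" in entry:
                results[ckpt.name] = entry["eval_loss"]
    return results


def best_checkpoint(output_dir):
    """Return the folder name with the lowest eval_loss, or None if nothing was logged."""
    losses = eval_loss_per_checkpoint(output_dir)
    if not losses:
        return None
    return min(losses, key=losses.get)
```

If best_checkpoint returns None, no evaluation was logged at the save steps, and I guess I would have to evaluate each checkpoint by hand after all.
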
Thanks in advance