Hello guys,
I have a few (probably) stupid questions about the HF checkpoint system.
- While training I set `save_total_limit` to 3 and forgot to use `load_best_model_at_end`. Is the last checkpoint the best result? Or do I have to validate all 3 checkpoints on my own to find out which one did best?
- I have some problems with my model randomly crashing because I run out of CUDA memory (sometimes after 8 hours, sometimes it runs without a problem). If I set `overwrite_output_dir`, could I just run the same command a few times in a row to make sure the training finishes? Something like this:
! python fine_tune.py [.. args .. ]
! gsutil -> save results
! python fine_tune.py [.. args .. ]
! gsutil -> save results
! python fine_tune.py [.. args .. ]
! gsutil -> save results
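To make the idea more concrete, here is a rough sketch of the same thing as a small Python retry loop instead of copy-pasted cells. The helper name `run_with_retries` and the `max_attempts` parameter are made up for illustration, and the arguments of `fine_tune.py` are left out as above. Whether a rerun actually continues from the last checkpoint or starts over depends on how `fine_tune.py` handles the output directory, which is exactly what I am unsure about.

```python
import subprocess
import sys


def run_with_retries(cmd, max_attempts=3):
    """Run `cmd` until it exits with code 0 or `max_attempts` is reached.

    Returns the number of the attempt that succeeded, or None if all failed.
    (Hypothetical helper, not part of any library.)
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        # A crash (e.g. CUDA out of memory) gives a non-zero exit code.
        print(
            f"attempt {attempt} failed with exit code {result.returncode}",
            file=sys.stderr,
        )
    return None


# Intended usage (the script arguments are placeholders, not filled in here):
#   run_with_retries([sys.executable, "fine_tune.py"], max_attempts=3)
# followed by the gsutil step to save the results.
```

This only stops early once a run finishes cleanly, so it would not waste time rerunning a training that already succeeded.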
Thanks in advance!