How to prevent too many checkpoints with run_clm.py?

I’m fine-tuning GPT-2 on a lot of data and ended up using all of my disk space on Google Drive. I’m not sure which args prevent too many checkpoints. Right now I get a new checkpoint every 500 steps, but I’d like to avoid creating so many checkpoints, or at the very least keep only the best one.

For my next fine-tuning run I’ll use the following command (I’m waiting to regain access to a Colab GPU), but I’m not sure it will prevent the extra checkpoints:

!python gpt-2/run_clm.py \
    --model_name_or_path gpt2 \
    --train_file alignment_texts_87606.csv \
    --do_train \
    --fp16 \
    --overwrite_cache \
    --overwrite_output_dir \
    --num_train_epochs 1 \
    --per_device_train_batch_size=2 \
    --output_dir gpt-2/tmp/alignment-texts-clm

I don’t think it will. I need something to limit the number of checkpoints, and I don’t even know where the number 500 comes from.

Thanks for the help!

You can use save_total_limit as an argument. Note that this just deletes the oldest checkpoints, not necessarily the worst ones!
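The 500, by the way, comes from the Trainer’s defaults: save_strategy is "steps" with save_steps=500. As an illustration (the values 2 and 2000 here are arbitrary, and the ... stands for the rest of your flags), something like this keeps at most two recent checkpoints on disk and saves less often:

!python gpt-2/run_clm.py \
    ... \
    --save_total_limit 2 \
    --save_steps 2000 \
    --output_dir gpt-2/tmp/alignment-texts-clm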

How would I make sure to keep the best model?

You then also need to pass --load_best_model_at_end, so the Trainer keeps track of the best model and does not delete it.
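For that flag to do anything, the Trainer has to evaluate during training, and the evaluation and save strategies (and step intervals) have to match. A sketch of the relevant flags for run_clm.py (the validation file name is a placeholder and ... again stands for the rest of your arguments; with metric_for_best_model left unset, “best” defaults to lowest eval loss):

!python gpt-2/run_clm.py \
    ... \
    --validation_file alignment_texts_val.csv \
    --do_eval \
    --evaluation_strategy steps \
    --eval_steps 500 \
    --save_steps 500 \
    --save_total_limit 2 \
    --load_best_model_at_end \
    --output_dir gpt-2/tmp/alignment-texts-clm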

@sgugger @BramVanroy @JacquesThibs How can we implement lazy loading of data in RAM while training the model from scratch?
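One way to sketch this: the datasets library has a streaming mode that reads examples lazily instead of materializing the whole dataset in RAM. A minimal example, assuming a CSV file with a "text" column (the file name and column name are placeholders for your own data):

from datasets import load_dataset
from transformers import AutoTokenizer

# streaming=True reads examples on demand instead of loading the file into RAM.
# "train.csv" and the "text" column are placeholders for your own data.
streamed = load_dataset("csv", data_files={"train": "train.csv"}, streaming=True)

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# map() on a streaming dataset is also lazy: tokenization happens batch by
# batch as the training loop consumes examples.
train_dataset = streamed["train"].map(tokenize, batched=True, remove_columns=["text"])

You can then pass train_dataset to a Trainer, but note that you have to set max_steps in TrainingArguments, since a streamed dataset has no length and the Trainer can’t infer the number of steps per epoch.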