How to prevent too many checkpoints

I’m fine-tuning GPT-2 on a lot of data and ended up using all of my disk space on Google Drive. I’m not sure which arguments prevent saving so many checkpoints. Right now I’m getting a new checkpoint every 500 steps, but I’d like to avoid creating so many. At the very least I’d like to keep only the best checkpoint.

For my next fine-tuning run I’ll be using the following command (I’m waiting to regain access to a Colab GPU), but I’m not sure it will prevent the extra checkpoints:

!python gpt-2/ \
    --model_name_or_path gpt2 \
    --train_file alignment_texts_87606.csv \
    --do_train \
    --fp16 \
    --overwrite_cache \
    --overwrite_output_dir \
    --num_train_epochs 1 \
    --per_device_train_batch_size=2 \
    --output_dir gpt-2/tmp/alignment-texts-clm

I don’t think it will. I need something to limit the number of checkpoints; I don’t even know where the 500-step checkpoint interval comes from.

Thanks for the help!


You can use save_total_limit as an argument. Note that this will just delete older checkpoints, not necessarily the worst ones!
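As a sketch of how that could look in the command from the question (the script name run_clm.py is an assumption; --save_steps and --save_total_limit are standard Trainer arguments, and 500 is the Trainer's default --save_steps value):

```shell
# Illustrative values; run_clm.py is assumed, not stated in the thread.
!python gpt-2/run_clm.py \
    --model_name_or_path gpt2 \
    --train_file alignment_texts_87606.csv \
    --do_train \
    --save_steps 5000 \
    --save_total_limit 2 \
    --output_dir gpt-2/tmp/alignment-texts-clm
```

Here --save_steps 5000 checkpoints less often than the default of 500, and --save_total_limit 2 keeps at most two checkpoints on disk, deleting the oldest first.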


How would I make sure to keep the best model?

You need to pass along --load_best_model_at_end then, so the Trainer keeps track of the best model and does not delete it.
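In TrainingArguments terms, a minimal sketch (values are illustrative; load_best_model_at_end requires evaluation, so an eval schedule matching the save schedule is assumed):

```python
from transformers import TrainingArguments

# Illustrative config, not the exact setup from the thread.
args = TrainingArguments(
    output_dir="gpt-2/tmp/alignment-texts-clm",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    evaluation_strategy="steps",   # load_best_model_at_end needs evaluation
    eval_steps=500,
    save_steps=500,                # save and eval intervals must line up
    save_total_limit=2,            # keep at most 2 checkpoints on disk
    load_best_model_at_end=True,   # the best checkpoint is never deleted
    metric_for_best_model="loss",  # lower eval loss = better
    greater_is_better=False,
)
```

With this combination, save_total_limit still prunes old checkpoints, but the Trainer exempts the best one from deletion.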


@sgugger @BramVanroy @JacquesThibs How can we implement lazy loading of data in RAM while training the model from scratch?

The 🤗 Datasets DatasetDict format and its map method (used to apply functions such as tokenization and grouping) are designed to run in batches, so they can handle data of any size. To work with a large dataset, convert it to the Datasets format and process it with map.