How to prevent too many checkpoints

I’m fine-tuning GPT-2 on a lot of data and ended up using all of my disk space on Google Drive. I’m not sure which arguments prevent saving so many checkpoints. Right now I’m getting a new checkpoint every 500 steps, but I’d like to avoid creating so many. At the very least I’d like to keep only the best checkpoint.

For my next fine-tuning run I’ll be using the following command (I’m waiting to regain access to a Colab GPU), but I’m not sure it will prevent the extra checkpoints:

!python gpt-2/ \
    --model_name_or_path gpt2 \
    --train_file alignment_texts_87606.csv \
    --do_train \
    --fp16 \
    --overwrite_cache \
    --overwrite_output_dir \
    --num_train_epochs 1 \
    --per_device_train_batch_size=2 \
    --output_dir gpt-2/tmp/alignment-texts-clm

I don’t think it will. I need something to limit the number of checkpoints; I don’t even know where the 500-step checkpoint interval comes from.

Thanks for the help!


You can use save_total_limit as an argument. Note that this will just delete older checkpoints, not necessarily the worst ones!
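As a sketch of how that could look in the command from the question (the script name run_clm.py is an assumption; --save_steps and --save_total_limit are standard Trainer arguments, and 500 is the Trainer's default --save_steps value):

```shell
# Illustrative values; run_clm.py is assumed, not stated in the thread.
!python gpt-2/run_clm.py \
    --model_name_or_path gpt2 \
    --train_file alignment_texts_87606.csv \
    --do_train \
    --save_steps 5000 \
    --save_total_limit 2 \
    --output_dir gpt-2/tmp/alignment-texts-clm
```

Here --save_steps 5000 checkpoints less often than the default of 500, and --save_total_limit 2 keeps at most two checkpoints on disk, deleting the oldest first.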


How would I make sure to keep the best model?

You need to pass along --load_best_model_at_end then, so the Trainer keeps track of the best model and does not delete it.
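In TrainingArguments terms, a minimal sketch (values are illustrative; load_best_model_at_end requires evaluation, so an eval schedule matching the save schedule is assumed):

```python
from transformers import TrainingArguments

# Illustrative config, not the exact setup from the thread.
args = TrainingArguments(
    output_dir="gpt-2/tmp/alignment-texts-clm",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    evaluation_strategy="steps",   # load_best_model_at_end needs evaluation
    eval_steps=500,
    save_steps=500,                # save and eval intervals must line up
    save_total_limit=2,            # keep at most 2 checkpoints on disk
    load_best_model_at_end=True,   # the best checkpoint is never deleted
    metric_for_best_model="loss",  # lower eval loss = better
    greater_is_better=False,
)
```

With this combination, save_total_limit still prunes old checkpoints, but the Trainer exempts the best one from deletion.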


@sgugger @BramVanroy @JacquesThibs How can we implement lazy loading of data in RAM while training the model from scratch?

The 🤗 Datasets DatasetDict format and its map method (used to apply functions such as tokenization and grouping) are designed to run in batches, so they can handle data of any size. To work with a large dataset, convert it to the Datasets format and process it with map.