How to reduce cache usage during training

Hey, I am using the Hugging Face Trainer right now, and I notice that every time I finish training with trainer.train(), the cache occupies a certain amount of my memory. You can see this in the yellow bar.

Here is an example of my args:

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir=save_folder,
    overwrite_output_dir=True,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_total_limit=3,
    num_train_epochs=70,
    predict_with_generate=True,
    fp16=True,
    report_to="none",
    load_best_model_at_end=True,
    seed=65,
    generation_config=gen_config,
    # generation_max_length=128,
    # generation_num_beams=10,
    metric_for_best_model="bleu",
)

Although I posted Seq2SeqTrainingArguments here, the same thing happens with TrainingArguments.
How can I reduce this cache? It piles up when I experiment with multiple models and eventually causes out-of-memory errors.

THANKS!

@sgugger I am sorry to ping you, sir, but could you please help me with this?

I’m not 100% certain what you mean.

If you are concerned about the disk space that gets filled up with cache files, there are two options: you can use Dataset.cleanup_cache_files() to remove the cache files associated with a particular dataset object, or you can manually rm -rf the cache directory (often somewhere under ~/.cache/huggingface/datasets/). I suggest reading the article about the Datasets cache here.
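As a minimal sketch of the first option (the dataset name below is only an illustration, not from your setup):

from datasets import load_dataset

# Load (or reload) the dataset whose cache files you want to remove
ds = load_dataset("opus_books", "en-fr", split="train")

# Deletes the Arrow cache files created for this dataset object
# (e.g. by .map()/.filter() calls) and returns how many were removed
n_removed = ds.cleanup_cache_files()
print(f"Removed {n_removed} cache files")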

If you are talking about RAM or L1/L2/L3 cache: once the training process has terminated, the OS should have freed the memory associated with that process. If a training run has failed improperly, it could still be hogging memory, in which case you would have to identify the process and kill -9 it. If you are trying to run many different experiments at the same time, unfortunately, your system might simply not have enough memory for that. You can also try using an IterableDataset instead of a Dataset, as it has a lighter footprint (see the sketch below).
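A rough sketch of that last suggestion, assuming you load your data with the datasets library (again, the dataset name is only an example):

from datasets import load_dataset

# streaming=True yields an IterableDataset: examples are read on the fly
# instead of being materialized into Arrow cache files on disk
streamed = load_dataset("opus_books", "en-fr", split="train", streaming=True)

for example in streamed.take(3):
    print(example)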

Thanks for replying. Sorry, but I am not facing either of the situations you describe. Let me explain more. The image I attached came from the htop command. I am using a remote machine (Proxmox) that has enough memory (32 GB). What I see is that every time the model's training finishes with trainer.train(), the memory bar fills up with cache (the yellow portion). You can also see from the image that the memory usage was only 3.84G/32G, but because of the yellow portion it looks almost full. I did not have failed experiments and did not run multiple experiments, so I don't understand this either. One thing I found is that if I delete the saved model folder specified by output_dir, the yellow bar goes away.
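For reference, one way to read the same numbers outside htop would be something like the following (this assumes psutil is available; it is not part of my training code above):

import psutil

# On Linux, the "cached" figure is the kernel page cache (the yellow bar in htop).
# It is reclaimable memory, separate from what the process itself has allocated.
vm = psutil.virtual_memory()
print(f"used:      {vm.used / 2**30:.2f} GiB")
print(f"cached:    {vm.cached / 2**30:.2f} GiB")
print(f"available: {vm.available / 2**30:.2f} GiB")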

Ah okay, thanks for the clarification.

Since training is complete, would

import gc

# Drop the reference to the Trainer (and the model it holds), then force
# a garbage collection pass so the memory is released immediately
del trainer
gc.collect()

work for you?
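If GPU memory is also a concern, a common companion step (assuming a PyTorch backend; this is a sketch, not something from the original suggestion) would be:

import gc
import torch

del trainer
gc.collect()
# Release cached blocks held by PyTorch's CUDA allocator back to the driver
torch.cuda.empty_cache()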


I have not tried that one. THANKS, I will try it!