How to reduce cache usage during training

Hey, I am using the Hugging Face Trainer right now, and I notice that every time I finish training with trainer.train(), the cache occupies a certain amount of my memory. You can see this in the yellow bar.

Here is an example of my args:

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir=save_folder,
    overwrite_output_dir=True,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_total_limit=3,
    num_train_epochs=70,
    predict_with_generate=True,
    fp16=True,
    report_to="none",
    load_best_model_at_end=True,
    seed=65,
    generation_config=gen_config,
    # generation_max_length=128,
    # generation_num_beams=10,
    metric_for_best_model="bleu",
)

Although I posted Seq2SeqTrainingArguments here, the same thing happens with TrainingArguments.
How can I reduce this cache? It piles up when I experiment with multiple models and eventually causes out-of-memory errors.

THANKS!

@sgugger I am sorry to ping you, sir, but could you please help me with this?

I’m not 100% certain what you mean.

If you are concerned about the disk space that gets filled up with cache files, there are two options: you can use Dataset.cleanup_cache_files() to remove the cache files associated with a particular dataset object, or you can manually rm -rf the cache directory (often somewhere under ~/.cache/huggingface/datasets/). I suggest reading the article about the Datasets cache here.
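As a minimal sketch of the first option (the dataset name below is only an illustration, not from your setup):

from datasets import load_dataset

# Load (or reload) the dataset whose cache files you want to remove
ds = load_dataset("opus_books", "en-fr", split="train")

# Deletes the Arrow cache files created for this dataset object
# (e.g. by .map()/.filter() calls) and returns how many were removed
n_removed = ds.cleanup_cache_files()
print(f"Removed {n_removed} cache files")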

If you are talking about RAM or L1/L2/L3 cache: once the training process has terminated, the OS should have freed the memory associated with that process. If a training run has failed improperly, it could still be hogging memory, in which case you would have to identify the process and kill -9 it. If you are trying to run many different experiments at the same time, unfortunately, your system might simply not have enough memory for that. You can also try using an IterableDataset instead of a Dataset, as it has a lighter footprint (see the sketch below).
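A rough sketch of that last suggestion, assuming you load your data with the datasets library (again, the dataset name is only an example):

from datasets import load_dataset

# streaming=True yields an IterableDataset: examples are read on the fly
# instead of being materialized into Arrow cache files on disk
streamed = load_dataset("opus_books", "en-fr", split="train", streaming=True)

for example in streamed.take(3):
    print(example)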

Thanks for replying. Sorry, but I am not facing either of the situations you describe. Let me explain more. The image I attached came from the htop command. I am using a remote machine (Proxmox) that has enough memory (32 GB). What I see is that every time the model's training finishes with trainer.train(), the memory bar fills up with cache (the yellow portion). You can also see from the image that the memory usage was only 3.84G/32G, but because of the yellow portion it looks almost full. I did not have failed experiments and did not run multiple experiments, so I don't understand this either. One thing I found is that if I delete the saved model folder specified by output_dir, the yellow bar goes away.
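For reference, one way to read the same numbers outside htop would be something like the following (this assumes psutil is available; it is not part of my training code above):

import psutil

# On Linux, the "cached" figure is the kernel page cache (the yellow bar in htop).
# It is reclaimable memory, separate from what the process itself has allocated.
vm = psutil.virtual_memory()
print(f"used:      {vm.used / 2**30:.2f} GiB")
print(f"cached:    {vm.cached / 2**30:.2f} GiB")
print(f"available: {vm.available / 2**30:.2f} GiB")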

Ah okay, thanks for the clarification.

Since training is complete, would

import gc

# Drop the reference to the Trainer (and the model it holds), then force
# a garbage collection pass so the memory is released immediately
del trainer
gc.collect()

work for you?
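If GPU memory is also a concern, a common companion step (assuming a PyTorch backend; this is a sketch, not something from the original suggestion) would be:

import gc
import torch

del trainer
gc.collect()
# Release cached blocks held by PyTorch's CUDA allocator back to the driver
torch.cuda.empty_cache()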


I have not tried that one. THANKS, I will try it!