Prohibitively large RAM consumption on Trainer validation

This is not a question but rather an issue I wanted to share, as it took me a while to debug.
I am training an image segmentation model on a large dataset with Trainer.
The images are rather big (1024x1024), I have many classes, and the validation set is 500 examples. During validation (after the prediction step is over), the machine freezes for 5-10 minutes and then either continues or crashes for no apparent reason.

After hours of digging, I figured out the reason.

Here is my RAM consumption graph:

The reason for this is that Trainer accumulates the predictions (logits and labels) for ALL examples in memory before computing the metrics.

As you can imagine, holding a 1024x1024x100x500 logits tensor and a 1024x1024x500 labels tensor takes quite a bit of memory.
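To put rough numbers on it, here is a back-of-the-envelope sketch, assuming float32 logits and int64 labels (the usual PyTorch defaults):

```python
# Rough RAM estimate for accumulating all eval predictions at once.
H = W = 1024          # image height/width
num_classes = 100     # roughly 100 classes
num_samples = 500     # validation set size

logits_bytes = H * W * num_classes * num_samples * 4   # float32 = 4 bytes
labels_bytes = H * W * num_samples * 8                 # int64   = 8 bytes

print(f"logits: {logits_bytes / 1024**3:.0f} GiB")     # ~195 GiB
print(f"labels: {labels_bytes / 1024**3:.1f} GiB")     # ~3.9 GiB
```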

I passed eval_do_concat_batches=False, because otherwise the repeated concatenation after every batch kills performance and eats RAM even faster.
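For reference, a minimal sketch of how that flag is passed (available in recent transformers versions; output_dir and the batch size are placeholder values):

```python
from transformers import TrainingArguments

# Keep per-batch outputs as lists instead of concatenating them after every
# eval step. This avoids the repeated-concat slowdown, but the full set of
# logits is still held in RAM until metrics are computed.
args = TrainingArguments(
    output_dir="out",                # placeholder path
    per_device_eval_batch_size=2,    # illustrative value
    eval_do_concat_batches=False,
)
```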

As far as I can tell, there is no built-in way to evaluate "per-sample" and then average the metrics, as is common in computer vision. I hope this thread saves someone hours of debugging.

Solution in this PR
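In the meantime, one possible workaround (not necessarily what the PR implements) is Trainer's preprocess_logits_for_metrics hook, which lets you shrink the logits before they are cached, e.g. to an argmax class map, cutting the accumulated tensor by roughly a factor of num_classes. A minimal sketch; the model, dataset, and ignore index are placeholders:

```python
from transformers import Trainer, TrainingArguments

def preprocess_logits_for_metrics(logits, labels):
    # Reduce (B, num_classes, H, W) logits to (B, H, W) class indices
    # *before* Trainer caches them, cutting eval RAM by ~num_classes x.
    # Assumes logits and labels share spatial size; upsample first if not.
    return logits.argmax(dim=1)

def compute_metrics(eval_pred):
    preds = eval_pred.predictions    # already argmax'ed, shape (N, H, W)
    labels = eval_pred.label_ids     # shape (N, H, W)
    valid = labels != 255            # 255 = assumed ignore index, adjust to yours
    accuracy = (preds[valid] == labels[valid]).mean()
    return {"pixel_accuracy": float(accuracy)}

trainer = Trainer(
    model=model,                                 # placeholder: your segmentation model
    args=TrainingArguments(output_dir="out"),    # placeholder args
    eval_dataset=val_dataset,                    # placeholder: your validation set
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)
```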

