Prohibitively large RAM consumption on Trainer validation

This is not a question but rather an issue I wanted to share, as it took me a while to debug.
I am training an image segmentation model on a large dataset with Trainer.
The images are rather big (1024x1024), I have many classes, and the validation set is 500 examples. During validation (after the prediction step is over), the machine freezes for 5-10 minutes and then either continues or crashes for no apparent reason.

After hours of digging, I figured out the reason.

Here is my RAM consumption graph:

The reason for this is that Trainer accumulates the predictions (logits and labels) for ALL examples in memory before computing the metrics.

As you can imagine, holding a 1024x1024x100x500 logits tensor and a 1024x1024x500 labels tensor takes quite a bit of memory.
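To put rough numbers on it, here is a back-of-the-envelope sketch, assuming float32 logits and int64 labels (the usual PyTorch defaults):

```python
# Rough RAM estimate for accumulating all eval predictions at once.
H = W = 1024          # image height/width
num_classes = 100     # roughly 100 classes
num_samples = 500     # validation set size

logits_bytes = H * W * num_classes * num_samples * 4   # float32 = 4 bytes
labels_bytes = H * W * num_samples * 8                 # int64   = 8 bytes

print(f"logits: {logits_bytes / 1024**3:.0f} GiB")     # ~195 GiB
print(f"labels: {labels_bytes / 1024**3:.1f} GiB")     # ~3.9 GiB
```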

I passed eval_do_concat_batches=False, because otherwise the repeated concatenation after every batch kills performance and eats RAM even faster.
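For reference, a minimal sketch of how that flag is passed (available in recent transformers versions; output_dir and the batch size are placeholder values):

```python
from transformers import TrainingArguments

# Keep per-batch outputs as lists instead of concatenating them after every
# eval step. This avoids the repeated-concat slowdown, but the full set of
# logits is still held in RAM until metrics are computed.
args = TrainingArguments(
    output_dir="out",                # placeholder path
    per_device_eval_batch_size=2,    # illustrative value
    eval_do_concat_batches=False,
)
```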

As far as I can tell, there is no built-in way to evaluate "per-sample" and then average the metrics, as is common in computer vision. I hope this thread saves someone hours of debugging.

Solution in this PR
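In the meantime, one possible workaround (not necessarily what the PR implements) is Trainer's preprocess_logits_for_metrics hook, which lets you shrink the logits before they are cached, e.g. to an argmax class map, cutting the accumulated tensor by roughly a factor of num_classes. A minimal sketch; the model, dataset, and ignore index are placeholders:

```python
from transformers import Trainer, TrainingArguments

def preprocess_logits_for_metrics(logits, labels):
    # Reduce (B, num_classes, H, W) logits to (B, H, W) class indices
    # *before* Trainer caches them, cutting eval RAM by ~num_classes x.
    # Assumes logits and labels share spatial size; upsample first if not.
    return logits.argmax(dim=1)

def compute_metrics(eval_pred):
    preds = eval_pred.predictions    # already argmax'ed, shape (N, H, W)
    labels = eval_pred.label_ids     # shape (N, H, W)
    valid = labels != 255            # 255 = assumed ignore index, adjust to yours
    accuracy = (preds[valid] == labels[valid]).mean()
    return {"pixel_accuracy": float(accuracy)}

trainer = Trainer(
    model=model,                                 # placeholder: your segmentation model
    args=TrainingArguments(output_dir="out"),    # placeholder args
    eval_dataset=val_dataset,                    # placeholder: your validation set
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)
```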

