This is not a question but rather an issue I wanted to share, since it took me a while to debug.
I am training an image segmentation model on a large dataset with Trainer.
The images are fairly large (1024x1024), I have many classes, and the validation set has 500 examples. Once validation starts (after the prediction step is over), the machine freezes for 5-10 minutes and then either continues or crashes for no apparent reason.
After hours of digging, I figured out the reason.
Here is my RAM consumption graph:
The reason is that Trainer first predicts ALL examples and stores every prediction before computing metrics.
As you can imagine, creating a 1024x1024x100x500 logits tensor and a 1024x1024x500 labels tensor takes a lot of memory.
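To put numbers on it, here is a quick back-of-the-envelope check, assuming float32 logits and int64 labels (the dtypes are my assumption, not stated above):

```python
# Rough memory estimate for storing all validation outputs at once.
H = W = 1024   # image size
C = 100        # number of classes
N = 500        # validation examples

logits_bytes = N * C * H * W * 4   # float32 = 4 bytes per element
labels_bytes = N * H * W * 8       # int64   = 8 bytes per element

print(f"logits: {logits_bytes / 1024**3:.1f} GiB")  # ~195 GiB
print(f"labels: {labels_bytes / 1024**3:.1f} GiB")  # ~3.9 GiB
```

Roughly 200 GiB just for the logits, which easily explains the freeze.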
I have passed `eval_do_concat_batches=False`, otherwise the repeated concatenation on every batch kills performance and eats RAM even faster.
As far as I have searched, there is no built-in way to do evaluation per-sample and then average the metrics, as is common in computer vision. I hope this thread saves someone hours of debugging.
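One mitigation worth knowing about: Trainer accepts a `preprocess_logits_for_metrics` callable that runs on each batch before the outputs are accumulated, so you can shrink the logits (e.g. to argmax class indices) and cut storage by a factor of roughly `num_classes * 4` bytes per pixel. A minimal sketch below; it uses NumPy arrays as stand-ins for the torch tensors Trainer actually passes, and the toy shapes are illustrative. (Newer transformers versions also have a `batch_eval_metrics` training argument that computes metrics per batch, which may address this more directly; I have not verified it for this setup.)

```python
import numpy as np

def preprocess_logits_for_metrics(logits, labels):
    # Reduce (N, C, H, W) float32 logits to (N, H, W) class indices
    # before they are accumulated for metric computation.
    # With the real Trainer these are torch tensors: use torch.argmax(logits, dim=1).
    return np.argmax(logits, axis=1)

# Toy batch: 2 images, 100 classes, 4x4 "pixels" (stand-in for 1024x1024).
logits = np.random.randn(2, 100, 4, 4).astype(np.float32)
preds = preprocess_logits_for_metrics(logits, labels=None)

print(preds.shape)                   # (2, 4, 4) -- class-index map per image
print(logits.nbytes // preds.nbytes) # 50x less memory (float32 x 100 -> int64 x 1)
```

With this in place, Trainer stores one integer per pixel instead of 100 floats, which in the 1024x1024x100x500 case above brings the logits storage from ~200 GiB down to a few GiB.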