Why is the tensor produced by inference so big?

I started with a pretrained model

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2).to(0)

then I fine-tuned it, and to do inference I tokenized some text

inputs = tokenizer(text[0:8], return_tensors="pt", padding="max_length", truncation=True).to(0)

where I am taking the first 8 strings out of a list. If I take many more than that, inference crashes with an out-of-memory error.

If I print out the memory summary with torch.cuda.memory_summary() I get

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   1296 MiB |   6376 MiB |   2387 GiB |   2386 GiB |
|       from large pool |   1294 MiB |   6374 MiB |   2387 GiB |   2385 GiB |
|       from small pool |      1 MiB |      2 MiB |      0 GiB |      0 GiB |
|---------------------------------------------------------------------------|
| Active memory         |   1296 MiB |   6376 MiB |   2387 GiB |   2386 GiB |
|       from large pool |   1294 MiB |   6374 MiB |   2387 GiB |   2385 GiB |
|       from small pool |      1 MiB |      2 MiB |      0 GiB |      0 GiB |
|---------------------------------------------------------------------------|
| Requested memory      |   1255 MiB |   6331 MiB |   2381 GiB |   2380 GiB |
|       from large pool |   1254 MiB |   6329 MiB |   2381 GiB |   2380 GiB |
|       from small pool |      1 MiB |      2 MiB |      0 GiB |      0 GiB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   6586 MiB |   6586 MiB |   6586 MiB |      0 B   |
|       from large pool |   6582 MiB |   6582 MiB |   6582 MiB |      0 B   |
|       from small pool |      4 MiB |      4 MiB |      4 MiB |      0 B   |
|---------------------------------------------------------------------------|
| Non-releasable memory | 196471 KiB | 606224 KiB | 761869 MiB | 761677 MiB |
|       from large pool | 195996 KiB | 603804 KiB | 761322 MiB | 761131 MiB |
|       from small pool |    475 KiB |   2567 KiB |    546 MiB |    546 MiB |
|---------------------------------------------------------------------------|
| Allocations           |     610    |     856    |  140516    |  139906    |
|       from large pool |     227    |     413    |   92010    |   91783    |
|       from small pool |     383    |     522    |   48506    |   48123    |
|---------------------------------------------------------------------------|
| Active allocs         |     610    |     856    |  140516    |  139906    |
|       from large pool |     227    |     413    |   92010    |   91783    |
|       from small pool |     383    |     522    |   48506    |   48123    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     202    |     202    |     202    |       0    |
|       from large pool |     200    |     200    |     200    |       0    |
|       from small pool |       2    |       2    |       2    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      42    |      69    |   46836    |   46794    |
|       from large pool |      34    |      49    |   36752    |   36718    |
|       from small pool |       8    |      25    |   10084    |   10076    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|

then I do the inference

result = model(**inputs).logits

and the result is a tiny 8×2 matrix, just 16 floats

tensor([[0.3936, 0.1505],
        [0.4990, 0.4486],
        [0.4969, 0.2942],
        [0.7494, 0.1412],
        [0.4528, 0.1090],
        [0.4687, 0.3311],
        [0.4891, 0.1428],
        [0.7678, 0.0872]], device='cuda:0', grad_fn=<AddmmBackward0>)

but it consumes a huge amount of memory on the GPU as long as result is held by the Python interpreter

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   6296 MiB |   6376 MiB |   2387 GiB |   2381 GiB |
|       from large pool |   6294 MiB |   6374 MiB |   2387 GiB |   2380 GiB |
|       from small pool |      2 MiB |      2 MiB |      0 GiB |      0 GiB |
|---------------------------------------------------------------------------|
| Active memory         |   6296 MiB |   6376 MiB |   2387 GiB |   2381 GiB |
|       from large pool |   6294 MiB |   6374 MiB |   2387 GiB |   2380 GiB |
|       from small pool |      2 MiB |      2 MiB |      0 GiB |      0 GiB |
|---------------------------------------------------------------------------|
| Requested memory      |   6251 MiB |   6331 MiB |   2381 GiB |   2375 GiB |
|       from large pool |   6249 MiB |   6329 MiB |   2381 GiB |   2375 GiB |
|       from small pool |      2 MiB |      2 MiB |      0 GiB |      0 GiB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   6586 MiB |   6586 MiB |   6586 MiB |      0 B   |
|       from large pool |   6582 MiB |   6582 MiB |   6582 MiB |      0 B   |
|       from small pool |      4 MiB |      4 MiB |      4 MiB |      0 B   |
|---------------------------------------------------------------------------|
| Non-releasable memory |  99669 KiB | 606224 KiB | 760864 MiB | 760767 MiB |
|       from large pool |  98000 KiB | 603804 KiB | 760318 MiB | 760223 MiB |
|       from small pool |   1669 KiB |   2567 KiB |    546 MiB |    544 MiB |
|---------------------------------------------------------------------------|
| Allocations           |     847    |     856    |  140516    |  139669    |
|       from large pool |     410    |     413    |   92010    |   91600    |
|       from small pool |     437    |     522    |   48506    |   48069    |
|---------------------------------------------------------------------------|
| Active allocs         |     847    |     856    |  140516    |  139669    |
|       from large pool |     410    |     413    |   92010    |   91600    |
|       from small pool |     437    |     522    |   48506    |   48069    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     202    |     202    |     202    |       0    |
|       from large pool |     200    |     200    |     200    |       0    |
|       from small pool |       2    |       2    |       2    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      37    |      69    |   46775    |   46738    |
|       from large pool |      29    |      49    |   36717    |   36688    |
|       from small pool |       8    |      25    |   10058    |   10050    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|

In particular, the active and allocated memory increased by about 5 GB, which seems incredible.

I have 150 or so strings I’d like to run inference on, and I can work around this by doing a batch of 8, converting the result from a PyTorch tensor to a NumPy array, and then doing another batch.
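The loop looks roughly like this (a sketch; batch_size and all_logits are just the names I use here):

import numpy as np

batch_size = 8
all_logits = []
for i in range(0, len(text), batch_size):
    inputs = tokenizer(text[i:i + batch_size], return_tensors="pt",
                       padding="max_length", truncation=True).to(0)
    # detach and copy each batch off the GPU before starting the next one
    all_logits.append(model(**inputs).logits.detach().cpu().numpy())
result = np.concatenate(all_logits)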

I am left wondering, though: what the hell is going on here? Is this just the way it works with this library, or am I doing something silly that’s making my memory consumption expand by a factor of 156 million?

You seem to have forgotten to put your model forward in a with torch.no_grad() context manager to make sure that PyTorch does not save all the activations for the backward pass.
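Something along these lines (a sketch using the names from your snippets):

with torch.no_grad():
    # no autograd graph is built, so activations are freed as the forward runs
    result = model(**inputs).logits

The grad_fn=<AddmmBackward0> on your result tensor is the telltale sign: result is still attached to the autograd graph, and that graph keeps all the intermediate activations of the forward pass alive on the GPU for as long as result is.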

That worked, thanks! I looked at the PyTorch manual, and this is documented here:

https://pytorch.org/docs/stable/notes/autograd.html

There is also an “inference mode”, which can be entered with a similar context manager:

https://pytorch.org/docs/stable/generated/torch.inference_mode.html
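It is used the same way (a sketch; I haven’t measured the difference myself):

with torch.inference_mode():
    # tensors created here record no autograd state at all
    result = model(**inputs).logits

According to the docs, inference mode also skips view tracking and version-counter bookkeeping, so it can be slightly faster than no_grad.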
