I started with a pretrained model:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2).to(0)
Then I fine-tuned it. To run inference, I tokenized some text:
inputs = tokenizer(text[0:8], return_tensors="pt", padding="max_length", truncation=True).to(0)
Here I am taking the first 8 strings out of a list. If I take many more than that, inference crashes with an out-of-memory error.
If I print the memory summary before running inference, I get:
| PyTorch CUDA memory summary, device ID 0 |
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
| Allocated memory | 1296 MiB | 6376 MiB | 2387 GiB | 2386 GiB |
| from large pool | 1294 MiB | 6374 MiB | 2387 GiB | 2385 GiB |
| from small pool | 1 MiB | 2 MiB | 0 GiB | 0 GiB |
| Active memory | 1296 MiB | 6376 MiB | 2387 GiB | 2386 GiB |
| from large pool | 1294 MiB | 6374 MiB | 2387 GiB | 2385 GiB |
| from small pool | 1 MiB | 2 MiB | 0 GiB | 0 GiB |
| Requested memory | 1255 MiB | 6331 MiB | 2381 GiB | 2380 GiB |
| from large pool | 1254 MiB | 6329 MiB | 2381 GiB | 2380 GiB |
| from small pool | 1 MiB | 2 MiB | 0 GiB | 0 GiB |
| GPU reserved memory | 6586 MiB | 6586 MiB | 6586 MiB | 0 B |
| from large pool | 6582 MiB | 6582 MiB | 6582 MiB | 0 B |
| from small pool | 4 MiB | 4 MiB | 4 MiB | 0 B |
| Non-releasable memory | 196471 KiB | 606224 KiB | 761869 MiB | 761677 MiB |
| from large pool | 195996 KiB | 603804 KiB | 761322 MiB | 761131 MiB |
| from small pool | 475 KiB | 2567 KiB | 546 MiB | 546 MiB |
| Allocations | 610 | 856 | 140516 | 139906 |
| from large pool | 227 | 413 | 92010 | 91783 |
| from small pool | 383 | 522 | 48506 | 48123 |
| Active allocs | 610 | 856 | 140516 | 139906 |
| from large pool | 227 | 413 | 92010 | 91783 |
| from small pool | 383 | 522 | 48506 | 48123 |
| GPU reserved segments | 202 | 202 | 202 | 0 |
| from large pool | 200 | 200 | 200 | 0 |
| from small pool | 2 | 2 | 2 | 0 |
| Non-releasable allocs | 42 | 69 | 46836 | 46794 |
| from large pool | 34 | 49 | 36752 | 36718 |
| from small pool | 8 | 25 | 10084 | 10076 |
| Oversize allocations | 0 | 0 | 0 | 0 |
| Oversize GPU segments | 0 | 0 | 0 | 0 |
Then I run the inference, and the result is an 8x2 matrix of 16 floats:
tensor([[0.3936, 0.1505],
        [0.4990, 0.4486],
        [0.4969, 0.2942],
        [0.7494, 0.1412],
        [0.4528, 0.1090],
        [0.4687, 0.3311],
        [0.4891, 0.1428],
        [0.7678, 0.0872]], device='cuda:0', grad_fn=<AddmmBackward0>)
But it consumes a huge amount of GPU memory for as long as the variable result is held by the Python interpreter:
| PyTorch CUDA memory summary, device ID 0 |
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
| Allocated memory | 6296 MiB | 6376 MiB | 2387 GiB | 2381 GiB |
| from large pool | 6294 MiB | 6374 MiB | 2387 GiB | 2380 GiB |
| from small pool | 2 MiB | 2 MiB | 0 GiB | 0 GiB |
| Active memory | 6296 MiB | 6376 MiB | 2387 GiB | 2381 GiB |
| from large pool | 6294 MiB | 6374 MiB | 2387 GiB | 2380 GiB |
| from small pool | 2 MiB | 2 MiB | 0 GiB | 0 GiB |
| Requested memory | 6251 MiB | 6331 MiB | 2381 GiB | 2375 GiB |
| from large pool | 6249 MiB | 6329 MiB | 2381 GiB | 2375 GiB |
| from small pool | 2 MiB | 2 MiB | 0 GiB | 0 GiB |
| GPU reserved memory | 6586 MiB | 6586 MiB | 6586 MiB | 0 B |
| from large pool | 6582 MiB | 6582 MiB | 6582 MiB | 0 B |
| from small pool | 4 MiB | 4 MiB | 4 MiB | 0 B |
| Non-releasable memory | 99669 KiB | 606224 KiB | 760864 MiB | 760767 MiB |
| from large pool | 98000 KiB | 603804 KiB | 760318 MiB | 760223 MiB |
| from small pool | 1669 KiB | 2567 KiB | 546 MiB | 544 MiB |
| Allocations | 847 | 856 | 140516 | 139669 |
| from large pool | 410 | 413 | 92010 | 91600 |
| from small pool | 437 | 522 | 48506 | 48069 |
| Active allocs | 847 | 856 | 140516 | 139669 |
| from large pool | 410 | 413 | 92010 | 91600 |
| from small pool | 437 | 522 | 48506 | 48069 |
| GPU reserved segments | 202 | 202 | 202 | 0 |
| from large pool | 200 | 200 | 200 | 0 |
| from small pool | 2 | 2 | 2 | 0 |
| Non-releasable allocs | 37 | 69 | 46775 | 46738 |
| from large pool | 29 | 49 | 36717 | 36688 |
| from small pool | 8 | 25 | 10058 | 10050 |
| Oversize allocations | 0 | 0 | 0 | 0 |
| Oversize GPU segments | 0 | 0 | 0 | 0 |
In particular, active and allocated memory both rose from 1296 MiB to 6296 MiB, an increase of about 5 GB, which seems incredible.
I have 150 or so strings I'd like to run inference on. I can work around the problem by processing a batch of 8, converting the result from a PyTorch tensor to a NumPy array, and then moving on to the next batch.
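The workaround could be sketched roughly as follows. This is only a sketch under assumptions: the actual inference call is not shown in this post, so calling model(**inputs) and reading .logits is a guess at it, and iter_batches, predict_in_batches, and make_predict_batch are hypothetical helper names. Since the printed result carries a grad_fn, the tensor has to be detached before it can be converted to NumPy.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(
    texts: Sequence[str],
    predict_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 8,
) -> List[List[float]]:
    """Run predict_batch on each slice and collect one row of scores per text."""
    rows: List[List[float]] = []
    for batch in iter_batches(texts, batch_size):
        rows.extend(predict_batch(batch))
    return rows


def make_predict_batch(tokenizer, model, device=0):
    """Hypothetical adapter around the tokenizer and model from this post."""

    def predict_batch(batch: Sequence[str]) -> List[List[float]]:
        inputs = tokenizer(
            list(batch), return_tensors="pt", padding="max_length", truncation=True
        ).to(device)
        # Assumption: the scores are the .logits of the model output.
        logits = model(**inputs).logits
        # detach() drops the reference to the autograd graph, cpu() copies the
        # values to host memory, and numpy() exposes them as a NumPy array.
        return logits.detach().cpu().numpy().tolist()

    return predict_batch
```

With the objects above, this would be called as predict_in_batches(text, make_predict_batch(tokenizer, model), batch_size=8), so that only plain Python lists are kept between batches and no GPU tensor stays referenced.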
I am left wondering, though: what the hell is going on here? Is this just the way it works with this library, or am I doing something silly that's making my memory consumption expand by a factor of roughly 80 million (about 5 GB held for 16 four-byte floats)?
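For reference, a rough check of that ratio from the two memory summaries, assuming the 16 outputs are float32 (4 bytes each). The exact factor depends on the element size assumed for the result, so it is an order-of-magnitude figure only.

```python
MIB = 1024 ** 2

# "Allocated memory", current usage, from the two summaries above.
allocated_before_mib = 1296
allocated_after_mib = 6296

# Extra GPU memory held while the result is referenced.
delta_bytes = (allocated_after_mib - allocated_before_mib) * MIB

# The result itself: 8 rows x 2 columns, assumed float32 (4 bytes per value).
result_bytes = 8 * 2 * 4

ratio = delta_bytes // result_bytes
print(f"extra memory: {delta_bytes} bytes")
print(f"result size:  {result_bytes} bytes")
print(f"ratio:        {ratio}")
```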