GPU memory usage is twice (2x) what I calculated based on number of parameters and floating point precision

Also note that the CUDA driver reserves some memory on the GPU when its context is initialized, so nvidia-smi overstates what your tensors actually use. It's better to use torch.cuda.memory_allocated() here, which reports only PyTorch's tensor allocations.

E.g. just allocating a tiny tensor on the GPU will show 152MiB in nvidia-smi:

import time

import torch

# Moving even a tiny tensor to the GPU forces CUDA context creation,
# which accounts for the large baseline shown in nvidia-smi.
t = torch.tensor([0., 1.]).cuda()

# Keep the process alive so you can inspect it with nvidia-smi.
time.sleep(10)
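To separate your tensors' footprint from the context overhead, a minimal sketch (assuming a CUDA-capable machine; the exact allocated figure depends on PyTorch's caching allocator, which rounds small allocations up):

```python
import torch

def tensor_bytes(t: torch.Tensor) -> int:
    # Raw storage the tensor's elements need, ignoring allocator rounding.
    return t.nelement() * t.element_size()

t = torch.tensor([0., 1.])          # float32 -> 4 bytes per element
print(tensor_bytes(t))              # 8 bytes of actual data

if torch.cuda.is_available():
    t = t.cuda()
    # Counts only tensor allocations made by PyTorch, not the CUDA
    # context overhead that nvidia-smi includes (often 100+ MiB).
    print(torch.cuda.memory_allocated())
```

Comparing tensor_bytes() against nvidia-smi's total is what makes the numbers look inflated; memory_allocated() is the figure to check against your parameter-count math.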