Loading a model takes a lot of RAM, and moving it to CUDA doesn't free that RAM

I’m trying to fine-tune a 1.3B-parameter model, so I’m looking for ways to reduce RAM usage.

I noticed that the model takes a lot of RAM once it is loaded.

After the model is loaded:
11.51 GB total memory used
0.0 GB used by torch objects on GPU
2 MiB total mem used on GPU
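
Figures like the first two can be collected along the following lines. This is a minimal sketch, not the notebook's actual code: it assumes a Linux host (it reads `/proc/meminfo`), and `host_used_gb`, `gpu_allocated_gb`, and `report` are hypothetical helper names. The third figure presumably comes from a tool such as `nvidia-smi` and is not covered here.

```python
try:
    import torch  # optional, so the sketch also runs where PyTorch is missing
except ImportError:
    torch = None


def host_used_gb():
    """System-wide RAM in use (MemTotal - MemAvailable), in GB. Linux only."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # most entries are reported in kB
    return (info["MemTotal"] - info["MemAvailable"]) * 1024 / 1e9


def gpu_allocated_gb():
    """Memory held by live torch tensors on the current GPU, in GB."""
    if torch is None or not torch.cuda.is_available():
        return 0.0
    return torch.cuda.memory_allocated() / 1e9


def report(label):
    """Print both figures with a label such as 'after load'."""
    print(f"{label}: {host_used_gb():.2f} GB total memory used, "
          f"{gpu_allocated_gb():.2f} GB used by torch objects on GPU")


report("now")
```

Calling `report(...)` right after loading and again after moving the model to the GPU gives directly comparable numbers. Note that `torch.cuda.memory_allocated()` counts only tensors, not the CUDA context itself.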

When I move it to the GPU, it:
a) takes only 5 GB of VRAM (perhaps another 1.3 GB is taken by Torch itself).
b) doesn’t free any RAM, and even takes about 2.5 GB more.

So the problems I see are:
a) The model occupies much more space in RAM than in VRAM.
b) Moving it to CUDA doesn’t free the RAM.
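
For scale, here is a back-of-the-envelope estimate of the raw weight size. It assumes the model has exactly 1.3 billion parameters and that they are stored as 32-bit floats; the dtype is an assumption, and `weights_size_gb` is a hypothetical helper name.

```python
def weights_size_gb(num_params, bytes_per_param):
    """Size of the raw parameter data in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 1_300_000_000  # "1.3B", taken as an exact count (assumption)

fp32_gb = weights_size_gb(NUM_PARAMS, 4)  # 4 bytes per float32 parameter
fp16_gb = weights_size_gb(NUM_PARAMS, 2)  # 2 bytes per float16 parameter

print(f"float32 weights: {fp32_gb:.1f} GB")  # float32 weights: 5.2 GB
print(f"float16 weights: {fp16_gb:.1f} GB")  # float16 weights: 2.6 GB
```

If the float32 assumption holds, the roughly 5 GB seen in VRAM is about what the weights alone should take, while the 11.51 GB measured in RAM is more than double that (although that figure also includes everything else running on the machine).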

You can check this behavior in this notebook: Google Colaboratory

You will need a “Large memory” instance, since the transfer to CUDA even overshoots the 13 GB RAM limit. I use Torch 1.7.0+cu110 because the instance has CUDA 11.2, but with the default 1.9.0+cu102 the behavior is more or less the same.

Running the Python garbage collector doesn’t help either.
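
For reference, a sketch of the kind of cleanup meant here. It is not the exact notebook cell, and `release_unused_memory` is a hypothetical name. `gc.collect()` can only reclaim objects that nothing references any more, and `torch.cuda.empty_cache()` releases cached GPU blocks, not host RAM.

```python
import gc

try:
    import torch  # optional, so the sketch also runs where PyTorch is missing
except ImportError:
    torch = None


def release_unused_memory():
    """Collect unreachable Python objects, then drop torch's cached GPU blocks.

    Returns the number of unreachable objects the collector found.
    """
    unreachable = gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()  # affects only the GPU caching allocator
    return unreachable


found = release_unused_memory()
print(f"garbage collector found {found} unreachable objects")
```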

Any thoughts on this?