I have noticed that when I load the 70B model (specifically LLaMA-2) onto the CPU with the `low_cpu_mem_usage=True` and `torch_dtype="auto"` flags, loading has almost no effect on CPU memory usage. However, if I remove either of these flags, it consumes a significant amount of memory. I am curious about the reason behind this behavior. Is there any memory-mapping happening in the background, and if so, when does it trigger? I would really appreciate it if you could help me understand this better.
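To make the question concrete: memory-mapping means a file's bytes are not copied into RAM up front; the OS only faults physical pages in when a region is actually touched, so resident memory stays low until the tensors are read. Here is a minimal stdlib sketch of that mechanism (a generic illustration, not the actual transformers/safetensors internals; the 64 MiB dummy file stands in for a weights shard):

```python
import mmap
import os
import tempfile

# Create a 64 MiB file of zeros to stand in for a weights shard.
path = os.path.join(tempfile.mkdtemp(), "shard.bin")
with open(path, "wb") as f:
    f.truncate(64 * 1024 * 1024)

with open(path, "rb") as f:
    # mmap maps the file into virtual address space without reading it;
    # physical pages are only faulted in when a slice is accessed.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_kb = mm[:1024]  # only now does the OS page in this region
    print(len(mm), len(first_kb))  # 67108864 1024
    mm.close()
```

The mapped length equals the full file size, but resident memory grows only with the regions you actually slice.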
This was discussed in a different channel, but I was advised to post it here as well.
@marcsun13 noticed that it might have something to do with the safetensors format + fp16 (`torch_dtype="auto"` will set `torch_dtype=torch.float16`) + `low_cpu_mem_usage=True`. Memory consumption also skyrockets when we use the PyTorch bin format (pytorch bin + fp16 + `low_cpu_mem_usage=True`).
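For what it's worth, the safetensors file layout is what makes the low-memory path possible at all: the file is an 8-byte little-endian header length, a JSON header with each tensor's dtype/shape/byte offsets, and then one flat buffer of raw tensor bytes, so individual tensors can be sliced straight out of a memory-mapped file with no unpickling (unlike the pickle-based pytorch bin format). A minimal stdlib sketch of that layout (a hand-rolled writer/reader for illustration, not the safetensors library itself; names and the F16 dtype label are just placeholders):

```python
import json
import struct

def write_safetensors_like(path, tensors):
    """Write {name: raw_bytes} in a safetensors-style layout (F16 dtype assumed)."""
    header, offset, payload = {}, 0, b""
    for name, raw in tensors.items():
        header[name] = {"dtype": "F16", "shape": [len(raw) // 2],
                        "data_offsets": [offset, offset + len(raw)]}
        offset += len(raw)
        payload += raw
    blob = json.dumps(header).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)))  # 8-byte little-endian header size
        f.write(blob)                          # JSON header
        f.write(payload)                       # one flat buffer of raw tensor bytes

def read_tensor(path, name):
    """Fetch one tensor's raw bytes by seeking, without loading the whole file."""
    with open(path, "rb") as f:
        (hlen,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(hlen))
        start, end = header[name]["data_offsets"]  # offsets are relative to the data buffer
        f.seek(8 + hlen + start)
        return f.read(end - start)

write_safetensors_like("demo.safetensors", {"w": b"\x01\x02\x03\x04", "b": b"\x05\x06"})
print(read_tensor("demo.safetensors", "b"))  # b'\x05\x06'
```

Because every tensor is a contiguous, known byte range, a loader can mmap the file and hand each tensor a zero-copy view; a pickled `.bin` has to be deserialized as a whole, which is presumably why that path blows up memory even with `low_cpu_mem_usage=True`.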
Thank you for your time!