I have noticed that when I load the 70B model (specifically LLaMA-2) onto the CPU with both the `low_cpu_mem_usage=True` and `torch_dtype="auto"` flags, it has almost no effect on CPU memory usage. However, if I remove either of these flags, loading consumes a significant amount of memory. I am curious about the reason behind this behavior. Is there any memory-mapping happening in the background? And if so, when is it triggered? I would really appreciate it if you could help me understand this better.
Thank you for your time!
Hi @hanguo, I tested it and it has something to do with the safetensors format + fp16 (`torch_dtype="auto"` will set `torch_dtype=torch.float16`) + `low_cpu_mem_usage=True`. Memory consumption also skyrockets when we use the PyTorch bin format (pytorch bin + fp16 + `low_cpu_mem_usage=True`). Maybe try another channel, as this is related to the integration of llama2 and safetensors?
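For some intuition on why the safetensors path can look almost free in terms of resident memory: safetensors files are read via memory-mapping, so the OS reserves address space for the whole file but only faults pages into physical memory when the data is actually touched. Here is a minimal stdlib sketch of that lazy-loading behavior (this is not the safetensors implementation itself, just an illustration of `mmap` semantics; file name and size are arbitrary for the demo):

```python
import mmap
import os
import tempfile

# Create a dummy "weights" file (16 MiB of zeros) for the demo.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * (16 * 1024 * 1024))

with open(path, "rb") as f:
    # Memory-map the file: this reserves virtual address space
    # but reads no data from disk yet.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    file_size = len(mm)   # full file size is visible immediately
    first_byte = mm[0]    # touching a byte faults in only that page (~4 KiB)
    mm.close()
```

In the same way, loading a safetensors checkpoint can map tens of gigabytes without the process's resident set growing much; memory is consumed only as tensors are actually materialized.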
Thanks for the response! What's a good alternative channel for questions like this?
Maybe the transformers channel, as they might have more insight into safetensors and how llama was implemented in transformers!