Loading model directly to GPU omitting RAM

Hi,

I have a problem when loading an LLM.

While the model is being loaded, it is first allocated in RAM and only then moved to the GPU. However, I currently have only 128 GB of RAM available, alongside two NVIDIA A100 80GB GPUs. I would like to skip loading the model into RAM first and instead load it directly onto the GPUs.

This is the code:

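import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name is defined earlier (not shown in this post)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # automatically places model layers on available devices
)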

prompt = "xxxxx."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    temperature=0.7
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

I tried with device_map set as “auto” or “cuda”, but still the model is first loaded to RAM.

With device_map=, accelerate manages memory placement, so I think it is more reliable to move the model to the device explicitly for your use case.

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    # device_map="auto"  # automatically places model layers on available devices
    # device_map=0  # maybe cuda:0 only
).to("cuda")  # "cuda:0" is also ok

Thanks, but I have already tried these options. The effect is the same: the model is still loaded into RAM first.

I see, so it is the way Transformers itself loads checkpoints that is using up the RAM…
I thought about offloading to disk as a last resort, but RAM is better than that…
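
For reference, here is a minimal sketch of what capping per-device memory and allowing disk offload might look like. `max_memory` and `offload_folder` are keyword arguments that `from_pretrained` accepts when `device_map` is set. The helper name `build_load_kwargs`, the GiB budgets and the folder name are assumptions for illustration, not values tested on this machine. These caps decide where the weights end up; they may not limit the temporary RAM used while each shard is read.

```python
def build_load_kwargs(num_gpus=2, gpu_budget_gib=75, cpu_budget_gib=32,
                      offload_folder="offload"):
    """Build keyword arguments for from_pretrained (hypothetical helper)."""
    # One entry per GPU index, plus a cap on what may be placed in CPU RAM.
    max_memory = {i: f"{gpu_budget_gib}GiB" for i in range(num_gpus)}
    max_memory["cpu"] = f"{cpu_budget_gib}GiB"
    return {
        "device_map": "auto",
        "max_memory": max_memory,
        "low_cpu_mem_usage": True,
        "offload_folder": offload_folder,  # assumed directory name
    }


if __name__ == "__main__":
    print(build_load_kwargs())
    # Usage (needs transformers, torch and the model from the question):
    # model = AutoModelForCausalLM.from_pretrained(
    #     model_name, torch_dtype=torch.bfloat16, **build_load_kwargs())
```

Leaving a few GiB of headroom below the 80 GB per card is a common precaution, since activations and the CUDA context also need space.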

Thank you, the second link seems to contain a possible solution. I will try it and let you know.

This is strange.

I added “low_cpu_mem_usage=True”, but still received:

Loading checkpoint shards:  40%|████      | 12/30 [02:35<03:00, 10.05s/it]
Process finished with exit code 137 (interrupted by signal 9:SIGKILL)
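
Exit code 137 follows the usual convention of 128 plus the signal number, so it corresponds to signal 9 (SIGKILL). On Linux this is commonly what the kernel's out-of-memory killer sends, although the log alone does not prove that. A small sketch for decoding such codes, where the function name `describe_exit_code` is only illustrative:

```python
import signal


def describe_exit_code(code):
    """Return the signal name if the exit code follows the 128+N convention."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            return None  # not a known signal number on this platform
    return None


if __name__ == "__main__":
    print(describe_exit_code(137))
```

If the OOM killer was responsible, the kernel log (for example via `dmesg`) normally contains a matching entry.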

RAM is consumed during loading, but I don’t think it should consume that much…
To begin with, this is not a case of having plenty of VRAM but too little RAM.
In rare cases, for example on old versions of Ubuntu, a wrong swap configuration could cause a SIGKILL even when there was enough RAM.
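
To check how much RAM and swap are actually available before loading, something like the following might help. It is a sketch that assumes a Linux system exposing `/proc/meminfo`; the function names `parse_meminfo` and `summarize` are made up for this example.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def summarize(info):
    """Return available RAM and swap sizes converted from kB to GiB."""
    def to_gib(kb):
        return kb / 1024 ** 2

    return {
        "ram_available_gib": to_gib(info.get("MemAvailable", 0)),
        "swap_total_gib": to_gib(info.get("SwapTotal", 0)),
        "swap_free_gib": to_gib(info.get("SwapFree", 0)),
    }


if __name__ == "__main__":
    if os.path.exists("/proc/meminfo"):  # Linux only
        with open("/proc/meminfo") as f:
            print(summarize(parse_meminfo(f.read())))
```

Running this once before loading and again while the checkpoint shards are loading (from a second terminal) shows whether available RAM really drops toward zero before the process is killed.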