While the model is being loaded, it is first allocated in RAM and then moved to the GPU. However, I currently have only 128 GB of RAM available, but two NVIDIA A100 80 GB GPUs. I would like to skip loading the model into RAM first and instead load it directly onto the GPUs.
This is the code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # automatically places model layers on available devices
)

prompt = "xxxxx."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    temperature=0.7,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
I tried setting device_map to "auto" or "cuda", but the model is still loaded into RAM first.
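For reference, here is a minimal sketch of the "cuda" variant I tried. The low_cpu_mem_usage=True flag is an assumption on my part; as far as I know it is already implied when device_map is set, and it is supposed to stream checkpoint shards instead of building the full state dict in CPU RAM.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,                   # checkpoint id, defined as above
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,       # assumption: stream shards instead of a full CPU state dict
    device_map="cuda",            # also tried device_map="auto"
)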
When device_map= is used, Accelerate manages the memory, so I think it is more reliable to specify it with device for your use case.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    # device_map="auto",  # automatically places model layers on available devices
    # device_map=0,       # maybe cuda:0 only
    device="cuda",        # "cuda:0" is also ok
).to("cuda")  # if necessary
I see, so it’s the Transformers library’s design that is using up the RAM…
I thought about offloading to disk as a last resort, but RAM is better than that…
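If it does come to that, here is a minimal sketch of the disk-offload route I have in mind. The offload_folder and offload_state_dict arguments and the max_memory caps are my assumptions about the relevant knobs, and the values are only placeholders.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "75GiB", 1: "75GiB", "cpu": "100GiB"},  # hypothetical per-device caps
    offload_folder="offload",      # layers that don't fit anywhere are spilled to this directory
    offload_state_dict=True,       # also offload the temporary CPU copy of the weights to disk
)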
RAM is consumed, but I don’t think it consumes that much…
To begin with, it’s not as if there is plenty of VRAM but not enough RAM in your setup.
In rare cases, for example on old versions of Ubuntu, a wrong swap configuration could cause a SIGKILL even when there was enough RAM.
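To see how much RAM (and swap) is actually being consumed while the model loads, something like the following sketch should work. psutil is an extra dependency I am assuming here, and note that the after-load RSS may be lower than the peak reached during loading.

import psutil
import torch
from transformers import AutoModelForCausalLM

def rss_gib() -> float:
    # Resident set size of the current process, in GiB.
    return psutil.Process().memory_info().rss / 1024**3

print(f"before load: RSS {rss_gib():.1f} GiB, swap used {psutil.swap_memory().used / 1024**3:.1f} GiB")

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

print(f"after load:  RSS {rss_gib():.1f} GiB, swap used {psutil.swap_memory().used / 1024**3:.1f} GiB")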