How to merge an ORPO fine-tuned Llama 3 model without OOM?

Hi!
I am trying to do a simple ORPO fine-tuning using a very small dataset: “celsowm/auryn_dpo_orpo”.

The problem is, when I try to do this after training:

import torch
from transformers import AutoModelForCausalLM
from trl import setup_chat_format
from peft import PeftModel

# reload the base model in fp16, letting accelerate place the layers automatically
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto"
)

model, tokenizer = setup_chat_format(model, tokenizer)

# load the ORPO adapter and fold it into the base weights
model = PeftModel.from_pretrained(model, new_model)
model = model.merge_and_unload()

I get an OOM error!

The complete Kaggle notebook is here: https://www.kaggle.com/code/celsofontes/fine-tunning-orpo

Any hints?

This worked for me:

import gc

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
gc.collect()

model, tokenizer = setup_chat_format(model, tokenizer)
gc.collect()

model = PeftModel.from_pretrained(model, new_model)
gc.collect()

model = model.merge_and_unload()
gc.collect()
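
Note that gc.collect() only frees Python-level objects; PyTorch keeps its own cache of CUDA blocks, which can additionally be returned to the driver with torch.cuda.empty_cache(). A small sketch (it may or may not be enough in this case):

import gc
import torch

gc.collect()
# release cached, unused CUDA memory back to the driver
torch.cuda.empty_cache()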

Unfortunately, it is still not working:

CUDA out of memory. Tried to allocate 1.96 GiB. GPU 0 has a total capacty of 15.89 GiB of which 670.12 MiB is free. Process 2065 has 15.22 GiB memory in use. Of the allocated memory 14.80 GiB is allocated by PyTorch, and 120.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
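
For what it's worth, the allocator setting mentioned in the error is passed through the PYTORCH_CUDA_ALLOC_CONF environment variable, which must be set before the first CUDA allocation. The value below is only illustrative:

import os

# arbitrary example value; tune for your workload
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch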

I even tried the new offload_buffers parameter:

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict = True,
    torch_dtype=torch.float16,
    device_map="auto",
    offload_buffers=True
)

But when I tried to merge, OOM again :face_exhaling:

I’ve discovered the fix: instead of

device_map="auto"

use:

device_map="cpu"

This way the base model is loaded into system RAM and the merge runs on the CPU instead of the 16 GiB GPU.
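
Putting it together, here is a minimal merge-on-CPU sketch based on the steps above. The checkpoint name, adapter directory, and output path are placeholders, not the exact values from the notebook:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import setup_chat_format
from peft import PeftModel

# placeholders: base_model is the original Llama 3 checkpoint,
# new_model is the directory where the ORPO adapter was saved
base_model = "meta-llama/Meta-Llama-3-8B"
new_model = "llama3-orpo-auryn"

tokenizer = AutoTokenizer.from_pretrained(base_model)

# load the fp16 base weights on the CPU so the merge uses system RAM, not the GPU
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="cpu",
)

model, tokenizer = setup_chat_format(model, tokenizer)

# attach the ORPO adapter and fold it into the base weights
model = PeftModel.from_pretrained(model, new_model)
model = model.merge_and_unload()

# save the merged model; it can be reloaded later with device_map="auto" for inference
model.save_pretrained("llama3-orpo-merged")
tokenizer.save_pretrained("llama3-orpo-merged")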
