Fine-tuning Llama-7B

I am trying to fine-tune Llama-7B with a batch size of 1 (so the data itself is not the memory problem). I am using DeepSpeed with the Hugging Face Trainer. The issue I am facing is that DeepSpeed doesn't let me move the model to a device myself (it places it automatically, causing OOM errors), and as a consequence, non-training forward passes through the model (e.g., to obtain logits) take forever because they run on the CPU.

So what I am trying to do is withhold one GPU so that the model can execute forward calls on it after DeepSpeed has modified the model.

I have tried using `deepspeed --include localhost:<GPUs I want DeepSpeed to use>`, but this sets `CUDA_VISIBLE_DEVICES` to exclude the GPU I want to withhold, and DeepSpeed automatically uses all of the GPUs in `--include`. Is there any way I can solve this?
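
For concreteness, here is a minimal sketch of the kind of setup I am after: a separate inference process pinned to the withheld GPU via `CUDA_VISIBLE_DEVICES`, while DeepSpeed trains on the other GPUs. The model name and device index below are just placeholders for my actual setup:

```python
# inference_worker.py -- minimal sketch; assumes GPUs 0-2 go to DeepSpeed
# and physical GPU 3 is the withheld one (placeholder values).
import os

# Pin this process to the withheld GPU *before* importing torch, so CUDA
# only ever sees that one device.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda")  # "cuda" now resolves to the single visible device, GPU 3
model.eval()

# Forward pass to obtain logits on the reserved GPU instead of the CPU
with torch.no_grad():
    inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda")
    logits = model(**inputs).logits
print(logits.shape)
```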

I had the same problem, but I solved it by using this repo: GitHub - mallorbc/Finetune_LLMs: Repo for fine-tuning GPTJ and other GPT models
There is a walkthrough of how to use it on YouTube: How To Fine-tune The LLaMA Models(GPT3 Alternative) - YouTube

Also, try reducing the block size from 1024 to 128 and using a batch size of 64.
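
For example, with the standard Hugging Face causal-LM preprocessing (the `group_texts` helper used in `run_clm.py`-style scripts), that would look roughly like the sketch below; the output path and config file name are placeholders:

```python
# Rough sketch of the suggested settings; paths and file names are placeholders.
from transformers import TrainingArguments

block_size = 128  # down from 1024: activation memory scales with sequence length

def group_texts(examples):
    """Concatenate tokenized texts, then split them into block_size chunks."""
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

training_args = TrainingArguments(
    output_dir="llama-7b-finetune",   # placeholder output path
    per_device_train_batch_size=64,   # the suggested batch size
    deepspeed="ds_config.json",       # placeholder DeepSpeed config
)
```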