I am trying to fine-tune Llama-7B with a batch size of 1 (so the data itself is not the memory issue). I am using DeepSpeed with the HuggingFace Trainer. The problem is that DeepSpeed doesn't let me place the model on a device myself (it handles device placement automatically, and interfering causes OOM errors). As a consequence, any non-training forward passes I run through the model to obtain logits execute on the CPU and take forever.
So what I am trying to do is withhold one GPU from DeepSpeed, so that after DeepSpeed has wrapped the model, I can run forward calls on that reserved GPU.
I have tried using `deepspeed --include localhost:<GPUs I want DeepSpeed to use>`, but this sets `CUDA_VISIBLE_DEVICES` to exclude the withheld GPU, so my process can't see it at all, and DeepSpeed automatically uses every GPU listed in `--include`. Is there any way I can solve this?
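For concreteness, here is roughly what I'm running (the script name, config file, and GPU indices are placeholders; assume a 3-GPU machine where I want to keep GPU 2 free for inference):

```shell
# Launch training with DeepSpeed on GPUs 0 and 1 only,
# hoping to leave GPU 2 free for separate forward passes.
deepspeed --include localhost:0,1 train.py --deepspeed ds_config.json

# Inside the launched processes, the launcher has already set:
#   CUDA_VISIBLE_DEVICES=0,1
# so torch.cuda.device_count() reports 2, GPU 2 is invisible,
# and "cuda:2" cannot be addressed from within the training process.
```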