Hi, I’ve been looking into this problem all day, but I cannot find a good practice for running multi-GPU LLM inference. The information available about DP and the DeepSpeed documentation are badly outdated.
I just want to do the most naive data parallelism for multi-GPU LLM inference (llama).
My code is based on some very basic llama generation code:
model = AutoModelForCausalLM.from_pretrained(
    llama_model_id,
    config=config,
    torch_dtype=torch.float16,
    load_in_4bit=True,
    device_map='auto',
    use_safetensors=False,
)
tokenizer = AutoTokenizer.from_pretrained(llama_model_id)
model_input = tokenizer(
    data_processed["input"],
    return_tensors='pt',
    add_special_tokens=False,
    padding=True,
).to('cuda')
model.eval()
with torch.no_grad():
    output = model.generate(**model_input, max_new_tokens=300, temperature=0, top_p=0.95, top_k=5)
The machine has 4× RTX 3090 GPUs, and I’m trying to do data parallelism: the data should be sent to the 4 GPUs separately and gathered afterward. However, I don’t see how to implement it with
I tried wrapping the model with DataParallel from torch but got this error: 'DataParallel' object has no attribute 'config'. It looks like the model from Hugging Face cannot be wrapped with DataParallel. I also tried the fine-tuning evaluation mode, Trainer.predict() with predict_with_generate=True; however, I’m not able to set parameters like temperature, and the results are different from
For now I’m doing it the naive way: I split my data into 4 parts, run each part on 1 GPU, and use a for loop to batch the data within each part. This is actually much faster than doing model parallelism (device_map='auto').
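To make the naive approach concrete, here is a minimal sketch of just the data-handling part: splitting the inputs into one shard per GPU and batching within each shard. The helper names `shard` and `batches` are made up for illustration and are not part of transformers or torch; the model call itself is left out, since each worker would run its own `model.generate` on its shard.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def shard(items: Sequence[T], num_shards: int) -> List[List[T]]:
    """Split items into num_shards contiguous shards of near-equal size.

    Hypothetical helper: one shard per GPU, original order preserved,
    so concatenating the shards gives back the input.
    """
    base, extra = divmod(len(items), num_shards)
    shards: List[List[T]] = []
    start = 0
    for i in range(num_shards):
        # The first `extra` shards take one additional item each.
        size = base + (1 if i < extra else 0)
        shards.append(list(items[start:start + size]))
        start += size
    return shards


def batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield list(items[i:i + batch_size])


if __name__ == "__main__":
    prompts = [f"prompt {i}" for i in range(10)]
    for gpu_id, part in enumerate(shard(prompts, 4)):
        for batch in batches(part, 2):
            # In the real script, this batch would be tokenized and passed
            # to model.generate on device f"cuda:{gpu_id}".
            print(gpu_id, batch)
```

Because the shards are contiguous and ordered, gathering afterward is just concatenating the per-GPU outputs in GPU order.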
However, the same model is loaded 4 times from disk, memory usage explodes, and it’s extremely inconvenient. Is there a way to implement the simplest DataParallel with Hugging Face models? Any ideas would be appreciated!