Hi, I’ve been looking into this problem all day, but I can’t find a good, current recipe for running multi-GPU LLM inference; the documentation I can find about DP/DeepSpeed is badly outdated.
I just want to do the most naive data parallelism for multi-GPU LLM inference (Llama).
My code is based on a very basic Llama generation script:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# llama_model_id, config, and data_processed are defined earlier in my script
model = AutoModelForCausalLM.from_pretrained(
    llama_model_id,
    config=config,
    torch_dtype=torch.float16,
    load_in_4bit=True,
    device_map='auto',
    use_safetensors=False,
)
tokenizer = AutoTokenizer.from_pretrained(llama_model_id)
model_input = tokenizer(data_processed["input"], return_tensors='pt',
                        add_special_tokens=False, padding=True).to('cuda')

model.eval()
with torch.no_grad():
    output = model.generate(**model_input, max_new_tokens=300,
                            temperature=0,
                            top_p=0.95,
                            top_k=5)
The machine has 4×RTX 3090s, and I’m trying to do data parallelism: I want the data split across the 4 GPUs, run through the model separately, and gathered afterward. However, I don’t see how to implement that with model.generate() specifically.
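Just to make the goal concrete, here’s pseudocode for what I have in mind (split, gather, and the per-GPU model replicas are made-up names; that is exactly the part I don’t know how to implement):

# pseudocode only, these are not real APIs:
# shards = split(data_processed["input"], num_shards=4)                # one shard per GPU
# outputs_i = model_replica_on_gpu_i.generate(**tokenize(shards[i]))   # all 4 GPUs run in parallel
# results = gather(outputs_0, outputs_1, outputs_2, outputs_3)         # merge back in the original order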
I tried wrapping the model with torch’s DataParallel, but I get the error 'DataParallel' object has no attribute 'config'; it looks like a Hugging Face model can’t simply be wrapped in DataParallel.
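For reference, this is roughly the wrapping that fails (reusing model and model_input from the snippet above; reconstructed from memory):

import torch.nn as nn

dp_model = nn.DataParallel(model)               # wrap the HF model across the 4 GPUs
output = dp_model.generate(**model_input,       # fails with an AttributeError: DataParallel only parallelizes
                           max_new_tokens=300)  # forward(), so attributes like .config / .generate are not
                                                # forwarded (they live on dp_model.module)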
I also tried the fine-tuning evaluation route, Trainer.predict() with predict_with_generate=True, but there I’m not able to set generation parameters like temperature, and the results are different from what model.generate() gives me.
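In case the setup matters, this is roughly the Trainer route I tried (reconstructed from memory; as far as I know predict_with_generate comes from Seq2SeqTrainingArguments, and eval_dataset here stands for my tokenized dataset):

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir='tmp_eval',
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    fp16=True,
)
trainer = Seq2SeqTrainer(model=model, args=training_args, tokenizer=tokenizer)
predictions = trainer.predict(eval_dataset)   # I couldn't find where to pass temperature / top_p / top_k here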
So I’m doing it the naive way: split my data into 4 shards, run each shard on one GPU, and use a for loop to batch the data within each shard (rough sketch below). This is actually much faster than model parallelism (device_map='auto').
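Each of the 4 processes runs something like this (launched separately per GPU; llama_model_id and data_processed are the same names as in the snippet above, and batch_size is a placeholder for whatever batch size I use):

# launched once per GPU, e.g.
#   CUDA_VISIBLE_DEVICES=0 python infer.py --shard 0 --num-shards 4
#   CUDA_VISIBLE_DEVICES=1 python infer.py --shard 1 --num-shards 4
#   ...
import argparse
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

parser = argparse.ArgumentParser()
parser.add_argument('--shard', type=int, required=True)
parser.add_argument('--num-shards', type=int, default=4)
args = parser.parse_args()

# every process loads its own full copy of the model
model = AutoModelForCausalLM.from_pretrained(llama_model_id, torch_dtype=torch.float16,
                                             load_in_4bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(llama_model_id)
model.eval()

my_split = data_processed["input"][args.shard::args.num_shards]   # this process's slice of the data
outputs = []
with torch.no_grad():
    for i in range(0, len(my_split), batch_size):                 # the for loop over batches
        batch = tokenizer(my_split[i:i + batch_size], return_tensors='pt',
                          add_special_tokens=False, padding=True).to('cuda')
        outputs.append(model.generate(**batch, max_new_tokens=300, temperature=0,
                                      top_p=0.95, top_k=5))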
However, the same model gets loaded 4 times from disk, memory usage explodes, and it’s extremely inconvenient. Is there a way to implement the simplest form of data parallelism with Hugging Face models? Any ideas would be appreciated!