Hi, I’ve been looking into this problem all day, but I can’t find a good, current recipe for running multi-GPU LLM inference; the documentation I can find about DP/DeepSpeed is badly outdated.
I just want to do the most naive data parallelism for multi-GPU LLM inference (Llama).
My code is based on a very basic Llama generation script:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# llama_model_id, config, and data_processed are defined earlier in my script
model = AutoModelForCausalLM.from_pretrained(
    llama_model_id,
    config=config,
    torch_dtype=torch.float16,
    load_in_4bit=True,
    device_map='auto',
    use_safetensors=False,
)
tokenizer = AutoTokenizer.from_pretrained(llama_model_id)
model_input = tokenizer(data_processed["input"], return_tensors='pt',
                        add_special_tokens=False, padding=True).to('cuda')

model.eval()
with torch.no_grad():
    output = model.generate(**model_input, max_new_tokens=300,
                            temperature=0,
                            top_p=0.95,
                            top_k=5)
The machine has 4×RTX 3090s, and I’m trying to do data parallelism: I want the data split across the 4 GPUs, run through the model separately, and gathered afterward. However, I don’t see how to implement that with model.generate() specifically.
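Just to make the goal concrete, here’s pseudocode for what I have in mind (split, gather, and the per-GPU model replicas are made-up names; that is exactly the part I don’t know how to implement):

# pseudocode only, these are not real APIs:
# shards = split(data_processed["input"], num_shards=4)                # one shard per GPU
# outputs_i = model_replica_on_gpu_i.generate(**tokenize(shards[i]))   # all 4 GPUs run in parallel
# results = gather(outputs_0, outputs_1, outputs_2, outputs_3)         # merge back in the original order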
I tried wrapping the model with torch’s DataParallel, but I get the error 'DataParallel' object has no attribute 'config'; it looks like a Hugging Face model can’t simply be wrapped in DataParallel.
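For reference, this is roughly the wrapping that fails (reusing model and model_input from the snippet above; reconstructed from memory):

import torch.nn as nn

dp_model = nn.DataParallel(model)               # wrap the HF model across the 4 GPUs
output = dp_model.generate(**model_input,       # fails with an AttributeError: DataParallel only parallelizes
                           max_new_tokens=300)  # forward(), so attributes like .config / .generate are not
                                                # forwarded (they live on dp_model.module)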
I also tried the fine-tuning evaluation route, Trainer.predict() with predict_with_generate=True, but there I’m not able to set generation parameters like temperature, and the results are different from what model.generate() gives me.
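In case the setup matters, this is roughly the Trainer route I tried (reconstructed from memory; as far as I know predict_with_generate comes from Seq2SeqTrainingArguments, and eval_dataset here stands for my tokenized dataset):

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir='tmp_eval',
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    fp16=True,
)
trainer = Seq2SeqTrainer(model=model, args=training_args, tokenizer=tokenizer)
predictions = trainer.predict(eval_dataset)   # I couldn't find where to pass temperature / top_p / top_k here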
So I’m doing it the naive way: split my data into 4 shards, run each shard on one GPU, and use a for loop to batch the data within each shard (rough sketch below). This is actually much faster than model parallelism (device_map='auto').
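Each of the 4 processes runs something like this (launched separately per GPU; llama_model_id and data_processed are the same names as in the snippet above, and batch_size is a placeholder for whatever batch size I use):

# launched once per GPU, e.g.
#   CUDA_VISIBLE_DEVICES=0 python infer.py --shard 0 --num-shards 4
#   CUDA_VISIBLE_DEVICES=1 python infer.py --shard 1 --num-shards 4
#   ...
import argparse
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

parser = argparse.ArgumentParser()
parser.add_argument('--shard', type=int, required=True)
parser.add_argument('--num-shards', type=int, default=4)
args = parser.parse_args()

# every process loads its own full copy of the model
model = AutoModelForCausalLM.from_pretrained(llama_model_id, torch_dtype=torch.float16,
                                             load_in_4bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(llama_model_id)
model.eval()

my_split = data_processed["input"][args.shard::args.num_shards]   # this process's slice of the data
outputs = []
with torch.no_grad():
    for i in range(0, len(my_split), batch_size):                 # the for loop over batches
        batch = tokenizer(my_split[i:i + batch_size], return_tensors='pt',
                          add_special_tokens=False, padding=True).to('cuda')
        outputs.append(model.generate(**batch, max_new_tokens=300, temperature=0,
                                      top_p=0.95, top_k=5))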
However, the same model gets loaded 4 times from disk, memory usage explodes, and it’s extremely inconvenient. Is there a way to implement the simplest form of data parallelism with Hugging Face models? Any ideas would be appreciated!