How can I use multi-GPU inference for my LlamaForCausalLM model?

Hey, I’d like to use DDP-style inference to speed up my “LlamaForCausalLM” model. However, in the tutorials for HuggingFace’s “accelerate” package, the only related example I can find uses a Stable Diffusion model (with “DiffusionPipeline” from “diffusers”). I tried swapping the “DiffusionPipeline” for a “TextGenerationPipeline”, but it doesn’t seem to work.

The problem shows up when I try to do the equivalent operation and move the pipe to the PartialState’s device:

# tutorial example
import torch
from accelerate import PartialState
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
distributed_state = PartialState()
pipe.to(distributed_state.device)

# my code
from transformers import pipeline

pipe = pipeline(task='text-generation', model=my_model, ....)
distributed_state = PartialState()
pipe.to(distributed_state.device)

It raises an error saying that TextGenerationPipeline has no attribute “to”, and I don’t know what to do next to achieve my goal.

Hi @YalunHu, pipeline has a device argument that you can use. Otherwise, here’s an alternative script you can use. We will update the section about distributed inference soon to add more examples:

from accelerate import PartialState  # Can also be Accelerator or AcceleratorState
from accelerate.utils import gather_object
from transformers import AutoTokenizer, AutoModelForCausalLM
from tqdm import tqdm

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", low_cpu_mem_usage=True
)

prompts = [
    "I would like to",
    "hello how are you",
    "what is going on",
    "roses are red and",
    "welcome to the hotel",
]

distributed_state = PartialState()
model.to(distributed_state.device)

batch_size = 2
pad_to_multiple_of = 8 
tokenizer.pad_token = tokenizer.eos_token

# split the prompts into batches of batch_size
formatted_prompts = [
    prompts[i : i + batch_size] for i in range(0, len(prompts), batch_size)
]

# pad on the left so each prompt ends right where generation starts
padding_side_default = tokenizer.padding_side
tokenizer.padding_side = "left"

# tokenize each batch
tokenized_prompts = [
    tokenizer(formatted_prompt, padding=True, pad_to_multiple_of=pad_to_multiple_of, return_tensors="pt")
    for formatted_prompt in formatted_prompts
]

completions_per_process = []
with distributed_state.split_between_processes(tokenized_prompts, apply_padding=True) as batched_prompts:
    for batch in tqdm(batched_prompts, desc=f"Generating completions on device {distributed_state.device}"):
        # move the batch to the correct device
        batch = batch.to(distributed_state.device)
        outputs = model.generate(**batch, max_new_tokens=20)
        # remove the prompt tokens so only the newly generated tokens remain
        outputs = [output[len(prompt) :] for prompt, output in zip(batch["input_ids"], outputs)]
        generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        completions_per_process.extend(generated_text)
        
completions_gather = gather_object(completions_per_process)
# Drop duplicates produced by apply_padding in split_between_processes
completions = completions_gather[: len(prompts)]
# Reset tokenizer padding side
tokenizer.padding_side = padding_side_default
if distributed_state.is_main_process:
    print(completions)
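
In case the prompt-stripping line in the script looks odd, here is a minimal toy sketch of why it works once the tokenizer pads on the left. It uses no model at all: the token ids are made-up integers, and left_pad and fake_generate are hypothetical stand-ins (not transformers APIs) for the tokenizer padding and for model.generate, assuming generate returns each padded prompt followed by its new tokens.

```python
# Toy illustration only: token ids are made-up integers, and left_pad /
# fake_generate are hypothetical stand-ins, not transformers functions.
PAD = 0


def left_pad(rows, multiple_of=1):
    """Left-pad every row to a common width, rounded up to multiple_of."""
    width = max(len(row) for row in rows)
    width = -(-width // multiple_of) * multiple_of  # ceiling to a multiple
    return [[PAD] * (width - len(row)) + row for row in rows]


def fake_generate(input_ids, max_new_tokens):
    """Return each padded prompt followed by max_new_tokens dummy tokens."""
    new_tokens = [100 + i for i in range(max_new_tokens)]
    return [row + new_tokens for row in input_ids]


def strip_prompts(input_ids, outputs):
    """Same slicing as in the script: drop the padded prompt prefix."""
    return [out[len(prompt):] for prompt, out in zip(input_ids, outputs)]


prompts = [[11, 12, 13], [21, 22]]           # two prompts of different length
input_ids = left_pad(prompts, multiple_of=8)  # both rows now have width 8
outputs = fake_generate(input_ids, max_new_tokens=3)
new_tokens = strip_prompts(input_ids, outputs)
print(new_tokens)
```

Because every row in the padded batch has the same length, slicing each output at len(prompt) cuts exactly at the boundary between the prompt and the continuation, whatever the original prompt length was.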