Hey, I’d like to use DDP-style inference to accelerate my “LlamaForCausalLM” model’s inference speed. However, in the tutorials for HuggingFace’s “accelerate” package I only found a related tutorial that uses a Stable Diffusion model (it uses “DiffusionPipeline” from “diffusers”) as the example. I tried to change the “DiffusionPipeline” to a “TextGenerationPipeline”, but it doesn’t seem to work when I try the equivalent operation of moving the pipeline to the PartialState’s device:
# tutorial example
import torch
from accelerate import PartialState
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
distributed_state = PartialState()
pipe.to(distributed_state.device)
# my code
from accelerate import PartialState
from transformers import pipeline

pipe = pipeline(task='text-generation', model=my_model, ....)
distributed_state = PartialState()
pipe.to(distributed_state.device)
It raises an error saying that “TextGenerationPipeline” has no attribute “to”, and I don’t know what I can do next to achieve my goal.
Hi @YalunHu, pipeline has a device argument that you can use.
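For example, here is a minimal sketch (the model name and task are just placeholders for your own setup):

from accelerate import PartialState
from transformers import pipeline

distributed_state = PartialState()
# passing device places each process's copy of the model on that process's GPU,
# so there is no need to call .to() on the pipeline afterwards
pipe = pipeline(task="text-generation", model="meta-llama/Llama-2-7b-hf", device=distributed_state.device)

Otherwise, here’s an alternative script that you can use. We will update the section about distributed inference soon to add more examples: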
from accelerate import PartialState # Can also be Accelerator or AcceleratorState
from accelerate.utils import gather_object
from transformers import AutoTokenizer, AutoModelForCausalLM
from tqdm import tqdm
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", low_cpu_mem_usage=True
)
prompts = [
"I would like to",
"hello how are you",
"what is going on",
"roses are red and",
"welcome to the hotel",
]
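# PartialState() reads the process rank and device from the environment set by the launcher (e.g. accelerate launch)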
distributed_state = PartialState()
model.to(distributed_state.device)
batch_size = 2
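# pad sequence lengths up to a multiple of 8, which is friendlier to GPU tensor cores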
pad_to_multiple_of = 8
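# Llama's tokenizer has no pad token by default, so reuse the EOS token for padding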
tokenizer.pad_token = tokenizer.eos_token
# split the prompts into batches
formatted_prompts = [
    prompts[i : i + batch_size] for i in range(0, len(prompts), batch_size)
]
padding_side_default = tokenizer.padding_side
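# decoder-only models must be padded on the left so generation continues directly from the prompt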
tokenizer.padding_side = "left"
# tokenize each batch
tokenized_prompts = [
    tokenizer(formatted_prompt, padding=True, pad_to_multiple_of=pad_to_multiple_of, return_tensors="pt")
    for formatted_prompt in formatted_prompts
]
completions_per_process = []
with distributed_state.split_between_processes(tokenized_prompts, apply_padding=True) as batched_prompts:
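    # apply_padding=True duplicates the last items so every process receives the same number of batches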
    for batch in tqdm(batched_prompts, desc=f"Generating completions on device {distributed_state.device}"):
        # move the batch to the correct device
        batch = batch.to(distributed_state.device)
        outputs = model.generate(**batch, max_new_tokens=20)
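        # generate() returns prompt + completion tokens; slice off the prompt part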
        outputs = [output[len(prompt) :] for prompt, output in zip(batch["input_ids"], outputs)]
        generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        completions_per_process.extend(generated_text)
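# gather_object collects the (picklable) string completions from every process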
completions_gather = gather_object(completions_per_process)
# Drop duplicates produced by apply_padding in split_between_processes
completions = completions_gather[: len(prompts)]
# Reset tokenizer padding side
tokenizer.padding_side = padding_side_default
if distributed_state.is_main_process:
    print(completions)
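To run the script across multiple GPUs, launch it with accelerate, e.g. `accelerate launch --num_processes 2 my_script.py` (one process per GPU; the script name is just a placeholder).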