Inference workflow in compile mode using transformers.pipeline()

Hi,

I am trying to run an inference workflow for a Llama model in compile mode using transformers.pipeline(). I am using the following code:

import torch
import transformers
from transformers import AutoTokenizer, LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    use_cache=True,
    device_map="auto",
)
# truncation/padding/max_length/return_tensors are call-time arguments,
# not from_pretrained() arguments, so I dropped them here.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = torch.compile(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    model_kwargs={"torch_dtype": torch.bfloat16},
    tokenizer=tokenizer,
    device_map="auto",
)

generation_config = {
    "num_beams": 1,
    "max_new_tokens": 32,
    "do_sample": True,
    "use_cache": True,
}

input_prompt = "Write a short story about a robot."  # example prompt for illustration
outputs = pipeline(input_prompt, **generation_config)

I expected torch.compile() to compile the model and print some compilation messages, but I am not seeing any. The script runs without errors; there are just no compilation messages.
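
From what I understand, torch.compile() compiles lazily on the first forward call and does not print anything by default, so possibly the logs just need to be enabled explicitly? A minimal sketch of what I mean, assuming PyTorch >= 2.1 and its torch._logging API:

import logging
import torch

# Dynamo/Inductor are silent by default; enable their logs to see
# whether compilation actually happens (PyTorch >= 2.1).
torch._logging.set_logs(dynamo=logging.INFO, graph_breaks=True)

# Alternatively: set TORCH_LOGS="dynamo,graph_breaks" in the environment
# before launching the script.

@torch.compile
def probe(x):
    return torch.sin(x) + x

probe(torch.randn(4))  # first call should now emit compilation log lines

With this enabled, should the first pipeline() call emit Dynamo log lines if compilation is really happening?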

Could you please point out what might be wrong in the code? How can I run an inference workflow for a Llama model in compile mode using transformers.pipeline()?
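
For completeness, one variant I am considering is compiling only the forward method, so that the object handed to transformers.pipeline() is still a regular PreTrainedModel rather than the OptimizedModule wrapper that torch.compile() returns. A sketch (untested on the 70B checkpoint; tokenizer, input_prompt, and generation_config as defined above):

model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Compile only the forward pass; the model object itself stays a
# PreTrainedModel, which pipeline() knows how to handle.
model.forward = torch.compile(model.forward)

pipe = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
outputs = pipe(input_prompt, **generation_config)

Would that be the recommended pattern here, or should wrapping the whole model work with pipeline() as well?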

Thanks