Loading half precision Pipeline

Hi,

I don’t think that’s the recommended way. The recommended way is to pass torch_dtype directly when creating the pipeline, as follows:

from transformers import pipeline
import torch

pipe = pipeline(model="gpt2", torch_dtype=torch.float16)
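You can then call the pipeline as usual. A minimal usage sketch, continuing from the snippet above (the prompt and max_new_tokens value are just placeholders):

out = pipe("Hello, I'm a language model,", max_new_tokens=20)
print(out[0]["generated_text"])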

In case you want to leverage 4-bit or 8-bit quantization, you can do that as follows:

from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch

# note that 4-bit quantization requires at least one GPU and the bitsandbytes library to be installed
model = AutoModelForCausalLM.from_pretrained("gpt2", load_in_4bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
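As a side note, recent versions of transformers also let you express the same thing with an explicit BitsAndBytesConfig, which gives you more control over the quantization settings. A sketch, assuming a transformers version with bitsandbytes support (the compute dtype here is just an example choice):

from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# same 4-bit setup as above, but via an explicit quantization config
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained("gpt2", quantization_config=quant_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer)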