Loading half precision Pipeline

I am using Pipeline for text generation and would like to use a half-precision model to save GPU memory. I searched the web and found people suggesting this:

gen = pipeline('text-generation', model=m_path, device=0)
gen.model.half()

The problem with this solution is that pipeline() loads the model onto the GPU in full precision first. So if there is not enough GPU memory, PyTorch raises an out-of-memory exception, and execution never reaches the next line that halves the model.
What is the right way to use a half-precision model in a pipeline that avoids this problem?
I was thinking of loading the pipeline into CPU, halving the model, and then moving it to GPU. Is that the right way of doing it? If yes, could you give me a piece of code to do it? Thanks.
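Edit: in plain torch terms, the pattern I have in mind is something like this (a toy module standing in for the real model, untested on an actual pipeline):

```python
import torch

# Toy stand-in for the real language model; a pipeline would expose it
# as gen.model (names here are illustrative, not tested end to end)
model = torch.nn.Linear(8, 8)   # created on CPU in float32
model = model.half()            # convert weights to float16 while still on CPU
if torch.cuda.is_available():
    model = model.to("cuda:0")  # only then move the smaller fp16 model to GPU
print(model.weight.dtype)       # torch.float16
```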


I don’t think that’s the recommended way. The recommended way is to pass the torch_dtype as follows:

from transformers import pipeline
import torch

pipe = pipeline(model="gpt2", torch_dtype=torch.float16)
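With torch_dtype=torch.float16, the weights are loaded directly in half precision, so a full fp32 copy never has to fit on the GPU. As a quick sanity check that float16 really uses half the bytes per element:

```python
import torch

fp32 = torch.zeros(1024, dtype=torch.float32)
fp16 = torch.zeros(1024, dtype=torch.float16)
print(fp32.element_size(), fp16.element_size())  # 4 2
```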

In case you want to leverage 4-bit or 8-bit quantization, you can do that as follows:

from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch

# note that 4bit requires at least one GPU to be available
model = AutoModelForCausalLM.from_pretrained("gpt2", load_in_4bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
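For a rough sense of what quantization buys you: bytes per parameter drop from 2 (fp16) to 1 (8-bit) to 0.5 (4-bit), so the weights of a hypothetical 7B-parameter model shrink from roughly 13 GiB to about 3.3 GiB (ignoring activations and quantization overhead). Back-of-envelope:

```python
# Back-of-envelope weight memory for a hypothetical 7B-parameter model
params = 7_000_000_000
for name, bits in [("fp32", 32), ("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gib = params * bits / 8 / 1024**3
    print(f"{name}: {gib:.2f} GiB")
```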