Loading a half-precision Pipeline

I am using Pipeline for text generation. I’d like to use a half-precision model to save GPU memory. I searched the web and found people saying we can do this:

gen = pipeline('text-generation', model=m_path, device=0)
gen.model.half()

The problem with this solution is that pipeline() already tries to load the full-precision model onto the GPU. If there is not enough GPU memory, PyTorch raises an out-of-memory exception, so we never reach the next line that halves the model.
What is the right way to use a half-precision model in a pipeline so that this problem doesn’t occur?
I was thinking of loading the pipeline on the CPU, converting the model to half precision, and then moving it to the GPU. Is that the right way to do it? If so, could you give me a piece of code to do it?
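For concreteness, this is roughly what I have in mind (untested sketch; the last line is my guess at how to keep the pipeline’s cached device in sync):

gen = pipeline('text-generation', model=m_path, device=-1)  # device=-1 keeps everything on the CPU
gen.model.half()                     # convert the weights to fp16 while still on the CPU
gen.model.to('cuda:0')               # move the halved model to the GPU
gen.device = torch.device('cuda:0')  # my guess: the pipeline also caches the target device

Thanks.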

Hi,

I don’t think that’s the recommended way. The recommended way is to pass the torch_dtype argument as follows:

from transformers import pipeline
import torch

pipe = pipeline(model="gpt2", torch_dtype=torch.float16, device=0)  # weights are loaded in fp16 from the start, so fp32 weights never touch the GPU

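The pipeline can then be called as usual; the prompt and generation settings below are just placeholders:

out = pipe("Hello, I'm a language model,", max_new_tokens=20)
print(out[0]["generated_text"])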
If you want to leverage 4-bit or 8-bit quantization instead, you can do that as follows:

from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch

# note that 4-bit loading requires the bitsandbytes library and at least one GPU
model = AutoModelForCausalLM.from_pretrained("gpt2", load_in_4bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
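As a side note, recent versions of transformers favor passing an explicit BitsAndBytesConfig rather than the bare load_in_4bit flag; here is a sketch of the same example (the compute dtype choice is just an assumption):

from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# quantization settings are bundled into a config object
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained("gpt2", quantization_config=quant_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer)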