Loading half precision Pipeline

Hi,

I don’t think that’s the recommended way. The recommended way is to pass torch_dtype directly when creating the pipeline, as follows:

from transformers import pipeline
import torch

pipe = pipeline(model="gpt2", torch_dtype=torch.float16)
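You can then call the pipeline as usual. A minimal usage sketch, continuing from the snippet above (the prompt and max_new_tokens value are just placeholders):

out = pipe("Hello, I'm a language model,", max_new_tokens=20)
print(out[0]["generated_text"])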

In case you want to leverage 4-bit or 8-bit quantization, you can do that as follows:

from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch

# note that 4-bit quantization requires at least one GPU and the bitsandbytes library to be installed
model = AutoModelForCausalLM.from_pretrained("gpt2", load_in_4bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
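As a side note, recent versions of transformers also let you express the same thing with an explicit BitsAndBytesConfig, which gives you more control over the quantization settings. A sketch, assuming a transformers version with bitsandbytes support (the compute dtype here is just an example choice):

from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# same 4-bit setup as above, but via an explicit quantization config
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained("gpt2", quantization_config=quant_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer)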