I used the
timeit module to test the difference between including and excluding the
device=0 argument when instantiating a pipeline for
gpt2, and found an enormous performance benefit from adding
device=0: over 50 repetitions, the best time with
device=0 was 184 seconds, while without it the development node I was working on killed my process after only 3 repetitions.
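For reference, the benchmark pattern I used looks roughly like this sketch. The `load_pipeline` function here is a placeholder stand-in; in the real benchmark it was `lambda: pipeline('text-generation', model='gpt2', device=0)` (which needs a GPU and the transformers library installed):

```python
import timeit

def load_pipeline():
    # Placeholder for the expensive pipeline construction; swap in
    # pipeline('text-generation', model='gpt2', device=0) to reproduce.
    return sum(range(1000))

# number=1 times a single load per repetition; repeat gives independent runs.
times = timeit.repeat(load_pipeline, repeat=5, number=1)
best = min(times)
```

Taking the minimum over repetitions is the usual way to reduce noise from other processes on the node.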
However, it looks like this performance benefit does not uniformly apply across different models. I am using the following code to load an LLM as a pipeline object:
from transformers import pipeline
pipe = pipeline('text-generation', model='MetaIX/GPT4-X-Alpasta-30b', device=0)
Even with device=0, my process gets killed before a single load completes (processes on this node are killed when they reach 2 hours of CPU time).
What else can I do to speed up the loading of LLMs into pipeline objects?
The MetaIX/GPT4-X-Alpasta-30b model is already downloaded locally, and this still happens.
Here’s what I figured out in case it’s helpful to anyone else.
It seems like the model I chose was just substantially larger and required a lot more memory than the
gpt2 model. To get both model loading and inference to work without OOM errors, I used the following code to generate text for a given prompt:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
checkpoint = 'MetaIX/GPT4-X-Alpasta-30b'
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map='auto', load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Generate predictions
inputs = tokenizer(prompt, return_tensors='pt')
inputs = inputs.to(0)  # move input tensors to GPU 0
output = model.generate(inputs['input_ids'], max_new_tokens=500)
# generate() returns a batch; decode the first (only) sequence
response = tokenizer.decode(output[0], skip_special_tokens=True)
The key for me was the
load_in_8bit parameter. While I could load the model just fine without it, I would get CUDA OOM errors when it came time to use
model.generate() for inference. Inference is still pretty slow, but at least it doesn’t get killed!
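For context, a rough back-of-the-envelope estimate (my own arithmetic, weights only; activations and the KV cache add more on top) shows why 8-bit loading matters so much for a 30B-parameter model:

```python
# Approximate GPU memory needed just to hold the weights of a
# 30B-parameter model at different precisions.
n_params = 30_000_000_000

fp32_gb = n_params * 4 / 1e9  # 4 bytes per weight
fp16_gb = n_params * 2 / 1e9  # 2 bytes per weight
int8_gb = n_params * 1 / 1e9  # 1 byte per weight

print(fp32_gb, fp16_gb, int8_gb)  # 120.0 60.0 30.0
```

At float16 the weights alone are around 60 GB, which already exceeds most single GPUs; int8 halves that, which is presumably why `load_in_8bit=True` made the difference between OOM and a working `model.generate()`.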