I am currently using LLaVA for inference and I was wondering if there was a way to avoid reloading the checkpoint shards every time I predict a new sample. For reference, I am closely following the doc and the colab here: llava-hf/llava-1.5-7b-hf · Hugging Face.
Provide us with a snippet of your code; otherwise, I can only say: don't call LlavaForConditionalGeneration.from_pretrained every time you run inference.
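A minimal sketch of the load-once pattern (assuming the llava-hf/llava-1.5-7b-hf checkpoint and the USER/ASSISTANT prompt template from its model card; the predict helper is only illustrative):

from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"

# Load the processor and the checkpoint shards once, at startup.
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def predict(image_path, text_prompt):
    # Reuses the already-loaded model; nothing is reloaded per call.
    image = Image.open(image_path)
    prompt = f"USER: <image>\n{text_prompt} ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(
        model.device, torch.float16
    )
    output = model.generate(**inputs, max_new_tokens=200)
    return processor.decode(output[0], skip_special_tokens=True)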
I’m using the pipeline from Hugging Face, so my code is super simple:
from PIL import Image
from transformers import pipeline

# model_id, quantization_config, image_path, and text_prompt are defined earlier
image = Image.open(image_path)
pipe = pipeline(
    "image-to-text",
    model=model_id,
    model_kwargs={"quantization_config": quantization_config},
)
output = pipe(
    image, prompt=text_prompt, generate_kwargs={"max_new_tokens": 200}
)
Should I use the transformers library directly instead, to have more control?
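For reference, this is roughly the reuse pattern I have in mind if I keep the pipeline: construct it once, then call it per sample (model_id, quantization_config, and text_prompt as above; image_paths is a hypothetical list of inputs):

from PIL import Image
from transformers import pipeline

# Build the pipeline once; the checkpoint shards load here, not per sample.
pipe = pipeline(
    "image-to-text",
    model=model_id,
    model_kwargs={"quantization_config": quantization_config},
)

for image_path in image_paths:
    image = Image.open(image_path)
    output = pipe(
        image, prompt=text_prompt, generate_kwargs={"max_new_tokens": 200}
    )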