Avoid loading checkpoint shards for each inference

I am currently using LLaVA for inference and I was wondering if there was a way to avoid reloading the checkpoint shards every time I predict a new sample. For reference, I am closely following the doc and the colab here: llava-hf/llava-1.5-7b-hf · Hugging Face.

Please provide a snippet of your code.

Otherwise, I can only say: don't call the function that loads the model every time you run an inference.
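Without seeing the code, here is only a generic sketch of the load-once pattern. `load_model`, `get_model` and `predict` are hypothetical names, and `load_model` is a stand-in for whatever expensive call loads the checkpoint in your code:

```python
from functools import lru_cache

LOAD_COUNT = 0  # counts how often the expensive load actually runs


def load_model():
    """Hypothetical stand-in for the expensive loading call."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    # A real loader would return the model or pipeline object here.
    return lambda sample: f"prediction for {sample}"


@lru_cache(maxsize=1)
def get_model():
    # The first call loads the model; later calls return the cached object.
    return load_model()


def predict(sample):
    # Inference reuses the cached model instead of reloading it.
    return get_model()(sample)


outputs = [predict(s) for s in ["a.jpg", "b.jpg", "c.jpg"]]
```

The cache only lives as long as the Python process, so this helps when one process predicts many samples, not when the script is restarted for each sample.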


I’m using the pipeline from Hugging Face, so my code is super simple:

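        image = Image.open(image_path)

        # Building the pipeline is the step that loads the checkpoint shards.
        pipe = pipeline(
            "image-to-text",  # task and model ID as in the linked model card
            model="llava-hf/llava-1.5-7b-hf",
            model_kwargs={"quantization_config": quantization_config},
        )

        output = pipe(
            image, prompt=text_prompt, generate_kwargs={"max_new_tokens": 200}
        )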

Should I use the plain transformers library instead to have more control?
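Switching libraries is probably not necessary: the shards are loaded when `pipeline(...)` is constructed, so the pipeline object can be built once and reused for every sample. Below is a minimal sketch of that split. The helper names `build_pipe` and `predict_many` are made up for illustration, the 4-bit `BitsAndBytesConfig` settings are an assumption (the thread does not show how `quantization_config` is defined), and the `prompt=` argument matches the `image-to-text` pipeline used in the linked model card, which may differ in newer transformers releases.

```python
def build_pipe(model_id="llava-hf/llava-1.5-7b-hf"):
    """Load the checkpoint shards. Call this once per process."""
    # Imported here so the heavy dependencies are only needed when loading.
    import torch
    from transformers import BitsAndBytesConfig, pipeline

    # Assumed quantization settings; replace with your own config.
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    )
    return pipeline(
        "image-to-text",
        model=model_id,
        model_kwargs={"quantization_config": quantization_config},
    )


def predict_many(pipe, images, text_prompt, max_new_tokens=200):
    """Run an already-built pipe on each image; nothing is loaded here."""
    return [
        pipe(
            image,
            prompt=text_prompt,
            generate_kwargs={"max_new_tokens": max_new_tokens},
        )
        for image in images
    ]


# Intended usage (not executed here, since it downloads the model):
#   from PIL import Image
#   pipe = build_pipe()                       # shards load once
#   images = [Image.open(p) for p in image_paths]
#   outputs = predict_many(pipe, images, text_prompt)
```

If the predictions happen inside a class or a web handler, the same idea applies: keep `pipe` as an attribute or module-level object that is created at startup, and pass only the image and prompt on each call.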