Hello, I have downloaded the model to my local computer in hopes that it would help me avoid the dreadfully slow loading process. Sadly, it didn't work as intended with the demo code. Is that possible, and if so, how can I adapt the code to do it?
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
torch.cuda.set_per_process_memory_fraction(1.0)
tokenizer = T5Tokenizer.from_pretrained("LOCAL_PATH")
model = T5ForConditionalGeneration.from_pretrained("LOCAL_PATH", device_map="auto")
input_text = "INPUT"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
Split your code into two parts. Load the model once in a Jupyter notebook cell, and run the generation in a separate cell. This way, you load the model only once, speeding up the process.
First cell (run once):
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
torch.cuda.set_per_process_memory_fraction(1.0)
tokenizer = T5Tokenizer.from_pretrained("LOCAL_PATH")
model = T5ForConditionalGeneration.from_pretrained("LOCAL_PATH", device_map="auto")
Second cell (run as needed):
input_text = "Your input text"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
Is there a way to do it without using notebooks?
You can load the model into a local server, for example with Flask.
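A minimal sketch of what that could look like, assuming the same T5 setup as in the question; the /generate route, the port, and "LOCAL_PATH" are placeholders, not from this thread. The model is loaded once when the server process starts and reused for every request:

from flask import Flask, request, jsonify
from transformers import T5Tokenizer, T5ForConditionalGeneration

app = Flask(__name__)

# Loaded once at startup, kept in memory for the lifetime of the process
tokenizer = T5Tokenizer.from_pretrained("LOCAL_PATH")
model = T5ForConditionalGeneration.from_pretrained("LOCAL_PATH", device_map="auto")

@app.route("/generate", methods=["POST"])
def generate():
    # Expects a JSON body like {"text": "your input"}
    input_text = request.json["text"]
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(input_ids)
    return jsonify({"output": tokenizer.decode(outputs[0])})

if __name__ == "__main__":
    app.run(port=8000)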
You can use FastAPI.
This is the code that works for me:
from contextlib import asynccontextmanager
from fastapi import FastAPI

# Service is my own class (defined elsewhere) that wraps model loading and search
service = Service()

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Note: this will only be called once, at startup
    service.load_model()
    yield

app = FastAPI(lifespan=lifespan)

@app.get("/")
async def root():
    return {"message": "ColPali Search API", "docs": "/docs", "health": "/health"}

@app.post("/search")
async def search(query: str):
    response = service.search(query=query)
    return {"response": response}