Inference with Hugging Face pipeline runs on CPU even though the model is loaded on GPU


I am building a chatbot with the fastchat-t5-3b-v1.0 LLM and want to reduce my inference time.

I load the entire model onto the GPU using the device_map parameter and query it through a Hugging Face pipeline. I also pass device=0 (the first GPU) to the pipeline.
I monitor GPU and CPU usage throughout the entire execution, and although the model is on the GPU, CPU usage spikes whenever I query the model.
This spike suggests that query execution is happening on the CPU.
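
Usage monitors only show the symptom, so one way to cross-check where the weights actually live is to count parameters per device once the model from the code below has been loaded. This is a minimal sketch, not part of my original script: count_params_by_device is a hypothetical helper name, and it relies only on the standard PyTorch parameters(), .device and numel() calls.

```python
from collections import Counter


def count_params_by_device(model):
    """Return {device string: number of parameter elements} for a PyTorch-style model.

    Hypothetical helper: it only needs model.parameters(), p.device and p.numel().
    """
    totals = Counter()
    for p in model.parameters():
        # str() turns a torch.device such as device('cuda:0') into 'cuda:0'
        totals[str(p.device)] += p.numel()
    return dict(totals)
```

Calling print(count_params_by_device(model)) after loading should list every device that holds weights; a model loaded with device_map also exposes model.hf_device_map, and pipe.device shows where the pipeline sends its input tensors.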

Below is the code I am using to run inference with the FastChat LLM.

import torch  # needed for torch.float16 below
from llama_index import SimpleDirectoryReader, GPTVectorStoreIndex, PromptHelper, LLMPredictor
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding, ServiceContext
from transformers import T5Config, T5Tokenizer, T5ForConditionalGeneration
from accelerate import init_empty_weights, infer_auto_device_map

model_name = 'lmsys/fastchat-t5-3b-v1.0'

config = T5Config.from_pretrained(model_name)
with init_empty_weights():
    model_layer = T5ForConditionalGeneration(config=config)

device_map = infer_auto_device_map(model_layer, max_memory={0: "12GiB", 1: "12GiB", "cpu": "0GiB"}, no_split_module_classes=["T5Block"])

# device_map evaluates to {'': 0}, i.e. the entire model is placed on the first GPU

model = T5ForConditionalGeneration.from_pretrained(model_name, torch_dtype=torch.float16, device_map=device_map, offload_folder="offload", offload_state_dict=True)

tokenizer = T5Tokenizer.from_pretrained(model_name)

from transformers import pipeline

pipe = pipeline(
    "text2text-generation", model=model, tokenizer=tokenizer, device=0,
    max_length=1536, temperature=0, top_p=1, num_beams=1, early_stopping=False,
)

from langchain.llms import HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=pipe)

embed_model = LangchainEmbedding(HuggingFaceEmbeddings())

# set maximum input size
max_input_size = 2048
# set number of output tokens
num_outputs = 512
# set maximum chunk overlap
max_chunk_overlap = 20
# set chunk size limit
chunk_size_limit = 300
prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap)

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm_predictor=LLMPredictor(llm), prompt_helper=prompt_helper, chunk_size_limit=chunk_size_limit)

# build index
documents = SimpleDirectoryReader('data').load_data()

new_index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = new_index.as_query_engine()

response = query_engine.query("sample query question?")

Here the “data” folder holds my full input text in PDF format. I use GPTVectorStoreIndex and the Hugging Face pipeline to build an index over it, fetch the relevant chunks, build a prompt from that context, and query the FastChat model, as shown in the code.
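
Because a query passes through several stages (embedding, retrieval, prompt building, generation), timing each call separately may help show which one coincides with the CPU spike. The following is a minimal sketch using only the standard library; timed and timings are hypothetical names that are not part of my original script.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per label (hypothetical module-level store)
timings = {}


@contextmanager
def timed(label, store=None):
    """Add the wall-clock time spent inside the with-block to store[label]."""
    target = timings if store is None else store
    start = time.perf_counter()
    try:
        yield
    finally:
        target[label] = target.get(label, 0.0) + (time.perf_counter() - start)
```

For example, wrapping the index build in with timed("build index"): and the query call in with timed("query"):, then printing timings, separates the cost of the two steps.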

Please have a look and let me know whether this is the expected behaviour.
How can I make use of the GPU for query execution as well, to reduce the inference response time?