Hello all,
Here is how I create a HuggingFacePipeline for the Llama3 model and use it with the ChatHuggingFace wrapper:
```python
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline
import torch

llm = HuggingFacePipeline.from_model_id(
    model_id=model_path,
    task="text-generation",
    model_kwargs={"torch_dtype": torch.bfloat16, "device_map": "auto"},
    device=None,
    batch_size=32,
    pipeline_kwargs=dict(
        max_new_tokens=12000,
        temperature=0.6,
    ),
)
model = ChatHuggingFace(llm=llm)
```
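For context, here is roughly how I make the calls afterwards (simplified sketch; the real system/user prompts are different and there are far more of them):

```python
# Simplified sketch of the downstream calls -- actual prompts differ
from langchain_core.messages import HumanMessage, SystemMessage

prompts = ["..."] * 3  # in practice this is several hundred prompts
results = []
for p in prompts:
    messages = [
        SystemMessage(content="You are a helpful assistant."),
        HumanMessage(content=p),
    ]
    # one .invoke() per prompt, i.e. the pipeline sees a single input each time
    results.append(model.invoke(messages))
```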
The model is spread across my 2 GPUs, and both still have plenty of free memory. Increasing the batch size doesn't change GPU memory utilization at all, which makes me suspect the pipeline isn't actually running in batch mode. How can I speed up inference, given that I'll need to make on the order of several hundred LLM calls?
Any tips will be highly appreciated!
Thanks.