I have deployed a llama2-70b-chat-hf endpoint in SageMaker using the 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04 image on an ml.g5.48xlarge instance.
We also have an endpoint hosted on Hugging Face Inference Endpoints on a p4de with 160 GB of GPU RAM. I have verified that the startup parameters and the inference parameters match between the two.
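For reference, the SageMaker side was deployed roughly like the sketch below. The role ARN, endpoint name, and env values are placeholders standing in for my actual startup parameters (which, again, mirror the Hugging Face endpoint's settings):

```python
from sagemaker.model import Model

# Placeholder role ARN -- not my real one.
role = "arn:aws:iam::123456789012:role/MySageMakerRole"

model = Model(
    image_uri=(
        "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
        "huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04"
    ),
    role=role,
    env={
        # TGI startup parameters -- the values here are illustrative placeholders.
        "HF_MODEL_ID": "meta-llama/Llama-2-70b-chat-hf",
        "HUGGING_FACE_HUB_TOKEN": "<hf_token>",  # placeholder, needed for gated Llama 2 weights
        "SM_NUM_GPUS": "8",                      # ml.g5.48xlarge has 8 x A10G
        "MAX_INPUT_LENGTH": "3072",              # placeholder, same value on both endpoints
        "MAX_TOTAL_TOKENS": "4096",              # placeholder, same value on both endpoints
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",
    endpoint_name="llama2-70b-chat-tgi",  # placeholder name
)
```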
However, with the exact same prompt we get slightly different responses from the two endpoints, and the difference is enough to matter for our use case.
I can't think of where else to check:
- Container (same)
- Startup parameters (same)
- Inference parameters (same; see the request sketch after this list)
- Hardware (not the same)
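For completeness, this is roughly how I call both endpoints with the same payload. The sampling parameter values, endpoint name, URL, and token below are placeholders; the real values are the ones I have already verified to be identical on both sides.

```python
import json

import boto3
import requests

# Identical payload sent to both endpoints -- the parameter values shown are
# placeholders, not my actual settings.
payload = {
    "inputs": "<same exact prompt>",
    "parameters": {
        "max_new_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.9,
        "do_sample": True,
    },
}

# SageMaker TGI endpoint (ml.g5.48xlarge)
smr = boto3.client("sagemaker-runtime", region_name="us-east-1")
sm_response = smr.invoke_endpoint(
    EndpointName="llama2-70b-chat-tgi",  # placeholder name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(sm_response["Body"].read()))

# Hugging Face Inference Endpoint (p4de)
hf_response = requests.post(
    "https://<my-endpoint>.endpoints.huggingface.cloud",  # placeholder URL
    headers={
        "Authorization": "Bearer <hf_token>",  # placeholder token
        "Content-Type": "application/json",
    },
    json=payload,
)
print(hf_response.json())
```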
Is it conceivable that the hardware is having an influence? Could tensor parallelization possibly account for the difference? I'm not sure where else to look.
Any insight or ideas would be greatly appreciated. Thank you.