Different results with the same model hosted on Hugging Face and on SageMaker

I have deployed a llama2-70b-chat-hf endpoint in SageMaker with the 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04 image on an ml.g5.48xlarge.
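For reference, here is a minimal sketch of how that SageMaker deployment looks, assuming the `sagemaker` Python SDK. The image URI and instance type are the ones above; the role, endpoint name, and the startup environment variables (`SM_NUM_GPUS`, `MAX_INPUT_LENGTH`, `MAX_TOTAL_TOKENS`) are illustrative placeholders, not my actual values:

```python
# Sketch of the SageMaker deployment (sagemaker Python SDK).
# Role, endpoint name, and env values are hypothetical placeholders.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # hypothetical: your SageMaker execution role

model = HuggingFaceModel(
    role=role,
    image_uri=(
        "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
        "huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04"
    ),
    env={
        "HF_MODEL_ID": "meta-llama/Llama-2-70b-chat-hf",
        "SM_NUM_GPUS": "8",           # ml.g5.48xlarge has 8x A10G
        "MAX_INPUT_LENGTH": "3072",   # illustrative startup parameters
        "MAX_TOTAL_TOKENS": "4096",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",
    endpoint_name="llama2-70b-chat-tgi",  # hypothetical name
)
```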

We also have an endpoint hosted on Hugging Face (Inference Endpoints) on a p4de instance with 160 GB of GPU RAM. I have verified that the startup parameters and the inference parameters match between the two.

However, with the exact same prompt we get slightly different responses - enough that it matters for our use case.
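This is roughly how I compare the two: the same payload, with greedy decoding, sent to both endpoints. The endpoint name, URL, and token are hypothetical placeholders:

```python
# Sketch of the comparison: identical payload to both endpoints.
# Endpoint name, URL, and token are hypothetical placeholders.
import json
import boto3
import requests

payload = {
    "inputs": "<the exact same prompt used on both endpoints>",
    "parameters": {
        "do_sample": False,      # greedy decoding to rule out sampling noise
        "max_new_tokens": 256,
    },
}

# SageMaker endpoint
sm = boto3.client("sagemaker-runtime", region_name="us-east-1")
sm_response = sm.invoke_endpoint(
    EndpointName="llama2-70b-chat-tgi",   # hypothetical
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(sm_response["Body"].read()))

# Hugging Face Inference Endpoint
hf_response = requests.post(
    "https://<my-endpoint>.endpoints.huggingface.cloud",  # hypothetical URL
    headers={"Authorization": "Bearer hf_xxx"},
    json=payload,
)
print(hf_response.json())
```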

I can’t think of where else to check:

  1. Container (same)
  2. Startup parameters (same)
  3. Inference parameters (same)
  4. Hardware (not the same)

Is it conceivable that the hardware is having an influence? Could tensor parallelism possibly account for it? I'm not sure where to look.
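One thing I've been considering, as a sketch: TGI can return per-token details, so I could ask both endpoints for the generated tokens and their logprobs and see where they first diverge. If the first divergent token has near-tied logprobs, it seems plausible that small numerical differences from the hardware / tensor-parallel layout could flip it. The URL and token below are hypothetical:

```python
# Sketch: request per-token details from TGI's /generate route so the two
# endpoints' outputs can be compared token by token. URL/token are hypothetical.
import requests

payload = {
    "inputs": "<the exact same prompt>",
    "parameters": {
        "do_sample": False,
        "max_new_tokens": 256,
        "details": True,   # TGI returns generated tokens with logprobs
    },
}

resp = requests.post(
    "https://<my-endpoint>.endpoints.huggingface.cloud/generate",  # hypothetical
    headers={"Authorization": "Bearer hf_xxx"},
    json=payload,
).json()

# Print each generated token with its logprob to spot the first divergence.
for tok in resp["details"]["tokens"]:
    print(tok["id"], repr(tok["text"]), tok["logprob"])
```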

Any insight or ideas would be greatly appreciated. Thank you.

The only unaccounted-for source of variance I can think of is whether model.eval() was called, to rule out any kind of dropout/batch-norm behavior.