I have deployed a llama2-70b-chat-hf endpoint in SageMaker using the 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04 image on an ml.g5.48xlarge instance.
We also have an endpoint hosted on Hugging Face Inference Endpoints on a p4de with 160 GB of GPU RAM. I have verified that the startup parameters and the inference parameters match between the two.
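For reference, the SageMaker side was deployed roughly like the sketch below. The role ARN, endpoint name, and env values are placeholders standing in for my actual startup parameters (which, again, mirror the Hugging Face endpoint's settings):

```python
from sagemaker.model import Model

# Placeholder role ARN -- not my real one.
role = "arn:aws:iam::123456789012:role/MySageMakerRole"

model = Model(
    image_uri=(
        "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
        "huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04"
    ),
    role=role,
    env={
        # TGI startup parameters -- the values here are illustrative placeholders.
        "HF_MODEL_ID": "meta-llama/Llama-2-70b-chat-hf",
        "HUGGING_FACE_HUB_TOKEN": "<hf_token>",  # placeholder, needed for gated Llama 2 weights
        "SM_NUM_GPUS": "8",                      # ml.g5.48xlarge has 8 x A10G
        "MAX_INPUT_LENGTH": "3072",              # placeholder, same value on both endpoints
        "MAX_TOTAL_TOKENS": "4096",              # placeholder, same value on both endpoints
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",
    endpoint_name="llama2-70b-chat-tgi",  # placeholder name
)
```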
However, with the exact same prompt we get slightly different responses from the two endpoints, and the difference is enough to matter for our use case.
I can't think of where else to check:
- Container (same)
- Startup parameters (same)
- Inference parameters (same; see the request sketch after this list)
- Hardware (not the same)
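For completeness, this is roughly how I call both endpoints with the same payload. The sampling parameter values, endpoint name, URL, and token below are placeholders; the real values are the ones I have already verified to be identical on both sides.

```python
import json

import boto3
import requests

# Identical payload sent to both endpoints -- the parameter values shown are
# placeholders, not my actual settings.
payload = {
    "inputs": "<same exact prompt>",
    "parameters": {
        "max_new_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.9,
        "do_sample": True,
    },
}

# SageMaker TGI endpoint (ml.g5.48xlarge)
smr = boto3.client("sagemaker-runtime", region_name="us-east-1")
sm_response = smr.invoke_endpoint(
    EndpointName="llama2-70b-chat-tgi",  # placeholder name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(sm_response["Body"].read()))

# Hugging Face Inference Endpoint (p4de)
hf_response = requests.post(
    "https://<my-endpoint>.endpoints.huggingface.cloud",  # placeholder URL
    headers={
        "Authorization": "Bearer <hf_token>",  # placeholder token
        "Content-Type": "application/json",
    },
    json=payload,
)
print(hf_response.json())
```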
Is it conceivable that the hardware is having an influence? Could tensor parallelization possibly account for the difference? I'm not sure where else to look.
Any insight or ideas would be greatly appreciated. Thank you.