QLoRA-trained LLaMA 2 13B deployment error on SageMaker using the text generation inference image

@elodium I ended up building the 0.9.3 image from scratch (which was half a day of work between the actual compiling and figuring out some build configurations to keep the build from freezing or exhausting memory, even with 100 GB of RAM on EC2).

I ended up deploying the TGI 0.9.3 image I built on a g5.4xlarge, and it worked. The only issue was that even though I deployed the 4-bit QLoRA LLaMA 2 13B, generation was pretty slow, and it would often freeze on the SageMaker deployment or time out after 30s. That was super odd, because I was expecting the 4-bit quantized 13B to breeze through generation on a 4xlarge.

Going to try the official 0.9.3 image to see if there’s a difference.