Deploying Llama2 7B fine tuned model on inf2.xlarge

Hi there!

I am trying to deploy a fine-tuned model on an Inferentia2 instance. I did not train the model myself; the original model can be found at Irisjacobs/Llama-2-7b-chat-hf-Examify, and I have compiled it to LarsJacobs2003/Examify-Llama2-7B-NeuronCompiled-FP16. While trying to deploy the compiled model, I have run into two different issues:

  1. When deploying the model on an inf2.8xlarge instance, it does work, but I get very strange responses, often using different languages and symbols than the ones in the prompt. Could this be because the model was trained in BF16 but compiled in FP16? Is there a checklist of things to watch out for so that the model gets compiled correctly?

  2. When trying to deploy on an inf2.xlarge instance, it fails to start. These are the logs produced:

  • Downloading shards: 100%|██████████| 2/2 [00:48<00:00, 24.45s/it]

  • Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s] rank=0

  • 2024-05-08T13:07:00.917421Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 9 rank=0

  • 2024-05-08T13:07:00.955959Z ERROR text_generation_launcher: Shard 0 failed to start

  • 2024-05-08T13:07:00.955976Z INFO text_generation_launcher: Shutting down shards

  • Error: ShardCannotStart
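
Regarding issue 1: if the BF16-vs-FP16 mismatch is indeed the culprit, I assume the fix would be to re-export with the cast type pinned to bf16. Based on my reading of the optimum-neuron docs, that would look roughly like this (the flags, example values, and output directory are a sketch, not something I have verified on my setup):

```shell
# Sketch: re-compile the fine-tuned checkpoint for Neuron, keeping bf16
# so the compiled dtype matches the training dtype.
# batch_size / sequence_length / num_cores are example values only.
optimum-cli export neuron \
  --model Irisjacobs/Llama-2-7b-chat-hf-Examify \
  --task text-generation \
  --batch_size 1 \
  --sequence_length 2048 \
  --num_cores 2 \
  --auto_cast_type bf16 \
  llama2-neuron-bf16/
```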

These errors would make sense if the model were simply too big to run on this instance. However, I have seen benchmarks of people deploying a Llama2 7B model on an inf2.xlarge instance.
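
One thing I considered: signal 9 is SIGKILL, which on Linux often comes from the out-of-memory killer, so maybe the bottleneck is the 16 GiB of host RAM on inf2.xlarge while the checkpoint shards are being loaded into CPU memory, rather than the accelerator memory itself. A rough sketch of that math (the parameter count and bytes-per-parameter are my assumptions, happy to be corrected):

```python
# Back-of-the-envelope host-RAM estimate for loading a Llama-2 7B checkpoint.
PARAMS = 6_740_000_000   # assumed approximate parameter count of Llama-2 7B
BYTES_PER_PARAM = 2      # fp16 / bf16 weights

weights_gib = PARAMS * BYTES_PER_PARAM / 1024**3
print(f"raw weights: ~{weights_gib:.1f} GiB")

# If inf2.xlarge really has 16 GiB of host RAM, then ~12.6 GiB of weight
# shards plus the Python runtime and the TGI launcher could exceed it,
# at which point the kernel OOM-killer would SIGKILL the shard (signal 9).
```

If that reasoning holds, it would explain why the same image works on inf2.8xlarge (which has much more host RAM) while failing here before the model ever reaches the Neuron cores.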