Deploying Llama2 7B fine tuned model on inf2.xlarge

Hi there!

I am trying to deploy a fine-tuned model on an Inferentia2 instance. I did not train the model myself; the original model can be found at Irisjacobs/Llama-2-7b-chat-hf-Examify, and I have compiled it to LarsJacobs2003/Examify-Llama2-7B-NeuronCompiled-FP16. While trying to deploy the compiled model, I have run into two different issues:

  1. When deploying the model on an inf2.8xlarge instance, it does work, but I get very strange responses, often using different languages and symbols than the ones in the prompt. Could this be because the model was trained in BF16 but compiled in FP16? Is there a checklist of things to watch out for so that the model gets compiled correctly?

  2. When trying to deploy on an inf2.xlarge instance, it fails to start. These are the logs produced:

  • Downloading shards: 100%|██████████| 2/2 [00:48<00:00, 24.45s/it]

  • Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s] rank=0

  • 2024-05-08T13:07:00.917421Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 9 rank=0

  • 2024-05-08T13:07:00.955959Z ERROR text_generation_launcher: Shard 0 failed to start

  • 2024-05-08T13:07:00.955976Z INFO text_generation_launcher: Shutting down shards

  • Error: ShardCannotStart
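
Regarding issue 1: if the BF16-vs-FP16 mismatch is indeed the culprit, I assume the fix would be to re-export with the cast type pinned to bf16. Based on my reading of the optimum-neuron docs, that would look roughly like this (the flags, example values, and output directory are a sketch, not something I have verified on my setup):

```shell
# Sketch: re-compile the fine-tuned checkpoint for Neuron, keeping bf16
# so the compiled dtype matches the training dtype.
# batch_size / sequence_length / num_cores are example values only.
optimum-cli export neuron \
  --model Irisjacobs/Llama-2-7b-chat-hf-Examify \
  --task text-generation \
  --batch_size 1 \
  --sequence_length 2048 \
  --num_cores 2 \
  --auto_cast_type bf16 \
  llama2-neuron-bf16/
```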

These errors would make sense if the model were simply too big to run on this instance. However, I have seen benchmarks of people deploying a Llama2 7B model on an inf2.xlarge instance.
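
One thing I considered: signal 9 is SIGKILL, which on Linux often comes from the out-of-memory killer, so maybe the bottleneck is the 16 GiB of host RAM on inf2.xlarge while the checkpoint shards are being loaded into CPU memory, rather than the accelerator memory itself. A rough sketch of that math (the parameter count and bytes-per-parameter are my assumptions, happy to be corrected):

```python
# Back-of-the-envelope host-RAM estimate for loading a Llama-2 7B checkpoint.
PARAMS = 6_740_000_000   # assumed approximate parameter count of Llama-2 7B
BYTES_PER_PARAM = 2      # fp16 / bf16 weights

weights_gib = PARAMS * BYTES_PER_PARAM / 1024**3
print(f"raw weights: ~{weights_gib:.1f} GiB")

# If inf2.xlarge really has 16 GiB of host RAM, then ~12.6 GiB of weight
# shards plus the Python runtime and the TGI launcher could exceed it,
# at which point the kernel OOM-killer would SIGKILL the shard (signal 9).
```

If that reasoning holds, it would explain why the same image works on inf2.8xlarge (which has much more host RAM) while failing here before the model ever reaches the Neuron cores.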