Hello! I’m fairly new to most LLM topics, but as an exercise I wanted to deploy mistralai/Mistral-7B-Instruct-v0.3 to AWS Inferentia 2. To do that, I used the code available under the Deploy menu. The endpoint deployed successfully, but there appears to be a strict limit of "sequence_length": 4096, as per this file: https://huggingface.co/aws-neuron/optimum-neuron-cache/blob/main/inference-cache-config/mistral.json
hub = {
"HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3",
"HF_NUM_CORES": "2",
"HF_AUTO_CAST_TYPE": "fp16",
"MAX_BATCH_SIZE": "8",
"MAX_INPUT_TOKENS": "3686",
"MAX_TOTAL_TOKENS": "4096",
"HF_TOKEN": "<REPLACE WITH YOUR TOKEN>",
}
I tried to deploy the model with higher values than the ones above, but it failed, stating that there is no cached version of the model for that number of tokens (I had set 10k as the parameter).
My question is: how can I overcome this limitation and deploy this model to Inferentia2 with a higher number of input and total tokens? I tried to compile the model myself, roughly as sketched below, but this is beyond my current abilities.
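From what I gathered in the optimum-neuron docs, compiling for a longer sequence length would look something like the following. Please treat this as my best guess, not something that worked for me; the exact kwargs (batch_size, sequence_length, num_cores, auto_cast_type) and the 10240 value are just what I think a 10k-token config would need:

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"

# Export (compile) the model for Inferentia2 with a longer sequence length.
# As I understand it, this has to run on an inf2/trn1 instance with enough memory.
model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    batch_size=1,
    sequence_length=10240,
    num_cores=2,
    auto_cast_type="fp16",
)

# Save the compiled artifacts (and, I assume, push them to a Hub repo so the
# endpoint can load them instead of looking for a cached configuration).
model.save_pretrained("mistral-7b-instruct-neuron-10k")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("mistral-7b-instruct-neuron-10k")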
I’d appreciate the help. Thanks!