Hello! I’m fairly new to most LLM topics, but as an exercise I wanted to deploy mistralai/Mistral-7B-Instruct-v0.3 to AWS Inferentia 2. To do that, I used the code available under the Deploy menu. The endpoint deployed successfully, but there appears to be a strict limit of "sequence_length": 4096, as per this file: https://huggingface.co/aws-neuron/optimum-neuron-cache/blob/main/inference-cache-config/mistral.json
hub = {
"HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3",
"HF_NUM_CORES": "2",
"HF_AUTO_CAST_TYPE": "fp16",
"MAX_BATCH_SIZE": "8",
"MAX_INPUT_TOKENS": "3686",
"MAX_TOTAL_TOKENS": "4096",
"HF_TOKEN": "<REPLACE WITH YOUR TOKEN>",
}
I tried to deploy the model with higher values than the ones above, but it failed, stating that there is no cached version of the model for that number of tokens (I had set 10k as the parameter).
My question is: how can I overcome this limitation and deploy this model to Inferentia2 with a higher number of input and total tokens? I tried to compile the model myself, roughly as sketched below, but this is beyond my current abilities.
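From what I gathered in the optimum-neuron docs, compiling for a longer sequence length would look something like the following. Please treat this as my best guess, not something that worked for me; the exact kwargs (batch_size, sequence_length, num_cores, auto_cast_type) and the 10240 value are just what I think a 10k-token config would need:

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"

# Export (compile) the model for Inferentia2 with a longer sequence length.
# As I understand it, this has to run on an inf2/trn1 instance with enough memory.
model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    batch_size=1,
    sequence_length=10240,
    num_cores=2,
    auto_cast_type="fp16",
)

# Save the compiled artifacts (and, I assume, push them to a Hub repo so the
# endpoint can load them instead of looking for a cached configuration).
model.save_pretrained("mistral-7b-instruct-neuron-10k")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("mistral-7b-instruct-neuron-10k")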
I’d appreciate the help. Thanks!