I am using an Inference Endpoint with mistralai/Mistral-7B-Instruct-v0.1. The output is truncated: for instance, the proposed test query "Can you please let us know more details about your " yields “2019 Honda CR-V Touring?\n\n1. What is the mile”. How can I adjust the output size?
Have you tried modifying the ‘max_tokens’ parameter?
No I haven’t. Is this parameter documented somewhere? Using the following yields the same result:
# query() is the request helper from the Inference Endpoint example snippet
output = query({
    "inputs": "Can you please let us know more details about your ",
    "parameters": {
        "max_tokens": 128
    }
})
You need the correct prompt template for the model:
<s>[INST] Can you please let us know more details about your [/INST]
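For example, reusing the query helper from your snippet (just a sketch; note that the text-generation parameter is max_new_tokens rather than max_tokens):

output = query({
    "inputs": "<s>[INST] Can you please let us know more details about your [/INST]",
    "parameters": {
        "max_new_tokens": 128
    }
})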
I recommend using our Python client plus prompt templates via transformers.
Here is a code snippet example:
pip install transformers jinja2 huggingface-hub

from transformers import AutoTokenizer
from huggingface_hub import InferenceClient

# The tokenizer provides the model's chat template; the client runs remote inference
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
client = InferenceClient("mistralai/Mistral-7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "Can you please let us know more details about your "},
]

# Format the messages with the model's "<s>[INST] ... [/INST]" template, without tokenizing
prompt_encoded = tokenizer.apply_chat_template(messages, tokenize=False)

# max_new_tokens controls how many tokens are generated
output = client.text_generation(prompt_encoded, max_new_tokens=200)
print(output)
Note that the model argument can also point to a deployed Inference Endpoint:
model (str, optional) — The model to run inference with. Can be a model id hosted on the Hugging Face Hub, e.g. bigcode/starcoder or a URL to a deployed Inference Endpoint. Defaults to None, in which case a recommended model is automatically selected for the task.
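For instance (the URL below is a placeholder for your own endpoint):

from huggingface_hub import InferenceClient

# Point the client at a deployed Inference Endpoint instead of a Hub model id
client = InferenceClient("https://<your-endpoint>.endpoints.huggingface.cloud")
output = client.text_generation(prompt_encoded, max_new_tokens=200)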
Ok I see, it works now with the prompt template. Thanks!