Strange outputs from Mixtral model

I’m running Mixtral 8x7B for the first time and running into some difficulties: the output is not what I expect.

The outputs are supposed to be structured JSON, and it works some of the time, but in other cases I get strange sequences of ‘\xa0’ that break the JSON parsing. In other cases the output is simply missing the closing ‘}’ at the end.

For example, here’s a snippet from my output:

'\t{\n"pairs": [\n\xa0\xa0\xa0\xa0{\n\xa0\xa0\xa0\xa0\xa0\xa0"suggestions": "how about another drink?"\n\xa0\xa0\xa0\xa0},'

It’s important to note that sometimes this works well, meaning sometimes I don’t see any ‘\xa0’ in the output and the parsing works.

Another thing to note is that I’m getting these warnings (maybe they’re related):
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Finally, here is the configuration I use:

import torch
import transformers

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)
model.eval()
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

pipeline = transformers.pipeline(
model=model, tokenizer=tokenizer,
return_full_text=False,
task="text-generation",
temperature=0.1,
top_p=0.15,
top_k=0,
max_new_tokens=512,
repetition_penalty=1.1
)

The right-padding warning is displayed regardless of whether I set the tokenizer padding side to left or not, and I’m also running this through a LangChain class that handles Hugging Face pipelines.
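For reference, this is roughly how I tried setting the padding (my real code goes through the LangChain wrapper, so this is a simplified sketch of my attempt rather than the exact code):

# simplified sketch; padding_side and pad_token are my attempt at addressing the two warnings
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, padding_side='left')
tokenizer.pad_token = tokenizer.eos_token  # so generate() doesn't fall back to eos_token_id for padding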

Hi,

Some comments:

  • Mixtral is natively supported in the Transformers library, so there’s no need to pass trust_remote_code=True.
  • from_pretrained puts your model in evaluation mode by default, so there’s no need to call model.eval().
  • instruction-tuned models such as “mistralai/Mixtral-8x7B-Instruct-v0.1” require a so-called chat template to make sure the tokens are prepared in the appropriate format.

Here’s how to do inference properly with Mixtral:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# specify how to quantize the model
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1", quantization_config=quantization_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

prompt = "My favourite condiment is"

messages = [
        {"role": "user", "content": "What is your favourite condiment?"},
        {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
        {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)
tokenizer.batch_decode(generated_ids)[0]
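
Note that this decodes the prompt together with the completion. If you only want the newly generated part (the equivalent of return_full_text=False in your pipeline setup), a small variation is to slice off the prompt tokens before decoding:

# keep only the tokens generated after the prompt, and drop special tokens like </s>
new_tokens = generated_ids[:, model_inputs.shape[1]:]
print(tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0])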

Thank you for the response.

What if my instruction does not fit into a chat scenario?

For now I want to use the instruct abilities.

Specifically, I haven’t been able to parse the JSON outputs properly because the output is missing the closing brace ‘}’; at least that’s what I think is happening.

I have a specific task and I simply send the model the following message:

mixtral_format = f'<s> [INST] {formatted_text} [/INST]'

The input is quite long, approximately 3000-4000 tokens.

Another suspicion I have is that my output-formatting instruction is in the LangChain format.

Here’s the last part of my input message:

'ye\n\nplease output your response only in the demanded json format\n\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{"properties": {"main_dialogue": {"description": "The dialogue between the subject and the therapist as a string", "title": "Main Dialogue", "type": "string"}}, "required": ["main_dialogue"]}\n``` [/INST]'

Do you think this could be related to the strange behaviour I’m seeing?

Even if your use case is not a chat use case, like generating JSON, you still need to comply with the messages API. Hence, one way to ask the model to generate JSON is as follows:

from pydantic import BaseModel

class FoodItem(BaseModel):
    description: str 

messages = [
        {"role": "user", "content": f"Generate a food item according to the schema {FoodItem.model_json_schema()}"},
]

model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

You could expand the messages list by including “user” and “assistant” few-shot examples in order to improve the JSON generation, for example:
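
A minimal sketch (the example assistant answer is made up purely for illustration):

messages = [
        # hypothetical few-shot turn: show the model one well-formed answer first
        {"role": "user", "content": f"Generate a food item according to the schema {FoodItem.model_json_schema()}"},
        {"role": "assistant", "content": '{"description": "A ripe avocado with a creamy texture"}'},
        # then ask the actual question in the same format
        {"role": "user", "content": f"Generate another food item according to the schema {FoodItem.model_json_schema()}"},
]

model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")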

I understand.
Can you provide some information about these warnings:

Token indices sequence length is longer than the specified maximum sequence length for this model (2001 > 1024). Running this sequence through the model will result in indexing errors
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

I thought the context length is 32K.

I also increased the max_new_tokens argument to 4096.
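
For reference, I’m printing these two values to figure out where the 1024 in the warning comes from (assuming I’m looking at the right attributes):

print(tokenizer.model_max_length)            # the limit the tokenizer warning (2001 > 1024) checks against
print(model.config.max_position_embeddings)  # the model's actual context window, which should be 32768 for Mixtral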

Thank you for the responses!

I looked inside your example, specifically at what apply_chat_template produces.

This is the rendered text:
<s>[INST] Generate a food item according to the schema {'properties': {'description': {'title': 'Description', 'type': 'string'}}, 'required': ['description'], 'title': 'FoodItem', 'type': 'object'} [/INST]

and I modified my prompt-building logic to match that.
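
Roughly, it now looks like this (I also compared it against apply_chat_template with tokenize=False, which as far as I can tell renders the same string):

messages = [{"role": "user", "content": formatted_text}]
rendered = tokenizer.apply_chat_template(messages, tokenize=False)
# for this model's template, rendered == f"<s>[INST] {formatted_text} [/INST]"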

Still, for some longer inputs I have an issue parsing the outputs: they don’t follow the requested JSON schema, which results in an error like this:

File "/home/ai_center/ai_users/user/miniconda3/envs/psyq/lib/python3.9/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
  File "/home/ai_center/ai_users/user/miniconda3/envs/psyq/lib/python3.9/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 4 column 4 (char 117)

I think it is related to the length of the prompt, because when I simply “chunked out” 500 characters from the middle of my prompt and invoked again, to my surprise the outputs looked good.

Is it possible that I’m not setting the max length attribute correctly somewhere? Is this a known issue, or something that can be handled from another angle?

If the JSON errors are predictable, have you considered a repair function?
Yes, it feels like (and is) a bit of a kludge, but…

import json
import regex

def LLM_JSON_to_dict(txtin: str) -> dict:
    """Clean and convert raw LLM text output to a dictionary."""
    txtnmd = regex.sub(r"```", "", txtin)  # strip markdown code ticks
    txtj1 = regex.search(r'{[^{}]+}', txtnmd).group()  # keep only the first innermost {...} block
    txtfnl = regex.sub('""', '", "', txtj1)  # insert a comma between back-to-back double quotes
    txtne = regex.sub(r"\\'", "'", txtfnl)  # unescape apostrophes
    txtcln = txtne[:txtne.rfind('"') + 1] + "}"  # drop anything between the last " and the closing }
    return json.loads(txtcln)

Obviously you would need to add rules that address your particular JSON quirks… this issue arose with a different LLM / different use case, but the repair was successful enough to get on with the problem I really cared about…
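
For instance, on the kind of “Extra data” failure you showed, a call might look like this (the input string here is just a made-up illustration): it keeps only the first {...} block, drops the trailing text, and re-closes the object. You would still want to add a rule for the ‘\xa0’ characters.

messy = '{\n"main_dialogue": "how about another drink?"\n}\nplus some commentary the model tacked on'
print(LLM_JSON_to_dict(messy))
# {'main_dialogue': 'how about another drink?'}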