Mistral trouble when fine-tuning: Don't set pad_token_id = eos_token_id

Hello everyone,

I was working with Mistral Instruct 7B and realized that, after fine-tuning it, I had a model that kept generating indefinitely. While following several very common guides and reading feedback, I realized that a common mistake is to define the pad_token as the eos_token. This has a dramatic effect, because the DataCollator sets the label of every pad_token to -100, and that also masks the real eos_token.

Hence, the fine-tuned model never sees the eos_token during training and keeps generating indefinitely. Here is the relevant extract from the DataCollator function:

labels = batch["input_ids"].clone()
if self.tokenizer.pad_token_id is not None:
    labels[labels == self.tokenizer.pad_token_id] = -100
batch["labels"] = labels
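To make the effect concrete, here is a minimal sketch that reproduces the masking rule above on plain Python lists instead of tensors. The helper name mask_pad_labels, the token ids, and the example sequence are assumptions chosen for illustration; they are not taken from the collator source.

```python
IGNORE_INDEX = -100  # label value that the loss ignores

def mask_pad_labels(input_ids, pad_token_id):
    """Mirror the collator rule: every position equal to pad_token_id gets label -100."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

EOS_ID = 2  # assumed id, for illustration only

# One padded example: three content tokens, the real EOS, then two padding slots.
# Because pad_token_id == eos_token_id, the padding slots hold the same id as the EOS.
input_ids = [5, 6, 7, EOS_ID, EOS_ID, EOS_ID]

labels = mask_pad_labels(input_ids, pad_token_id=EOS_ID)
print(labels)  # [5, 6, 7, -100, -100, -100]
```

The real EOS at position 3 is masked together with the padding, so no training signal for "stop here" is left.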

I would advise fixing this quickly, either by communicating about it or by distinguishing between the pad and eos ids using their attention mask values.

So far, my workaround has been to set the pad_id to a very large value and modify the data collator to replace this value with -100.
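A rough sketch of what such a collator could look like, written with plain Python lists so that it stays self-contained. PAD_SENTINEL, collate, and fill_id are hypothetical names. One step here is my own assumption rather than something stated above: the sentinel is also swapped for a valid token id in input_ids, since an id outside the vocabulary cannot be looked up in the embedding table.

```python
IGNORE_INDEX = -100
PAD_SENTINEL = 10**9  # hypothetical "very large" pad id that no real token uses

def collate(sequences, fill_id):
    """Right-pad with the sentinel, then derive labels, mask and inputs from it."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        padded = seq + [PAD_SENTINEL] * (max_len - len(seq))
        # Only sentinel positions are ignored by the loss; a real EOS keeps its label.
        batch["labels"].append([IGNORE_INDEX if t == PAD_SENTINEL else t for t in padded])
        batch["attention_mask"].append([0 if t == PAD_SENTINEL else 1 for t in padded])
        # Assumption: replace the sentinel with a valid id before the forward pass.
        batch["input_ids"].append([fill_id if t == PAD_SENTINEL else t for t in padded])
    return batch

EOS_ID = 2  # assumed id, for illustration only
batch = collate([[5, 6, 7, EOS_ID], [8, EOS_ID]], fill_id=EOS_ID)
print(batch["labels"])  # [[5, 6, 7, 2], [8, 2, -100, -100]]
```

The padded positions are hidden by the attention mask and ignored by the loss, while the EOS that ends each example is still a training target.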

Those guides are making the mistake:

Hope that helps and that it could lead to a PR.


Hi. I’m also fine-tuning the Mistral-7B Instruct model on a mental health dataset (~90k rows) using NVIDIA A100 GPUs. I am running into two issues:

1- The model keeps generating text until it reaches the max_new_tokens limit.
2- When it does stop, it stops abruptly, without finishing the sentence/paragraph/word.

I am not sure what value to assign to pad_token_id. Could you please tell me what value worked for you?

Hello !

I did it in three steps:

  • First, pre-tokenize and pad the dataset outside of the data collator function, with pad_id = eos_token_id
  • Apply a function which sets the attention mask value to 1 for the first eos_token_id in the sequence
  • Train the model on the tokenized examples
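The first two steps can be sketched roughly as follows, again with plain lists. The helper names unmask_first_eos and labels_from_mask are hypothetical, and building the labels from the attention mask is an assumption on my side, since the steps above do not say how the labels are built.

```python
IGNORE_INDEX = -100

def unmask_first_eos(input_ids, attention_mask, eos_id):
    """Return a copy of the mask with the first eos position set to 1."""
    mask = list(attention_mask)
    if eos_id in input_ids:
        mask[input_ids.index(eos_id)] = 1
    return mask

def labels_from_mask(input_ids, attention_mask):
    """Keep the token as label where the mask is 1, ignore it elsewhere (assumption)."""
    return [tok if m == 1 else IGNORE_INDEX for tok, m in zip(input_ids, attention_mask)]

EOS_ID = 2  # assumed id, for illustration only

# Step 1: the example was padded with the eos id, so the mask covers only the content.
input_ids = [5, 6, 7, EOS_ID, EOS_ID, EOS_ID]
attention_mask = [1, 1, 1, 0, 0, 0]

# Step 2: the first eos becomes a real, attended token.
mask = unmask_first_eos(input_ids, attention_mask, EOS_ID)
labels = labels_from_mask(input_ids, mask)
print(mask)    # [1, 1, 1, 1, 0, 0]
print(labels)  # [5, 6, 7, 2, -100, -100]
```

Note that this only works as intended when the eos id does not already occur earlier in the example, for instance between turns of a multi-turn conversation.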

Hi. Thanks for responding back. Actually I’m not using the data collator function. Here’s my code snippet.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = "mistralai/Mistral-7B-Instruct-v0.2"

# 4-bit NF4 quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# The arguments of this call are cut off in the post; passing the base model
# and the quantization config is an assumption.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

tokenizer.padding_side = "right"
tokenizer.add_eos_token = True
tokenizer.pad_token = tokenizer.unk_token
model.config.pad_token_id = tokenizer.pad_token_id

Let me know your comments. Thanks!

A simple solution is to use another token as the pad token, e.g.

tokenizer.pad_token = tokenizer.unk_token

That way, you still train on the eos tokens, but not on the paddings.
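A quick way to check this on a tokenized example is to count how many eos positions still carry a real label after masking. The sketch below uses plain lists; the helper names and the ids (0 for unk, 2 for eos) are assumptions for illustration.

```python
IGNORE_INDEX = -100

def mask_pad_labels(input_ids, pad_token_id):
    """Same rule as the collator: positions equal to pad_token_id get label -100."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

def supervised_eos_count(labels, eos_id):
    """Number of positions where the eos token is still a training target."""
    return sum(1 for tok in labels if tok == eos_id)

UNK_ID, EOS_ID = 0, 2  # assumed ids, for illustration only

# Content tokens, the real EOS, then padding that uses the unk id.
input_ids = [5, 6, 7, EOS_ID, UNK_ID, UNK_ID]
labels = mask_pad_labels(input_ids, pad_token_id=UNK_ID)

print(labels)                                # [5, 6, 7, 2, -100, -100]
print(supervised_eos_count(labels, EOS_ID))  # 1
```

One side effect to keep in mind: any genuine unk token inside the text is masked as well, because the rule cannot tell it apart from padding.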

Is there any solution for stopping the model before it reaches its max token length?