I was working with Mistral Instruct 7B and realized that after fine-tuning it, the model kept generating indefinitely. While following very common guides and feedback, I noticed a common mistake: defining the pad_token as the eos_token. This has a dramatic consequence, because the DataCollator masks every pad_token with the -100 label.
Hence, the fine-tuned model never sees the eos_token in its labels and keeps generating indefinitely. Here is the relevant extract of the DataCollator function:
labels = batch["input_ids"].clone()
if self.tokenizer.pad_token_id is not None:
    labels[labels == self.tokenizer.pad_token_id] = -100
batch["labels"] = labels
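To make the failure mode concrete, here is a minimal, self-contained sketch of that masking step using hypothetical token IDs (the IDs and sequence are made up for illustration; only the element-wise replacement mirrors the collator logic above):

```python
# Hypothetical token IDs: suppose the tokenizer's pad_token was set to its
# eos_token, so both share ID 2.
pad_token_id = 2
input_ids = [5, 9, 7, 2, 2, 2]  # real eos at index 3, padding after it

# The collator's masking step, applied element-wise:
labels = [-100 if tok == pad_token_id else tok for tok in input_ids]

print(labels)  # [5, 9, 7, -100, -100, -100]
```

Because the genuine eos at index 3 is indistinguishable from padding, it too is replaced with -100, so the loss never rewards the model for emitting eos.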
I would advise fixing this quickly, either by communicating about it or by distinguishing between the pad and eos IDs using their mask values.
So far, my workaround has been to set the pad_id to a very large value and modify the data collator to replace that value with -100.
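A minimal sketch of that workaround, under the assumption that the chosen sentinel ID lies outside the model's vocabulary so it can never collide with eos_token_id (the value 10**6 and the sequence below are illustrative, not taken from the original post):

```python
# Hypothetical sentinel: an ID guaranteed not to be a real vocabulary token.
PAD_ID = 10**6

def mask_pad_labels(input_ids):
    """Replace the sentinel pad ID with -100 so the loss ignores padding
    while still training on the genuine eos token (ID 2 in this example)."""
    return [-100 if tok == PAD_ID else tok for tok in input_ids]

batch = [5, 9, 7, 2, PAD_ID, PAD_ID]  # eos (ID 2) followed by sentinel padding
print(mask_pad_labels(batch))  # [5, 9, 7, 2, -100, -100]
```

The eos token now survives in the labels, so the model learns when to stop. Another commonly used alternative is to pad with a token distinct from eos (for example, the tokenizer's unk_token) rather than an out-of-vocabulary sentinel.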
Hi. I'm also fine-tuning the Mistral-7B Instruct model on a mental health dataset (~90k rows) using NVIDIA A100 GPUs. I am running into two issues:
1- The model keeps generating text until it reaches the max_new_token limit.
2- When it does stop, it stops abruptly, without finishing the sentence/paragraph/word.
I am not sure what value to assign to pad_token_id. Could you please guide me on what value worked for you?