I was working with Mistral Instruct 7B and realized that after fine-tuning it, I had a model that keeps generating indefinitely. While following very common guides and feedback, I noticed that a common mistake is to define the pad_token as the eos_token. This has a dramatic consequence: the DataCollator masks every pad_token with the -100 label.
Hence, the fine-tuned model never sees the eos_token in its labels and keeps generating indefinitely. Here is the relevant extract of the DataCollator code:
labels = batch["input_ids"].clone()
if self.tokenizer.pad_token_id is not None:
    labels[labels == self.tokenizer.pad_token_id] = -100
batch["labels"] = labels
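To see the effect end-to-end, here is a minimal reproduction sketch (the model name is only an example, and it assumes the causal-LM collator with mlm=False):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")  # example model
tokenizer.pad_token = tokenizer.eos_token  # the commonly recommended (problematic) setting

# Build one sample that ends with eos
enc = tokenizer("Hello world")
enc["input_ids"].append(tokenizer.eos_token_id)
enc["attention_mask"].append(1)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
batch = collator([enc])

# The final eos position is labeled -100 because eos_token_id == pad_token_id,
# so the loss never teaches the model to emit eos
print(batch["labels"])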
I would advise fixing this quickly, either by documenting it or by distinguishing the pad and eos ids through their attention-mask values.
So far, my workaround has been to set the pad_id to a very large value and to modify the data collator so that it replaces this value with -100.
Hi. I'm also fine-tuning the Mistral-7B Instruct model on a mental health dataset (~90k rows) using NVIDIA A100 GPUs. I am running into two issues:
1- The model keeps generating text until it reaches the max_new_tokens limit.
2- When it does stop, it stops abruptly, without finishing the sentence/paragraph/word.
I am not sure what value to assign to the pad_token_id. Could you please guide me on what value worked for you?
Setting the pad token to the <unk> token is simple, but it could result in some unexpected behavior:
Unknown Words: The <unk> (unknown) token is designed to represent out-of-vocabulary words, i.e. words that were not seen during training or are not included in the model's vocabulary.
Semantic Meaning: The <unk> token carries the semantic implication that something is unknown, which is different from the purpose of a pad token.
I recommend creating a new padding token; that way we don't have to modify any code in the DataCollatorForLanguageModeling class, and we still get a correct attention mask, which ensures correct generation. Also, don't forget to update the model's vocabulary (resize its token embeddings) accordingly, to prevent CUDA assertion errors.
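Here is a minimal sketch of that approach (the model name and the "<pad>" string are placeholders; adapt them to your setup):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register a dedicated pad token instead of reusing eos_token
tokenizer.add_special_tokens({"pad_token": "<pad>"})

# Grow the embedding matrix so the new token id is valid;
# an out-of-range id is what triggers the CUDA device-side asserts
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

With this in place, DataCollatorForLanguageModeling masks only the real padding positions, so the eos labels are preserved and the model learns when to stop.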