Mistral trouble when fine-tuning: don't set pad_token_id = eos_token_id

Hello everyone,

I was working with Mistral 7B Instruct and realized that after fine-tuning it, the model kept generating indefinitely. While following very common guides and community feedback, I noticed a common mistake: setting the pad_token to the eos_token. This has a dramatic consequence, because the DataCollator masks every pad_token in the labels with -100.

Hence, the fine-tuned model never sees the eos_token during training and keeps generating indefinitely. Here is the relevant extract of the DataCollator:

labels = batch["input_ids"].clone()
if self.tokenizer.pad_token_id is not None:
    labels[labels == self.tokenizer.pad_token_id] = -100
batch["labels"] = labels
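
To see the effect concretely, here is a minimal repro sketch (assuming transformers and torch are installed and the Mistral-7B-Instruct-v0.2 tokenizer is available; the toy sentences are mine): with pad_token set to eos_token, the default causal-LM collator turns every EOS label into -100, so the loss never covers the EOS position.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token = tokenizer.eos_token  # the problematic setting

# Two toy sequences that end with a real EOS token
features = [
    {"input_ids": tokenizer("Hello there").input_ids + [tokenizer.eos_token_id]},
    {"input_ids": tokenizer("Hi").input_ids + [tokenizer.eos_token_id]},
]

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
batch = collator(features)

# Every position equal to pad_token_id (== eos_token_id) is now labelled -100,
# including the real EOS at the end of each sequence
print(batch["labels"])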

I would advise fixing this quickly, either by documenting it or by distinguishing pad and eos ids using their attention mask values.

So far, my workaround has been to set the pad id to a very large value and modify the data collator to replace that value with -100 in the labels.
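
In case it is useful, here is a rough sketch of that workaround (the names SENTINEL_PAD_ID and PadAwareCollator are mine, and mapping the sentinel back to a valid token id before the forward pass is an extra detail I added so the embedding lookup does not go out of range):

import torch

SENTINEL_PAD_ID = 10_000_000  # a value that cannot collide with eos_token_id

class PadAwareCollator:
    def __call__(self, features):
        max_len = max(len(f["input_ids"]) for f in features)
        input_ids, attention_mask = [], []
        for f in features:
            pad = max_len - len(f["input_ids"])
            input_ids.append(f["input_ids"] + [SENTINEL_PAD_ID] * pad)
            attention_mask.append([1] * len(f["input_ids"]) + [0] * pad)
        input_ids = torch.tensor(input_ids)
        labels = input_ids.clone()
        labels[labels == SENTINEL_PAD_ID] = -100      # mask padding only, never EOS
        input_ids[input_ids == SENTINEL_PAD_ID] = 0   # keep inputs inside the vocabulary
        return {
            "input_ids": input_ids,
            "attention_mask": torch.tensor(attention_mask),
            "labels": labels,
        }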

These guides make this mistake:

Hope that helps, and that someone could raise a PR.


Hi. I’m also fine-tuning the Mistral-7B Instruct model on a mental health dataset (~90k rows) using NVIDIA A100 GPUs. I am running into two issues:

1- The model keeps generating text until it reaches the max_new_tokens limit.
2- When it does stop, it stops abruptly, without finishing the sentence/paragraph/word.

I am not sure what value to assign to pad_token_id. Could you please guide me on what value worked for you?

Hello!

I did it in three steps:

  • First, pre-tokenize and pad the dataset outside of the data collator function, with pad_id = eos_token
  • Apply a function which sets the attention mask value to 1 for the first eos_token_id in the sequence
  • Train the model on the tokenized examples
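
Here is a hedged sketch of what those steps could look like (the function name tokenize_and_pad, the max_length value, and building the labels from the attention mask are my own assumptions, not the original code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

def tokenize_and_pad(example, max_length=512):
    enc = tokenizer(example["text"], truncation=True,
                    max_length=max_length, padding="max_length")
    input_ids, attention_mask = enc["input_ids"], enc["attention_mask"]

    # Step 2: un-mask the first eos_token_id that follows the real tokens,
    # so the model is still trained to emit EOS
    seq_len = sum(attention_mask)
    if seq_len < max_length and input_ids[seq_len] == tokenizer.eos_token_id:
        attention_mask[seq_len] = 1

    labels = [tok if m == 1 else -100 for tok, m in zip(input_ids, attention_mask)]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Step 3: tokenized = dataset.map(tokenize_and_pad), then train on `tokenized`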

Hi. Thanks for responding. Actually, I’m not using the data collator function. Here’s my code snippet.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = "mistralai/Mistral-7B-Instruct-v0.2"

eos_token = "[/INST]"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_enable_fp32_cpu_offload=True,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

tokenizer.padding_side = "right"
tokenizer.add_eos_token = True
tokenizer.max_new_tokens = 2000
tokenizer.max_length = 200
tokenizer.max_new_length = 200
tokenizer.pad_token_id = 2041
tokenizer.pad_token = tokenizer.unk_token
eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

Let me know your comments. Thanks!

A simple solution is to use another token as the pad token, i.e.

tokenizer.pad_token = tokenizer.unk_token

That way, you still train on the eos tokens, but not on the padding tokens.
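
A minimal sketch of this setup (using the Mistral-7B-Instruct-v0.2 tokenizer as an example; the toy sentences are mine): padding positions become UNK and are the only labels masked to -100, while the EOS appended by add_eos_token stays in the labels.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token = tokenizer.unk_token  # pad != eos
tokenizer.add_eos_token = True             # sequences end with a real EOS

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
features = [{"input_ids": tokenizer("Hello there").input_ids},
            {"input_ids": tokenizer("Hi").input_ids}]
batch = collator(features)
print(batch["labels"])  # EOS ids are kept; only the UNK padding is -100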

Any solution for stopping the model before it reaches its max token length?