The HF falcon tutorial has the following line:
tokenizer.pad_token = tokenizer.eos_token
It looks strange to me. It makes sense that pad and eos can be the same, but then why distinguish between them in the first place?
cross:
From my understanding falcon doesn’t have a pad token defined in the model config, that’s why you define the if statement to avoid getting an error with missing pad token.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id
But why not use the pad token? Are you implying that the pad token was not used during training, so there is no point in using it as the pad token, and eos should be used instead?
For now I’m convinced .pad_token = eos_token is fine for decoder models (even when fine-tuning).
Assume we set eos = pad. Then the model is trained to predict eos more often: since the pad token is the eos token, the loss does not mask out the extra eos tokens. However, transformers are conditional models. Therefore, in a decoder-only model (which is the case I care about), the model only increases the probability of eos given that it has already predicted eos. At inference we would stop at the first eos anyway, so it doesn’t matter, since we are only up-weighting eos conditioned on eos tokens that have already been seen. Also, if a separate pad token is never trained on, it should always have a low probability, so it likely won’t be an issue. In addition, if for some reason the pad token was trained on (assuming no bugs), then it will be predicted only after an eos. Worst case, treat pad as eos at inference to stop generation. So for decoders it’s fine. For encoder-decoders it might be an issue, since the encoder will encode eos more than usual, which is more of an issue since longer seqs will get eos more often artificially.
See my argument here: https://chat.openai.com/share/ebb9a9a2-71d3-4c97-a727-b6042494b9a9
context: falcon_peft.py · GitHub
Actually I think this discussion is correct: https://github.com/huggingface/transformers/issues/22794 but I need to think it through.
Why is this the case? It seems really bizarre to me.
Darn, this still doesn’t work:
UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
code:
"""
sfttrainer (likely using peft) best practices:
https://huggingface.co/docs/trl/main/en/sft_trainer#best-practices
Best practices
Pay attention to the following best practices when training a model with that trainer:
- SFTTrainer always pads by default the sequences to the max_seq_length argument of the SFTTrainer. If none is passed, the trainer will retrieve that value from the tokenizer. Some tokenizers do not provide default value, so there is a check to retrieve the minimum between 2048 and that value. Make sure to check it before training.
- For training adapters in 8bit, you might need to tweak the arguments of the prepare_model_for_int8_training method from PEFT, hence we advise users to use prepare_in_int8_kwargs field, or create the PeftModel outside the SFTTrainer and pass it.
- For a more memory-efficient training using adapters, you can load the base model in 8bit, for that simply add load_in_8bit argument when creating the SFTTrainer, or create a base model in 8bit outside the trainer and pass it.
- If you create a model outside the trainer, make sure to not pass to the trainer any additional keyword arguments that are relative to from_pretrained() method.
todo: why trust_remote_code? I want more details.
"""
import sys
import torch
from peft import LoraConfig
from transformers.modeling_utils import PreTrainedModel
from pdb import set_trace as st
def test_bfloat16_int4(compute_dtype: torch.dtype,
                       use_4bit,
                       ):
    """
    python -c "import torch; print(torch.cuda.get_device_capability());"
    todo: check other code test_bfloat16() do we need use_4bit?
    """
    if compute_dtype == torch.float16 and use_4bit:
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            print("=" * 80)
            print("Your GPU supports bfloat16, you can accelerate training with the argument --bfloat16")
            print("=" * 80)
def get_model_tokenizer_qlora_falcon7b(
        # -- mode args
        # model_id = "tiiuae/falcon-7b"
        pretrained_model_name_or_path: str = "ybelkada/falcon-7b-sharded-bf16",
        use_cache: bool = True,
        # -- lora args
        lora_alpha=16,  # todo
        lora_dropout=0.1,  # todo, evidence drop out really help? google, crfm, gpt4
        lora_r=64,  # todo
        bnb_4bit_compute_dtype=torch.float16,  # changed it from Guanaco hf
        # -- training args
        output_dir="./results",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        # paging so that the sudden mem gpu spikes don't cause the run to shut down
        # (I think usually caused by too long seqs)
        # todo: why 32 bit opt?
        # todo: paged nadamw opt?
        optim="paged_adamw_32bit",
        save_steps=10,
        logging_steps=10,
        learning_rate=2e-4,
        max_grad_norm=0.3,
        max_steps=500,
        warmup_ratio=0.03,
        lr_scheduler_type="constant",
        # -- quant. args (not recommended to be changed unless you know what you're doing?)
        load_in_4bit=True,  # load (usually huge) base model in 4 bits
        bnb_4bit_quant_type="nf4",  # normal float 4 for the (large) base models qlora
) -> tuple:
    """
    Load the Falcon 7B model, quantize it in 4bit and attach LoRA adapters on it.

    bf16 = 1S, 8Exp, 7Mantissa
    hypothesis: 7b trained due to 6.7 emergence rumour, I still don't think emergence is real.

    Notes:
        - ft a model is very specific to the model, tokenizer and training scheme. Thus we return
            - model, tokenizer, ft config (peft config), training args
    ref:
        - https://colab.research.google.com/drive/1DOi8MFv4SWN9NImVornZ7t6BgmLoPQO-#scrollTo=AjB0WAqFSzlD
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # - Get bnb config for bit-4 base model (bnb lib for using 4bit qlora quantization techniques by tim dettmers)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=load_in_4bit,  # load (usually huge) base model in 4 bits
        bnb_4bit_quant_type=bnb_4bit_quant_type,  # normal float 4 for the (usually huge) base model
        bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,  # if you can, during computation use bf16
    )

    # - Get falcon 4bit model
    # todo, where is this being saved & how to download quicker
    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        quantization_config=bnb_config,
        trust_remote_code=True  # allows to execute custom code you download from the uploaded model code you are using
    )
    print(f'{type(model)=}')
    print(f'{model=}')
    # this is here to save gpu vram. Likely only needed when using 40b or when oom issues happen ref: https://stackoverflow.com/questions/76633335/why-does-hugging-face-falcon-model-use-mode-config-use-cache-false-why-wouldn
    model.config.use_cache = use_cache
    print(f'{type(model)=}')

    # - Get falcon tokenizer
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path,
                                              trust_remote_code=True)  # execs code downloaded from hf hub
    # tokenizer.pad_token = tokenizer.eos_token  # ref: https://stackoverflow.com/questions/76633368/why-does-the-falcon-qlora-tutorial-code-use-eos-token-as-pad-token
    # tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # I think this is fine if during the training pad is ignored
    tokenizer.add_special_tokens({'pad_token': '<|pad|>'})  # I think this is fine if during the training pad is ignored

    # - Modify model
    # add pad token embed
    model.resize_token_embeddings(len(tokenizer))  # todo: I think this is fine if during the training pad is ignored
    model.transformer.word_embeddings.padding_idx = len(tokenizer) - 1
    model.config.max_new_tokens = len(tokenizer)
    # model.config.min_length = 1
    print(f'{model=}')
    print(f'{type(tokenizer)=}')
    print(f'{tokenizer.pad_token=}')
    # data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False) todo

    # - Get falcon lora config
    peft_config = LoraConfig(
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        r=lora_r,
        bias="none",
        task_type="CAUSAL_LM",
        # model card for falcon tiiuae/falcon-7b: https://huggingface.co/tiiuae/falcon-7b/blob/main/modelling_RW.py
        # does seem to include all trainable params as done by qlora on their own paper
        target_modules=[
            # word_embeddings,
            "query_key_value",
            "dense",
            "dense_h_to_4h",
            "dense_4h_to_h",
            # "lm_head"
        ]
    )
    print(f'{type(peft_config)=}')

    # todo: print the num params of the lora = D1*r + D2*r and num of bytes by prec. (bytes) * num params
    return model, tokenizer, peft_config
# -- tests
def example_test_model_already_has_pad_token():
    """
    if it already has pad token, it likely has a small prob, so we are done.

    compare its norm with other tokens to verify this is true.

    python ~/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/model_tokenizer/falcon_uu_mdl_tok.py
    """
    # - the get datasets todo: preprocessing, padding, streaming
    from uutils.hf_uu.data_hf.common import get_guanaco_datsets_add_splits_train_test_only
    trainset, _, testset = get_guanaco_datsets_add_splits_train_test_only()

    # qlora falcon7b
    from uutils.hf_uu.model_tokenizer.falcon_uu_mdl_tok import get_model_tokenizer_qlora_falcon7b
    model, tokenizer, peft_config = get_model_tokenizer_qlora_falcon7b()
    model: PreTrainedModel = model
    print(f'{model=}')
    sent = 'Dogs are great because they are '
    print()

    # print to see if pad tokens are present and if it ignores the tokens at the end
    encoded_input = tokenizer(sent, padding='max_length', max_length=10, return_tensors='pt')
    print(f'{encoded_input=}')

    # Print all special tokens
    print('\n---- start Print all special tokens')
    for token_name, token in tokenizer.special_tokens_map.items():
        print(f"{token_name}: {token}")
    print('\n---- end Print all special tokens')

    # Get the ID of the pad token (added above as '<|pad|>', not '[PAD]')
    pad_token_id = tokenizer.pad_token_id
    if pad_token_id is None:
        raise ValueError("The tokenizer has no pad token.")

    # Index into the model's embedding table
    try:
        print(f'{model.get_input_embeddings().weight.size()=}')
        pad_embedding = model.get_input_embeddings().weight[pad_token_id]
    except IndexError:
        raise ValueError(f"Token ID {pad_token_id} is not present in the model's embedding matrix.")

    print(f'{pad_embedding=}')
    print('Success!\n')

    # check it generates something sensible
    # tokenizer.decode(model.generate(**tokenizer(sent, return_tensors='pt'), do_sample=True)[0])
    input_ids, attention_mask = encoded_input['input_ids'], encoded_input['attention_mask']
    predicted_tokens_ids_options = model.generate(input_ids=input_ids, attention_mask=attention_mask, do_sample=True)
    predicted_tokens_ids = predicted_tokens_ids_options[0]
    predicted_sent = tokenizer.decode(predicted_tokens_ids)
    print(f'original sentence: {sent=}')
    print(f'predicted sentence: {predicted_sent=}')
    print('Success2!')


if __name__ == '__main__':
    import time

    start_time = time.time()
    example_test_model_already_has_pad_token()
    print(f"The main function executed in {time.time() - start_time} seconds.\a")
it doesn’t like the modifications to the model:
model.transformer.word_embeddings.padding_idx = len(tokenizer) - 1
model.config.max_new_tokens = len(tokenizer)
How to fix?
Honestly no idea. Researching it
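One thing that may be worth trying (a sketch, not verified against Falcon): the warning text points at generation settings stored on model.config, which here is max_new_tokens. Instead of writing to model.config, the same settings can be passed to generate() as keyword arguments. The helper name build_generate_kwargs and the default of 64 new tokens are made up for illustration. Also note that max_new_tokens is a count of tokens to generate, so len(tokenizer) (the vocabulary size) is probably not the intended value.

```python
def build_generate_kwargs(pad_token_id: int, eos_token_id: int, max_new_tokens: int = 64) -> dict:
    """Collect generation settings to pass to model.generate() instead of storing them on model.config.

    Hypothetical helper; the default of 64 new tokens is an arbitrary illustration.
    """
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    return {
        "max_new_tokens": max_new_tokens,  # how many tokens to generate (not the vocabulary size)
        "pad_token_id": pad_token_id,      # explicit pad id for batched generation
        "eos_token_id": eos_token_id,      # generation stops once this id is produced
    }


# Usage with an already loaded model and tokenizer (not executed here):
# gen_kwargs = build_generate_kwargs(tokenizer.pad_token_id, tokenizer.eos_token_id)
# out = model.generate(input_ids=input_ids, attention_mask=attention_mask, do_sample=True, **gen_kwargs)

print(build_generate_kwargs(pad_token_id=0, eos_token_id=11, max_new_tokens=32))
```

The padding_idx assignment is a separate modification; the warning quoted above does not seem to be about it.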
Yes, I agree that pad is assigned to eos, and eos is still eos. But during fine-tuning the weights with respect to eos are now unchanged. This might be an issue, since the probability of eos has not shifted to the fine-tuning regime. One possibility is that eos is output with lower probability. Yes, we can still halt generation when we see eos, but we have not shifted the probability of outputting eos toward our fine-tuning distribution, while all other tokens have changed distribution. I think this could be an issue: it’s not as if the old probability of eos is conserved, since all token probabilities have changed except eos, and even if the old eos probability were conserved, it would be with respect to the wrong distribution (not the fine-tuning one).
e.g.,
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
...
raw_text_batch='a'
tokenize_batch={'input_ids': tensor([[ 64, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 0, 0, 0, 0]])}
but it would have been better to have
tokenize_batch={'input_ids': tensor([[ 64, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 0, 0, 0]])}
code
def test_eos_pad():
    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    raw_text_batch = 'a'

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    # print(f'{tokenizer.eos_token=}')
    # print(f'{tokenizer.eos_token_id=}')
    # print(f'{tokenizer.pad_token=}')
    # print(f'{tokenizer.pad_token_id=}')

    # print(f'{raw_text_batch=}')
    # tokenize_batch = tokenizer(raw_text_batch, padding="max_length", max_length=5, truncation=True, return_tensors="pt")
    # print(f'{tokenize_batch=}')

    # GPT-2 ships without a pad token, so reuse eos for padding
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    probe_network = GPT2LMHeadModel.from_pretrained("gpt2")
    device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
    probe_network = probe_network.to(device)

    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.eos_token_id=}')
    print(f'{tokenizer.pad_token=}')
    print(f'{tokenizer.pad_token_id=}')

    print(f'{raw_text_batch=}')
    tokenize_batch = tokenizer(raw_text_batch, padding="max_length", max_length=5, truncation=True, return_tensors="pt")
    print(f'{tokenize_batch=}')
    print('Done')
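To make the desired attention mask ([1, 1, 0, 0, 0]) concrete, here is a minimal plain-Python sketch of right-padding with eos while keeping the first eos visible. The function name pad_keep_first_eos is hypothetical, and it works on lists rather than tensors so it can run without downloading a model.

```python
def pad_keep_first_eos(token_ids: list[int], max_length: int, eos_token_id: int) -> dict[str, list[int]]:
    """Truncate, then right-pad with EOS, leaving the first EOS visible in the attention mask."""
    ids = token_ids[:max_length]          # truncate to max_length
    n_real = len(ids)                     # number of real (non-pad) tokens
    n_pad = max_length - n_real
    input_ids = ids + [eos_token_id] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad
    if n_pad > 0:
        attention_mask[n_real] = 1        # first EOS acts as the real end-of-sequence marker
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_keep_first_eos([64], max_length=5, eos_token_id=50256)
print(batch)
# {'input_ids': [64, 50256, 50256, 50256, 50256], 'attention_mask': [1, 1, 0, 0, 0]}
```

If the text already fills max_length there is no room for an eos, so the mask is all ones and no eos is appended.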
I’m still confused.
"if a model does not have a padding token already (which is common for decoder-only models because they are trained on blocks which do not have any padding). So you never “unlearn” anything."
is true, but then during training both eos and pad will be masked, so there is now a “wrong” distribution shift for generating EOS. How can this be fixed? See the details above.
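The concern can be reproduced on toy data. To my understanding, a collator such as DataCollatorForLanguageModeling (with mlm=False) sets every label equal to pad_token_id to -100. The plain-Python sketch below mimics only that rule; the function name and the pad id 65024 are assumptions for illustration.

```python
IGNORE_INDEX = -100


def labels_masking_every_pad(input_ids: list[int], pad_token_id: int) -> list[int]:
    """Mimic a collator that ignores every position whose token equals pad_token_id."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


EOS = 11
# [text, text, real EOS, pad, pad] with pad == EOS
input_ids = [25070, 2288, EOS, EOS, EOS]
print(labels_masking_every_pad(input_ids, pad_token_id=EOS))
# [25070, 2288, -100, -100, -100]  -> the real EOS at index 2 is masked as well

# With a dedicated pad id (hypothetical id 65024) only the padding is ignored.
PAD = 65024
input_ids_separate_pad = [25070, 2288, EOS, PAD, PAD]
print(labels_masking_every_pad(input_ids_separate_pad, pad_token_id=PAD))
# [25070, 2288, 11, -100, -100]
```

So with pad = eos and this masking rule, the model never gets a training signal for the real eos, which is the shift described above.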
Hi all! There’s an interesting story here.
In general you are correct that causal LMs like Falcon are not trained with a pad token, and so the tokenizer does not have one set. This is true for a lot of causal LMs in the Hub. During training, these models are often fed sequences that have been concatenated together and truncated at the maximum sequence length, and so there is never any empty space that needs padding.
The reason we add one later is because a lot of downstream methods use padding and attention masks in some way. However, in many cases it doesn’t really matter what you set the padding token to! This is because the padded tokens will generally be masked by setting the attention_mask to 0, so those tokens will not be attended to by the rest of the sequence.
However, one place the choice of padding token can matter is in the labels when fine-tuning the model. This is because in standard CLM training, the labels are the inputs, shifted by a single position. This would mean that in the final position of the sequence before the padding at the end, the label at that position will be the padding token. When training models with shorter sequences (such as for chat), we generally want them to mark the end of the text they’ve generated, using a token like eos_token. As a result, we commonly just use eos_token as the padding token.
However, depending on your fine-tuning task, you may not want the model to learn to predict eos_token at the end of a sequence. If this is the case, simply change the label at that position to the token you do want, or set the label to -100 to mask the label at that position.
Does that answer the questions you had? Feel free to let me know if I missed anything here!
Yes this is what I was going to do because I’m doing fine-tuning for code where syntax matters.
But I need the code. I’ve not had time to write it down. When I do I will share here. To clarify this is what I plan to do:
In the collate function, for every sequence in the batch, set the attention mask to 1 at the position of the first EOS token.
why -100? what does this achieve?
The value -100 is a special label value used by HuggingFace’s Transformers library (it is the default ignore_index of PyTorch’s cross-entropy loss) to indicate that a particular position should be ignored when computing the loss.
why not mask = 0 for the indices you want to not train on?
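A rough way to see the difference: attention_mask = 0 only controls which positions are attended to; it does not remove a position from the loss. The label -100 is what drops a position from the loss. Below is a small plain-Python sketch of a mean cross-entropy that skips ignored labels. It is an illustration of the idea with made-up logits, not the library implementation.

```python
import math

IGNORE_INDEX = -100


def mean_cross_entropy(logits: list[list[float]], labels: list[int], ignore_index: int = IGNORE_INDEX) -> float:
    """Mean cross-entropy over positions whose label is not ignore_index."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position contributes nothing to the loss (and hence no gradient)
        log_z = math.log(sum(math.exp(x) for x in row))  # log of the softmax normaliser
        losses.append(log_z - row[label])                # -log softmax(row)[label]
    # returning 0.0 when every label is ignored is a choice made for this sketch
    return sum(losses) / len(losses) if losses else 0.0


# 3 positions, vocabulary of size 2; the last position is confidently wrong for label 1
logits = [[0.0, 0.0], [0.0, 0.0], [5.0, -5.0]]
loss_all = mean_cross_entropy(logits, [0, 1, 1])
loss_masked = mean_cross_entropy(logits, [0, 1, IGNORE_INDEX])
print(loss_all, loss_masked)  # the masked loss equals log(2), the other one is larger
```

Setting attention_mask to 0 at the last position would leave loss_all unchanged in this picture, because the label there would still be counted.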
Ok I think this is the code:
import torch
from transformers import PreTrainedTokenizer


def custom_collate_fn_train_on_first_eos_occurrence(data: list[dict[str, str]],
                                                    tokenizer: PreTrainedTokenizer,
                                                    context_length: int,
                                                    ) -> dict[str, torch.Tensor]:
    # Ensure tokenizer has a padding token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Extract sequences
    sequences: list[str] = [example.get("text", "") or "" for example in data]

    # Tokenize the sequences
    tokenized_data = tokenizer(sequences, padding="max_length", max_length=context_length, truncation=True, return_tensors="pt")

    # Clone input_ids to labels
    tokenized_data["labels"] = tokenized_data["input_ids"].clone()

    # Set the mask value for the first eos_token in each sequence to 1
    eos_token_id = tokenizer.eos_token_id
    for idx, input_ids in enumerate(tokenized_data["input_ids"]):
        # Find all occurrences of eos_token
        eos_positions = (input_ids == eos_token_id).nonzero(as_tuple=True)[0]
        if eos_positions.nelement() > 0:  # Check if eos_token is present
            first_eos_position = eos_positions[0]
            tokenized_data["attention_mask"][idx, first_eos_position] = 1  # Set the mask value to 1

            # Assert that the label for the first occurrence of eos_token is eos_token_id
            assert tokenized_data["labels"][idx, first_eos_position] == eos_token_id, "The label for the first eos_token is incorrect!"

            # For all subsequent occurrences of eos_token, set their labels to -100
            for subsequent_eos_position in eos_positions[1:]:
                tokenized_data["labels"][idx, subsequent_eos_position] = -100
                assert tokenized_data["labels"][idx, subsequent_eos_position] == -100, "The label for a subsequent eos_token is incorrect! Should be -100."

    return tokenized_data
It seems like you know a lot about how this works. So, if setting tokenizer.pad_token = tokenizer.eos_token causes Falcon to infinitely generate text up to the cutoff point, how do you stop this from happening? Do you have time to provide a code snippet? All I can think of is:
raw_pad_token = "<pad>"
processed_token = tokenizer(raw_pad_token)
tokenizer.pad_token = processed_token
But based on this thread, this isn’t enough to work
Hi @brando @maxolotl @Rocketknight1
Best way to fix this issue is to change the processing template:
from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing
text = "Random text"
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
print(tokenizer(text)) # base tokenizer
# {'input_ids': [25070, 2288], 'token_type_ids': [0, 0], 'attention_mask': [1, 1]}
tokenizer._tokenizer.post_processor = TemplateProcessing(
    single="$A " + tokenizer.eos_token,
    pair="$A " + tokenizer.eos_token + " $B:1 " + tokenizer.eos_token + ":1",
    special_tokens=[(tokenizer.eos_token, tokenizer.eos_token_id)],
)
print(tokenizer(text)) # Updated tokenizer with EOS token
# {'input_ids': [25070, 2288, 11], 'token_type_ids': [0, 0, 0], 'attention_mask': [1, 1, 1]}
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 5
print(tokenizer(text, padding="max_length")) # Updated tokenizer with EOS token and padding
# {'input_ids': [25070, 2288, 11, 11, 11], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
Note that the model has to learn to predict the eos token through causal language modeling.
But the distribution shift is conditional, since decoders are autoregressive: my prediction is that only (or mostly) the probability of eos given that some eos tokens have already been seen will be increased. However, I’ve seen in other places that a model fine-tuned with eos = pad predicts eos way too much, e.g., only predicts eos. So if this is really an issue, one way to fix it is to mask the remaining eos tokens.
Code for that:
# -- Define custom collate function
def custom_collate_fn(data: list[dict[str, str]],
                      tokenizer: PreTrainedTokenizer,
                      max_length: int,
                      ) -> dict[str, torch.Tensor]:
    """ trains on first occurrence of eos

    ref: https://discuss.huggingface.co/t/why-does-the-falcon-qlora-tutorial-code-use-eos-token-as-pad-token/45954/13?u=brando
    ref: https://chat.openai.com/share/02d16770-a1f3-4bf4-8fc2-464286daa8a1
    ref: https://claude.ai/chat/80565d1f-ece3-4fad-87df-364ce57aec15 on when to call .clone()
    """
    # we are training full context length for llama so remove code below, if it tries to pad hopefully it throws an error
    # -- Ensure tokenizer has a padding token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    # -- Extract sequences
    # sequences: list[str] = [example.get("text", "") or "" for example in data]
    sequences: list[str] = []
    for idx, example in enumerate(data):
        # Retrieve the value for "text" from the dictionary or default to an empty string if not present or falsy. ref: https://chat.openai.com/share/bead51fe-2acf-4f05-b8f7-b849134bbfd4
        text: str = example.get("text", "") or ""
        sequences.append(text)
    # -- Tokenize the sequences
    tokenized_data = tokenizer(sequences, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
    tokenized_data["labels"] = tokenized_data["input_ids"].clone()  # labels is hardcoded in HF so put it!
    # -- Set the mask value for the first eos_token in each sequence to 1
    eos_token_id = tokenizer.eos_token_id
    for idx, input_ids in enumerate(tokenized_data["input_ids"]):
        # Find all occurrences of eos_token
        eos_positions = (input_ids == eos_token_id).nonzero(as_tuple=True)[0]
        if eos_positions.nelement() > 0:  # Check if eos_token is present
            first_eos_position = eos_positions[0]
            tokenized_data["attention_mask"][idx, first_eos_position] = 1  # Set the mask value to 1

            # Assert that the label for the first occurrence of eos_token is eos_token_id
            assert tokenized_data["labels"][idx, first_eos_position] == eos_token_id, "The label for the first eos_token is incorrect!"

            # For all subsequent occurrences of eos_token, set their labels to -100
            for subsequent_eos_position in eos_positions[1:]:
                tokenized_data["labels"][idx, subsequent_eos_position] = -100
                assert tokenized_data["labels"][idx, subsequent_eos_position] == -100, "The label for the subsequent_eos_position is incorrect! Should be -100."
    return tokenized_data
@ccdv can you explain why you are doing this? What is it meant to do and how does it solve our problem? Thanks in advance!
I currently like this answer: Discord
If you are really bothered by this, you can write a custom data collator that masks the all-but-first EOS token. In my experience that is not necessary.
Personally this is the solution I recommend. But I’ve not tested if Falcon stops producing too many EOSs:
def collate_fn_train_only_first_eos_token_mask_everything_after_it(data: list[dict[str, str]],
                                                                   tokenizer: PreTrainedTokenizer,
                                                                   max_length: int = 1024,  # GPT2 default; likely worth changing! This default might cause bugs.
                                                                   ) -> dict[str, torch.Tensor]:
    """ Train only on the first occurrence of eos. The remaining eos are masked out.

    Sometimes the model might not have a padding token. Sometimes people set the padding token to be the eos token.
    But sometimes this seems to lead the model to predict the eos token too much.
    So instead of actually using the pad token that was set to the eos token, we instead mask out all excessive eos tokens that act as pads
    and leave the first eos token at the end to be predicted -- since that is the only one that semantically means end of sequence
    and thereby, by not training on random eos at the end (they are masked), we do not unnecessarily shift/amplify the distribution of eos.

    ref: https://discuss.huggingface.co/t/why-does-the-falcon-qlora-tutorial-code-use-eos-token-as-pad-token/45954/13?u=brando
    ref: https://chat.openai.com/share/02d16770-a1f3-4bf4-8fc2-464286daa8a1
    ref: https://claude.ai/chat/80565d1f-ece3-4fad-87df-364ce57aec15 on when to call .clone()
    """
    # we are training full context length for llama so remove code below, if it tries to pad hopefully it throws an error
    # -- Ensure tokenizer has a padding token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    # -- Extract sequences
    # sequences: list[str] = [example.get("text", "") or "" for example in data]
    sequences: list[str] = []
    for idx, example in enumerate(data):
        # Retrieve the value for "text" from the dictionary or default to an empty string if not present or falsy. ref: https://chat.openai.com/share/bead51fe-2acf-4f05-b8f7-b849134bbfd4
        text: str = example.get("text", "") or ""
        sequences.append(text)
    # -- Tokenize the sequences
    tokenized_data = tokenizer(sequences, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
    tokenized_data["labels"] = tokenized_data["input_ids"].clone()  # labels is hardcoded in HF so put it!
    # -- Set the mask value for the first eos_token in each sequence to 1 and remaining to -100
    eos_token_id = tokenizer.eos_token_id
    for idx, input_ids in enumerate(tokenized_data["input_ids"]):
        # Find all occurrences of eos_token
        eos_positions = (input_ids == eos_token_id).nonzero(as_tuple=True)[0]
        if eos_positions.nelement() > 0:  # Check if eos_token is present
            first_eos_position = eos_positions[0]
            tokenized_data["attention_mask"][idx, first_eos_position] = 1  # Set the mask value to 1

            # Assert that the label for the first occurrence of eos_token is eos_token_id
            assert tokenized_data["labels"][idx, first_eos_position] == eos_token_id, "The label for the first eos_token is incorrect!"

            # For all subsequent occurrences of eos_token, set their labels to -100
            for subsequent_eos_position in eos_positions[1:]:
                tokenized_data["labels"][idx, subsequent_eos_position] = -100
                assert tokenized_data["labels"][idx, subsequent_eos_position] == -100, "The label for the subsequent_eos_position is incorrect! Should be -100."
    return tokenized_data
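One caveat with searching for the eos id: if the text itself can contain eos (for example as a document separator), the first match is not the padding boundary, and the real end-of-sequence eos would be masked. A possible variant, sketched in plain Python under the assumption of right padding with pad = eos, locates the boundary from the tokenizer's attention mask instead. The function name is hypothetical and I have not tested this on Falcon.

```python
IGNORE_INDEX = -100


def labels_and_mask_first_eos_only(input_ids: list[list[int]],
                                   attention_mask: list[list[int]],
                                   eos_token_id: int,
                                   ) -> tuple[list[list[int]], list[list[int]]]:
    """Keep the first padding EOS as a training target and ignore the rest.

    Assumes right padding with pad == EOS and an attention_mask that marks the real tokens.
    """
    new_labels, new_masks = [], []
    for ids, mask in zip(input_ids, attention_mask):
        n_real = sum(mask)       # with right padding, index n_real is the first pad position
        labels = list(ids)
        mask = list(mask)
        if n_real < len(ids):    # there is at least one pad position
            assert ids[n_real] == eos_token_id, "expected EOS padding"
            mask[n_real] = 1     # first EOS is attended to and trained on
            for pos in range(n_real + 1, len(ids)):
                labels[pos] = IGNORE_INDEX  # remaining pads are dropped from the loss
        new_labels.append(labels)
        new_masks.append(mask)
    return new_labels, new_masks


EOS = 11
# the second row contains an EOS inside the text, which stays a normal target
input_ids = [[25070, 2288, EOS, EOS, EOS], [7, EOS, 8, EOS, EOS]]
attention_mask = [[1, 1, 0, 0, 0], [1, 1, 1, 0, 0]]
labels, masks = labels_and_mask_first_eos_only(input_ids, attention_mask, EOS)
print(labels)  # [[25070, 2288, 11, -100, -100], [7, 11, 8, 11, -100]]
print(masks)   # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 0]]
```

The same indexing could be done on the tensors returned by the tokenizer; lists are used here only to keep the sketch self-contained.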