I’m a little puzzled about where (and whether) EOS tokens are being added when using Hugging Face’s trainer classes to train a T5 (LongT5, actually) model.
The data set contains pairs of text like this:

| from | to |
| --- | --- |
| some text | some corresponding text |
| some other text | some other corresponding text |
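(For anyone who wants to reproduce the setup: a toy dataset with the same column names and a train/test split can be built like this; the values are just placeholders for the real data.)

```python
from datasets import Dataset

# Toy stand-in with the same column names as the real data.
dataset = Dataset.from_dict({
    "from": ["some text", "some other text"],
    "to": ["some corresponding text", "some other corresponding text"],
}).train_test_split(test_size=0.5)
# dataset["train"] and dataset["test"] are what the code below operates on.
```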
The tokenizer has been custom trained:
```python
from tokenizers import SentencePieceUnigramTokenizer

tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train_from_iterator(
    iterator=iterator,  # iterator over the raw training texts
    vocab_size=32_128,
    show_progress=True,
    unk_token="<unk>",
)
```
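The trained tokenizer is then written to disk, roughly like this (it produces the `tokenizer.json` file loaded in the next step):

```python
# Serialize the trained tokenizer; the path matches the one used below.
tokenizer.save("data-rb-25000/tokenizer.json")
```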
and is loaded like this:
```python
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast(
    tokenizer_file="data-rb-25000/tokenizer.json",
    padding=True,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
)
```
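If I understand correctly, the stock T5 fast tokenizer appends `</s>` via a post-processor stored inside its `tokenizer.json`; whether my custom-trained file has any such post-processor can be inspected like this (diagnostic only):

```python
# Does the backend tokenizer carry a post-processor that could append </s>?
print(tokenizer.backend_tokenizer.post_processor)

# Compare tokenization with and without special-token handling.
print(tokenizer("Hello world").input_ids)
print(tokenizer("Hello world", add_special_tokens=False).input_ids)
```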
Before training, the data set is tokenized, and examples whose token count is too high are filtered out, like so:
```python
MAX_SEQUENCE_LENGTH = 16_384 // 2  # 8192

def preprocess_function(examples):
    inputs = tokenizer(
        examples["from"],
        truncation=False,  # don't truncate yet
        padding=False,     # don't pad yet
        return_length=True,
    )
    labels = tokenizer(
        examples["to"],
        truncation=False,
        padding=False,
        return_length=True,
    )
    inputs["input_length"] = inputs["length"]
    inputs["labels"] = labels["input_ids"]
    inputs["label_length"] = labels["length"]
    inputs.pop("length", None)
    return inputs

tokenized_data = dataset.map(
    preprocess_function, batched=True, remove_columns=dataset["train"].column_names
)

def filter_function(example):
    return (
        example["input_length"] <= MAX_SEQUENCE_LENGTH
        and example["label_length"] <= MAX_SEQUENCE_LENGTH
    )

filtered_data = tokenized_data.filter(filter_function)
```
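To see what actually ends up in the processed dataset, a single example can be inspected like this (again purely diagnostic):

```python
example = filtered_data["train"][0]
print(example["input_ids"][-5:])   # last few input token ids
print(example["labels"][-5:])      # last few label token ids
print(tokenizer.eos_token_id)      # does either sequence end with this id?
```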
Training is done like this:
```python
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model="google/long-t5-tglobal-base")
```
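As far as I can tell the collator only pads (inputs with the pad token, labels with -100 by default), but I may be misreading it; its output can be peeked at like this:

```python
# Collate two already-tokenized examples and look at the result.
features = [
    {k: filtered_data["train"][i][k] for k in ("input_ids", "attention_mask", "labels")}
    for i in range(2)
]
batch = data_collator(features)
print(batch["input_ids"])  # padded with tokenizer.pad_token_id
print(batch["labels"])     # padded with -100 (ignored by the loss)
# Note: `model` was passed as a string above, so the collator does not call
# prepare_decoder_input_ids_from_labels here.
```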
```python
from transformers import AutoModelForSeq2SeqLM, AutoConfig

config = AutoConfig.from_pretrained(
    "google/long-t5-tglobal-base",
    vocab_size=len(tokenizer),
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    decoder_start_token_id=tokenizer.pad_token_id,
)
model = AutoModelForSeq2SeqLM.from_config(config)
```
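(A quick sanity check that the freshly initialized model matches the custom vocabulary and special token ids:)

```python
print(len(tokenizer), model.config.vocab_size)    # should match
print(model.get_input_embeddings().weight.shape)  # (vocab_size, d_model)
print(model.config.eos_token_id, model.config.pad_token_id, model.config.decoder_start_token_id)
```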
```python
from transformers import GenerationConfig

generation_config = GenerationConfig.from_model_config(model.config)
generation_config._from_model_config = False
generation_config.max_new_tokens = 16_384
```
```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="rb-25000-model",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    logging_steps=1,
    predict_with_generate=True,
    load_best_model_at_end=True,
    bf16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=filtered_data["train"],
    eval_dataset=filtered_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    generation_config=generation_config,
)
trainer.train()
```
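For completeness, inference after training is invoked roughly like this (a simplified sketch, not the exact code):

```python
# Minimal generation sketch; real inputs are taken from the dataset.
model.eval()
enc = tokenizer("some text", return_tensors="pt").to(model.device)
out = model.generate(**enc, generation_config=generation_config)
print(tokenizer.decode(out[0], skip_special_tokens=False))
```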
I know that the tokenizer doesn’t add the EOS token:
```python
inputs = tokenizer(['Hello world', 'Hello'], padding=True, truncation=True, max_length=100, return_tensors="pt")
labels = inputs["input_ids"]
print(labels)
print(tokenizer.convert_tokens_to_ids(['<s>'])[0])
print(tokenizer.convert_tokens_to_ids(['<pad>'])[0])
print(tokenizer.convert_tokens_to_ids(['<unk>'])[0])
print(tokenizer.convert_tokens_to_ids(['</s>'])[0])
print(tokenizer.convert_ids_to_tokens([1]))
```
Output:
```
tensor([[1, 10356, 1, 5056],
        [1, 10356, 16002, 16002]])
16000
16002
0
16001
['▁']
```
(I don’t really understand what that strange token with index 1 is.)
Anyway, I was wondering whether the Trainer class or the DataCollator actually adds the EOS token. I did not find any examples online of how and where to add it.
I suspect it isn’t being added, because after training, the model doesn’t stop generating until it reaches max_new_tokens (which is set pretty high).
What’s the best practice here? Where should I add EOS? Is there anything else about this code that should be checked, or that looks odd to more experienced eyes?
Thank you!