I'm using `Seq2SeqTrainer` to train my model with a custom tokenizer. The base model is BART Chinese (`fnlp/bart-base-chinese`). With the original BART Chinese tokenizer, the output is normal. Yet when I swap in a tokenizer that I made myself, the output of `compute_metrics`, specifically the `preds` part of `EvalPrediction`, is incorrect: the decoded text becomes garbage.

The code is as follows:
```python
from transformers import (
    BartForConditionalGeneration,
    EarlyStoppingCallback,
    IntervalStrategy,
    Seq2SeqTrainer,
)

# checkpoint, hf_token, tokenizer, data_collator, compute_metrics, the
# tokenized datasets, and CustomSeq2SeqTrainingArguments (my subclass of
# Seq2SeqTrainingArguments) are defined elsewhere.
model = BartForConditionalGeneration.from_pretrained(checkpoint)
model.resize_token_embeddings(len(tokenizer))  # resize to the custom vocab
model.config.vocab_size = len(tokenizer)

steps = 500  # small value for debug purposes
batch_size = 4

training_args = CustomSeq2SeqTrainingArguments(
    output_dir="my_output_dir",
    evaluation_strategy=IntervalStrategy.STEPS,
    optim="adamw_torch",
    eval_steps=steps,
    logging_steps=steps,
    save_steps=steps,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=30,
    predict_with_generate=True,
    remove_unused_columns=False,
    fp16=True,  # save memory
    metric_for_best_model="bleu",
    load_best_model_at_end=True,
    report_to="wandb",
    # Hugging Face Hub related
    hub_token=hf_token,
    push_to_hub=True,
    save_safetensors=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
```
where `tokenizer` is my custom tokenizer. The result is normal if I use the original tokenizer instead (`tokenizer = BertTokenizer.from_pretrained(checkpoint)`).
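For illustration, the swap itself is just loading from a different location (the directory name below is a placeholder for wherever my tokenizer files actually live), and a simple round trip outside the `Trainer` is how I exercise the tokenizer on its own:

```python
from transformers import BertTokenizer

# Placeholder path; in my setup this points at my custom tokenizer files.
tokenizer = BertTokenizer.from_pretrained("path/to/my_custom_tokenizer")

# Round-trip a sample sentence to exercise encode/decode outside the Trainer.
ids = tokenizer("这是一个测试句子。")["input_ids"]
print(ids)
print(tokenizer.decode(ids, skip_special_tokens=True))
```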
The `compute_metrics` function is as follows:
```python
import numpy as np
import evaluate

# The metrics are loaded via the evaluate library (sacrebleu / chrF).
metric_bleu = evaluate.load("sacrebleu")
metric_chrf = evaluate.load("chrf")


def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    print("Preds and Labels:", preds[0], labels[0])
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 (ignored positions) with the pad token ID before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    print("Decoded Preds (before postprocess):", decoded_preds[0])
    print("Decoded Labels (before postprocess):", decoded_labels[0])
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    print("Decoded Preds:", decoded_preds[0])
    print("Decoded Labels:", decoded_labels[0])
    result_bleu = metric_bleu.compute(predictions=decoded_preds, references=decoded_labels, tokenize="zh")
    result_chrf = metric_chrf.compute(predictions=decoded_preds, references=decoded_labels, word_order=2)
    results = {"bleu": result_bleu["score"], "chrf": result_chrf["score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    results["gen_len"] = np.mean(prediction_lens)
    results = {k: round(v, 4) for k, v in results.items()}
    return results
```
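If it helps to reproduce, `compute_metrics` can also be exercised directly with a hand-built `EvalPrediction` (the token IDs below are made up, not from a real run):

```python
import numpy as np
from transformers import EvalPrediction

# Made-up token IDs purely to exercise the function; during training the
# predictions come from generate() because predict_with_generate=True.
fake_preds = np.array([[101, 800, 900, 102]])
fake_labels = np.array([[101, 800, 900, 102, -100]])
print(compute_metrics(EvalPrediction(predictions=fake_preds, label_ids=fake_labels)))
```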
From the debug messages, the output sentences do not make sense and consist only of weird characters. I think the model does not recognize the token IDs produced by my custom tokenizer.
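One thing I am not sure about is whether the special token IDs the model generates with still line up with my custom tokenizer after the swap; this sketch shows the kind of comparison I mean (`tokenizer` here is the custom one):

```python
# IDs the model uses when generating...
print("model:",
      model.config.decoder_start_token_id,
      model.config.bos_token_id,
      model.config.eos_token_id,
      model.config.pad_token_id)
# ...versus the IDs the custom tokenizer assigns to its special tokens.
print("tokenizer:",
      tokenizer.cls_token_id,
      tokenizer.bos_token_id,
      tokenizer.eos_token_id,
      tokenizer.pad_token_id)
```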
How should I tackle this problem? My goal is to train the model with my custom tokenizer.