Seq2SeqTrainer produces incorrect EvalPrediction after switching to a different tokenizer

I’m using Seq2SeqTrainer to train my model with a custom tokenizer. The base model is BART Chinese (fnlp/bart-base-chinese). If the original BART Chinese tokenizer is used, the output is normal. However, when I swap in a tokenizer that I built myself, the output of compute_metrics is wrong: the preds part of EvalPrediction decodes to garbage text.

The code is as follows:

from transformers import (
    BartForConditionalGeneration,
    EarlyStoppingCallback,
    IntervalStrategy,
    Seq2SeqTrainer,
)

model = BartForConditionalGeneration.from_pretrained(checkpoint)
# resize the embedding matrix to match the custom tokenizer's vocabulary
model.resize_token_embeddings(len(tokenizer))
model.config.vocab_size = len(tokenizer)

steps = 500 # small value for debug purpose
batch_size = 4
training_args = CustomSeq2SeqTrainingArguments(
    output_dir = "my_output_dir",
    evaluation_strategy = IntervalStrategy.STEPS,
    optim = "adamw_torch",
    eval_steps = steps,
    logging_steps = steps,
    save_steps = steps,
    learning_rate = 2e-5,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    weight_decay = 0.01,
    save_total_limit = 1,
    num_train_epochs = 30,
    predict_with_generate = True,
    remove_unused_columns = False, 
    fp16 = True, # save memory
    metric_for_best_model = "bleu",
    load_best_model_at_end = True,
    report_to = "wandb",
    # HuggingFace Hub related
    hub_token = hf_token,
    push_to_hub = True,
    save_safetensors = True,
)

trainer = Seq2SeqTrainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_train_dataset,
    eval_dataset = tokenized_eval_dataset,
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)],
)

where tokenizer is my custom tokenizer. The result is normal if I use the original tokenizer instead (tokenizer = BertTokenizer.from_pretrained(checkpoint)).
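For reference, the only difference between the working run and the broken one is how tokenizer is constructed. Below is a minimal sketch of the two variants; the custom tokenizer is shown as a PreTrainedTokenizerFast loaded from a tokenizer.json file, which is only an assumption for illustration (the path and special tokens are placeholders, and my real tokenizer may be built differently):

from transformers import BertTokenizer, PreTrainedTokenizerFast

# Variant 1: original BART Chinese tokenizer -> decoded preds look fine
tokenizer = BertTokenizer.from_pretrained(checkpoint)

# Variant 2: my custom tokenizer -> decoded preds are garbage
# (illustrative construction only; path and special tokens are placeholders)
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file = "my_tokenizer/tokenizer.json",
    unk_token = "[UNK]",
    pad_token = "[PAD]",
    cls_token = "[CLS]",
    sep_token = "[SEP]",
    mask_token = "[MASK]",
)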

The compute_metrics function is as follows:

import numpy as np
import evaluate

# assuming sacreBLEU and chrF are loaded through the evaluate library
metric_bleu = evaluate.load("sacrebleu")
metric_chrf = evaluate.load("chrf")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds

    print("Preds and Labels:", preds[0], labels[0])
    
    if isinstance(preds, tuple):
        preds = preds[0]
    # decode the generated token IDs with the (custom) tokenizer
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # replace -100 (ignored label positions) with the pad token ID before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    print("Decoded Preds (before postprocess):", decoded_preds[0])
    print("Decoded Labels (before postprocess):", decoded_labels[0])

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    print("Decoded Preds:", decoded_preds[0])
    print("Decoded Labels:", decoded_labels[0])

    result_bleu = metric_bleu.compute(predictions=decoded_preds, references=decoded_labels, tokenize='zh')
    result_chrf = metric_chrf.compute(predictions=decoded_preds, references=decoded_labels, word_order=2)
    results = {"bleu": result_bleu["score"], "chrf": result_chrf["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    results["gen_len"] = np.mean(prediction_lens)
    results = {k: round(v, 4) for k, v in results.items()}
    return results

From the debug messages, the decoded predictions do not make sense and consist only of seemingly random characters. I suspect the model does not recognize the token IDs produced by my custom tokenizer.
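To make that suspicion concrete, a check along these lines (the sentence is only a placeholder) would show whether the two tokenizers map the same text to different ID sequences, and whether the same IDs decode to different strings:

from transformers import BertTokenizer

original_tokenizer = BertTokenizer.from_pretrained(checkpoint)
custom_tokenizer = tokenizer  # my custom tokenizer from above

text = "今天天气很好"  # placeholder sentence

# compare how the two tokenizers encode the same sentence
print(original_tokenizer(text)["input_ids"])
print(custom_tokenizer(text)["input_ids"])

# decode the same IDs with both tokenizers to see whether they agree
ids = original_tokenizer(text)["input_ids"]
print(original_tokenizer.decode(ids, skip_special_tokens=True))
print(custom_tokenizer.decode(ids, skip_special_tokens=True))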

How should I tackle this problem? My goal is to train the model with my custom tokenizer.