Different model performance after saving and loading Donut model

Hi @nielsr, I’m seeing a difference between my fine-tuned Donut model’s predictions before saving the model and after loading it back.

  • Before saving, the model extracts the information from a document image correctly
  • After saving and loading the model back, it only predicts the start and end tokens

I used the same training and inference code earlier this year without any issue, so I’m not sure why this is happening now. I’ve tried the following, but the issue persists:

  • Saving and loading both via the HF Hub and locally, and also loading from a checkpoint
  • Using AutoProcessor instead of DonutProcessor
  • Different ways of saving the model, e.g. Trainer.save_model() and Trainer.model.save_pretrained() (see the sketch below)
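
For reference, the two save variants looked roughly like this ("donut-test" stands in for my actual output directory):

# Variant 1: let the Trainer handle saving (also writes the training args)
trainer.save_model("donut-test")

# Variant 2: save the underlying model directly
trainer.model.save_pretrained("donut-test")

# In both cases the processor was saved alongside the model
processor.save_pretrained("donut-test")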

Appreciate your help on this!

Library versions

  • transformers=4.42.3
  • torchvision=0.18.1
  • torch=2.3.1
  • accelerate=0.31.0
  • huggingface-hub=0.23.4

Training

from huggingface_hub import HfFolder
from transformers import (
    DonutProcessor,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    VisionEncoderDecoderModel,
)

# `processor` is my DonutProcessor with the task-specific tokens already
# added to its tokenizer (setup omitted here)

# Load model from huggingface.co
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Resize embedding layer to match vocabulary size
new_emb = model.decoder.resize_token_embeddings(len(processor.tokenizer))
print(f"New embedding size: {new_emb.num_embeddings}")
# Adjust our image size and output sequence lengths
# (`size` is a dict in recent transformers versions, and `feature_extractor`
# is a deprecated alias of `image_processor`)
model.config.encoder.image_size = [
    processor.image_processor.size["height"],
    processor.image_processor.size["width"],
]
model.config.decoder.max_length = len(max(processed_dataset["train"]["labels"], key=len))

# Add task token for decoder to start
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.decoder_start_token_id = processor.tokenizer.convert_tokens_to_ids(['<s>'])[0]

hf_repository_id = "donut-test"

# Arguments for training
training_args = Seq2SeqTrainingArguments(
    output_dir=hf_repository_id,
    num_train_epochs=1,       #10
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    weight_decay=0.01,
    fp16=True,
    logging_steps=100,
    save_total_limit=2,
    evaluation_strategy="no",
    save_strategy="epoch",
    predict_with_generate=True,
    # push to hub parameters
    report_to="tensorboard",
    push_to_hub=True,
    hub_strategy="every_save",
    hub_model_id=hf_repository_id,
    hub_token=HfFolder.get_token(),
)

# Create Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset["train"],
)

trainer.train()

processor.save_pretrained(hf_repository_id)
trainer.create_model_card()
trainer.push_to_hub()
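
As an extra check (not part of my original script), I also reloaded the model straight from the output directory to confirm that the saved tokenizer and the resized embedding matrix still line up:

# Sanity check: reload from the output directory and confirm that the saved
# tokenizer and the resized decoder embedding matrix are still the same size
reloaded_processor = DonutProcessor.from_pretrained(hf_repository_id)
reloaded_model = VisionEncoderDecoderModel.from_pretrained(hf_repository_id)
assert len(reloaded_processor.tokenizer) == reloaded_model.decoder.get_input_embeddings().weight.shape[0]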

Inference

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Load our model from Hugging Face, move to GPU if available
processor = DonutProcessor.from_pretrained("myrepo/donut-test")
model = VisionEncoderDecoderModel.from_pretrained("myrepo/donut-test")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

test_image = Image.open("./test_image.jpg")
test_pixel_values = processor(test_image, return_tensors="pt").pixel_values

task_prompt = "<s>"
decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids

outputs = model.generate(
    test_pixel_values.to(device),
    decoder_input_ids=decoder_input_ids.to(device),
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    num_beams=1,
    temperature=0.000001,  # no effect here: decoding is greedy without do_sample=True
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

# process output
prediction = processor.batch_decode(outputs.sequences)[0]
prediction = processor.token2json(prediction)
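
(For completeness, here is the standard Donut post-processing from the official example, which strips the EOS/PAD tokens and the leading task token before converting to JSON; it assumes the variables above.)

import re

# Clean up the raw sequence before converting it to JSON
sequence = processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove the first task start token
prediction = processor.token2json(sequence)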

Hi,

Thanks for your interest in Donut. I would recommend verifying that the weights you load back after saving are identical to the ones the model had in memory before saving; that tells you whether the problem lies in the save/load step or in the inference code.
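
For example, something along these lines (a minimal sketch: it assumes the trained model is still in memory as `trainer.model` and uses your example repo name):

import torch
from transformers import VisionEncoderDecoderModel

# Reload the saved model and compare it tensor by tensor with the trained one
reloaded = VisionEncoderDecoderModel.from_pretrained("myrepo/donut-test")

trained_state = trainer.model.state_dict()
reloaded_state = reloaded.state_dict()

mismatched = [
    name
    for name, tensor in trained_state.items()
    if not torch.allclose(tensor.cpu().float(), reloaded_state[name].cpu().float())
]
print(f"{len(mismatched)} mismatched tensors")  # 0 means save/load preserved the weights

If the count is non-zero, the save/load step is losing the weights; if it is zero, the regression is on the inference side (e.g. the processor or the generation settings).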