Fine-Tune dit-large-finetuned-rvlcdip

I’m trying to fine-tune “microsoft/dit-large-finetuned-rvlcdip” on my custom dataset which is made up of three document classes (“Altro”, “Fattura”, “Prescrizione”).
Below are the training parameters:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=str(output_location),
    auto_find_batch_size=True,  # may override per_device_*_batch_size below
    remove_unused_columns=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    save_total_limit=3,
    seed=42,
    load_best_model_at_end=True,
    logging_steps=10,
    optim="adamw_torch",
    bf16=False,
    fp16=False,
    learning_rate=1e-5,
    warmup_ratio=0.1,
    report_to="tensorboard",
)

Training is performed on 2 GPUs (NVIDIA GeForce RTX 2070).
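Since the checkpoint was fine-tuned on the 16-class RVL-CDIP dataset, the classification head has to be re-initialized for the three custom classes. A minimal sketch of how the model might be loaded (the label names come from the post above; `load_model` is a hypothetical helper, not the actual code used):

```python
from transformers import AutoModelForImageClassification

# The three custom document classes from the dataset
labels = ["Altro", "Fattura", "Prescrizione"]
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

def load_model():
    # The checkpoint's head was trained on RVL-CDIP (16 classes), so it
    # must be re-initialized for 3 classes; ignore_mismatched_sizes=True
    # allows the size mismatch instead of raising an error.
    return AutoModelForImageClassification.from_pretrained(
        "microsoft/dit-large-finetuned-rvlcdip",
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
        ignore_mismatched_sizes=True,
    )
```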

As can be seen in the figure below, the problem is that validation loss and validation metrics remain stationary starting from the second epoch.

What could be the cause of this behavior? I expected at least a small change in the loss and validation metrics.
I get the same behavior whether I use more or less training data, and with different data augmentation strategies enabled or disabled.

Hi @elisasylvie,

If you’re using ignore_mismatched_sizes=True, we recently fixed a bug which might be related to this. Could you try installing Transformers from the main branch (pip install --upgrade git+https://github.com/huggingface/transformers.git) and see if the problem persists?


Hi @nielsr, thanks for your quick response.
I followed your suggestion and installed Transformers from the main branch.
Unfortunately I’m not able to start training due to the following error:

RuntimeError: CUDA has been initialized before the notebook_launcher could create a forked subprocess. This likely stems from an outside import causing issues once the notebook_launcher() is called. Please review your imports and test them when running the notebook_launcher() to identify which one is problematic and causing CUDA to be initialized.

I should point out that nothing in the code outside of notebook_launcher() explicitly initializes the CUDA context. The accelerate version is 0.25.0.

Hi. I see the same issue has been opened here, but I don’t know what its current state is.

If possible, the following information will be very helpful:

  • previously (before you installed the Transformers main branch), which accelerate version did you have (the one that didn’t show the “CUDA has been initialized” issue)?
  • if you are running on Google Colab, could you share the notebook (ideally using a public dataset instead of your custom one)?
  • otherwise, could you share your environment info: GPU type, number of GPUs, PyTorch version, CUDA version, etc. (you can run transformers-cli env to get this)?