Fine-Tune dit-large-finetuned-rvlcdip

I’m trying to fine-tune “microsoft/dit-large-finetuned-rvlcdip” on my custom dataset which is made up of three document classes (“Altro”, “Fattura”, “Prescrizione”).
Below are the training parameters:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=str(output_location),
    auto_find_batch_size=True,  # may override per_device_*_batch_size below
    remove_unused_columns=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    save_total_limit=3,
    seed=42,
    load_best_model_at_end=True,
    logging_steps=10,
    optim="adamw_torch",
    bf16=False,
    fp16=False,
    learning_rate=1e-5,
    warmup_ratio=0.1,
    report_to="tensorboard",
)

Training is performed on 2 GPUs (NVIDIA GeForce RTX 2070).
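Since the checkpoint was fine-tuned on the 16-class RVL-CDIP dataset, the classification head has to be re-initialized for the three custom classes. A minimal sketch of how the model might be loaded (the label names come from the post above; `load_model` is a hypothetical helper, not the actual code used):

```python
from transformers import AutoModelForImageClassification

# The three custom document classes from the dataset
labels = ["Altro", "Fattura", "Prescrizione"]
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

def load_model():
    # The checkpoint's head was trained on RVL-CDIP (16 classes), so it
    # must be re-initialized for 3 classes; ignore_mismatched_sizes=True
    # allows the size mismatch instead of raising an error.
    return AutoModelForImageClassification.from_pretrained(
        "microsoft/dit-large-finetuned-rvlcdip",
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
        ignore_mismatched_sizes=True,
    )
```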

As can be seen in the figure below, the problem is that validation loss and validation metrics remain stationary starting from the second epoch.

What could be the cause of this behavior? I expected at least a small change in the loss and validation metrics.
I get the same behavior whether I use more or less training data, and with different data augmentation strategies enabled or disabled.

Hi @elisasylvie,

If you’re using ignore_mismatched_sizes=True, we recently fixed a bug which might be related to this. Could you try installing Transformers from the main branch (pip install --upgrade git+https://github.com/huggingface/transformers.git) and see if the problem persists?


Hi @nielsr, thanks for your quick response.
I followed your suggestion and installed Transformers from the main branch.
Unfortunately I’m not able to start training due to the following error:

RuntimeError: CUDA has been initialized before the notebook_launcher could create a forked subprocess. This likely stems from an outside import causing issues once the notebook_launcher() is called. Please review your imports and test them when running the notebook_launcher() to identify which one is problematic and causing CUDA to be initialized.

I should point out that nothing in the code outside of notebook_launcher() explicitly initializes the CUDA context. The accelerate version is 0.25.0.

Hi. I see the same issue has been opened here, but I don’t know what its current state is.

If possible, the following information will be very helpful:

  • previously (before you installed the Transformers main branch), which accelerate version did you have (the one that didn’t show the “CUDA has been initialized” issue)?
  • if you are running on Google Colab, could you share the notebook (ideally using a public dataset instead of your custom one)?
  • otherwise, could you share your environment info: GPU type, number of GPUs, PyTorch version, CUDA version, etc. (you can run transformers-cli env to get this)?