Hi all!
I am training an RTDETR model on a custom dataset, but saving checkpoints of the best model during training takes about an hour.
That means that even small models on small datasets take a very long time to train.
I am raising this as an issue because other ways of saving models, such as torch.save or the PyTorch Lightning trainer, are almost instantaneous, so it seems to me that the default HF Trainer is almost unusable.
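For reference, this is the kind of direct save I am comparing against (a minimal sketch on the same model object; the file path is just a placeholder):

import torch

# Saving the same model's weights directly with torch.save finishes in seconds
torch.save(model.state_dict(), "weights/rtdetr_best.pt")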
I am new to HF, so I am sure I am missing something here and that the HF Trainer has its use.
Could someone advise me on how to solve this problem, or how to correctly set up a usable Trainer instance?
Thank you in advance.
Code below.
from transformers import Trainer, TrainingArguments

# Define training arguments
training_args = TrainingArguments(
    output_dir="weights",
    num_train_epochs=50,
    max_grad_norm=0.01,
    learning_rate=5e-5,
    warmup_steps=300,
    per_device_train_batch_size=8,
    dataloader_num_workers=0,
    metric_for_best_model="loss",  # or "eval_map"
    greater_is_better=False,  # True if metric_for_best_model="eval_map"
    load_best_model_at_end=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    # save_steps=0.98,
    save_only_model=True,
    save_total_limit=2,
    remove_unused_columns=False,
    eval_do_concat_batches=False,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=pytorch_dataset_train,
    eval_dataset=pytorch_dataset_valid,
    tokenizer=processor,
    data_collator=collate_fn,
    compute_metrics=eval_compute_metrics_fn,
)

# Train the model
trainer.train()