Hi all!
I am training an RTDETR model on a custom dataset, but saving checkpoints of the best model during training takes about an hour.
That means that even small models on small datasets take a very long time to train.
I am raising this as an issue because other ways of saving models, such as torch.save or the PyTorch Lightning trainer, are almost instantaneous, so it seems to me that the default HF Trainer is almost unusable.
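For reference, this is the kind of direct save I am comparing against (a minimal sketch on the same model object; the file path is just a placeholder):

import torch

# Saving the same model's weights directly with torch.save finishes in seconds
torch.save(model.state_dict(), "weights/rtdetr_best.pt")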
I am new to HF, so I am sure I am missing something here and that the HF Trainer has its use.
Could someone advise me on how to solve this problem, or how to correctly set up a usable Trainer instance?
Thank you in advance.
Code below.
from transformers import Trainer, TrainingArguments

# Define training arguments
training_args = TrainingArguments(
    output_dir="weights",
    num_train_epochs=50,
    max_grad_norm=0.01,
    learning_rate=5e-5,
    warmup_steps=300,
    per_device_train_batch_size=8,
    dataloader_num_workers=0,
    metric_for_best_model="loss",  # or "eval_map"
    greater_is_better=False,  # True if metric_for_best_model="eval_map"
    load_best_model_at_end=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    # save_steps=0.98,
    save_only_model=True,
    save_total_limit=2,
    remove_unused_columns=False,
    eval_do_concat_batches=False,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=pytorch_dataset_train,
    eval_dataset=pytorch_dataset_valid,
    tokenizer=processor,
    data_collator=collate_fn,
    compute_metrics=eval_compute_metrics_fn,
)

# Train the model
trainer.train()