Hi. I am practicing with this awesome library and have been training some object detection models on custom datasets.
I save a checkpoint each epoch (the best model plus the last one). However, saving the model takes so much time, even with the save_only_model flag set to True, that it makes training impractical even for very small datasets.
Is there a way to save faster, or to save smaller, stripped-down models that can be loaded later?
Thank you in advance for your time. The trainer code and its output are below.
Best regards to all.
training_args = TrainingArguments(
    output_dir="weights",
    num_train_epochs=20,
    max_grad_norm=0.01,
    learning_rate=5e-5,
    warmup_steps=300,
    per_device_train_batch_size=2,
    # gradient_accumulation_steps=4,
    dataloader_num_workers=0,
    metric_for_best_model="loss",  # or "eval_map"
    greater_is_better=False,       # True if metric_for_best_model="eval_map"
    load_best_model_at_end=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    save_only_model=True,
    remove_unused_columns=False,
    eval_do_concat_batches=False,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=pytorch_dataset_train,
    eval_dataset=pytorch_dataset_valid,
    tokenizer=processor,
    data_collator=collate_fn,
    compute_metrics=eval_compute_metrics_fn,
)
trainer.train()
Output:
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
25%|██▌ | 500/1972 [02:58<07:25, 3.30it/s]{'loss': 129.9299, 'grad_norm': 232.07464599609375, 'learning_rate': 4.401913875598087e-05, 'epoch': 0.51}
50%|█████ | 986/1972 [05:38<04:52, 3.37it/s]
  0%|          | 0/28 [00:00<?, ?it/s]
...
100%|██████████| 28/28 [00:10<00:00, 3.53it/s]