Load from checkpoint not skipping steps

I’m pre-training a DistilBERT model from scratch and saving the model every 300 steps. When I try to load a checkpoint to continue training from, the Trainer reports that it is skipping the already-trained steps, but the progress bar just starts from 0 and it doesn’t start logging or saving until the Trainer passes the number of skipped steps.

import torch
from transformers import DistilBertForMaskedLM, Trainer, TrainingArguments

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = DistilBertForMaskedLM.from_pretrained(
    "/content/drive/My Drive/AIMBert/output_gpu/checkpoint-48000",
    config=config,
).to(device)

training_args = TrainingArguments(
    output_dir="/content/drive/My Drive/AIMBert/output_gpu",
    logging_dir='/content/drive/My Drive/AIMBert/logs_gpu',
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=100,
    save_steps=300,
    save_total_limit=5,
    evaluation_strategy="steps",
    eval_steps=50000,
    seed=42,
    prediction_loss_only=True,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=processed_train_dataset,
    eval_dataset=processed_valid_dataset,
)

trainer.train('/content/drive/My Drive/AIMBert/output_gpu/checkpoint-48000')

What do you mean by “It starts at 0?” Also, which version of transformers are you using?

What do you mean by “It starts at 0?”

The progress bar starts at 0, not at the saved number of steps.

which version of transformers are you using?

I’m using version 3.3.1

For example, I had trained the model until it reached step 48000, which took around 5 hours. When I loaded this checkpoint as in the snippet above, it printed this output:

***** Running training *****
Num examples = 66687128
Num Epochs = 10
Instantaneous batch size per device = 32
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 20839730
Continuing training from checkpoint, will skip to saved global_step
Continuing training from epoch 0
Continuing training from global step 48000
Continuing training from 0 non-embedding floating-point operations
Will skip the first 48000 steps in the first epoch

But the progress bar started at 0, and it took another 5 hours until it reached step 48000 again; only then did it start logging and saving.

It’s normal that the progress bar starts at 0 again and goes through the first 48,000 steps while doing nothing; this is done to get back to the same point in your data as at the time of the checkpoint.
If it takes 5 hours to get there, the cause is very likely that your data loading is too slow, because that is the only thing the Trainer does for those steps (no model update, no evaluation, no logging, no saving, just going through the batches).
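
A quick way to check whether data loading really is the bottleneck is to time raw iteration over the training dataloader, with no model involved. This is only a diagnostic sketch; it assumes the trainer from the snippet above has already been built, and the number of batches to time (200) is arbitrary:

import time

# Time pure data loading for a few hundred batches (no forward/backward pass).
# If this alone is slow, skipping 48k steps on resume will be slow for the same reason.
train_dataloader = trainer.get_train_dataloader()
start = time.time()
for step, batch in enumerate(train_dataloader):
    if step >= 200:
        break
elapsed = time.time() - start
print(f"{elapsed / 200:.3f} seconds per batch, data loading only")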


I will look into it. Thanks for your help @sgugger :grinning:

It feels odd that I have to iterate over the past steps while using a map-style dataset. Couldn’t the batch sampler just fast-forward to the desired step?

I am loading a huge dataset for language modelling. The operation is IO-bounded and it is taking a few hours to go over the steps.
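
For what it’s worth, the fast-forward idea for a map-style dataset would look roughly like the sketch below. This is a conceptual illustration, not what the Trainer actually does: it rebuilds the epoch’s shuffled index order from a seeded generator and drops the already-consumed indices instead of loading their batches. The names skipped_steps and batch_size are placeholders, and it only works if the seed and dataset match the interrupted run exactly:

import torch
from torch.utils.data import DataLoader, Subset

skipped_steps = 48_000   # hypothetical: steps already covered by the checkpoint
batch_size = 32

# Rebuild the same shuffled index order the interrupted run used
# (requires the same seed and the same dataset length).
generator = torch.Generator().manual_seed(42)
permutation = torch.randperm(len(processed_train_dataset), generator=generator)

# Drop the indices that were already consumed instead of loading their batches.
remaining_indices = permutation[skipped_steps * batch_size:].tolist()
remaining_dataset = Subset(processed_train_dataset, remaining_indices)

resumed_dataloader = DataLoader(
    remaining_dataset,
    batch_size=batch_size,
    collate_fn=data_collator,
)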


I load the dataset from disk and continue pretraining from the checkpoint, but it looks like the Trainer goes through 20k steps while doing nothing (the process takes more than 16 hours). Is there any way to skip this process, i.e. restore only the weights, optimizer and scheduler from the checkpoint, without restoring the position in the data? Thanks.

From sgugger here:

there is no way to be in the exact same place in the dataloaders (which have randomness from the shuffling) without going through the first epochs and then batches.

Very unfortunate; I still wonder why this randomness of the dataloaders can’t be saved directly. If you don’t care about a slight difference in the data you train on, you can use the ignore_data_skip flag.
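
For reference, ignore_data_skip is a TrainingArguments flag in more recent transformers releases (it is not in 3.3.1, so this would require upgrading). With it, the Trainer restores the weights, optimizer and scheduler from the checkpoint but does not replay the already-seen batches, so the data order will differ slightly from an uninterrupted run. A minimal sketch:

training_args = TrainingArguments(
    output_dir="/content/drive/My Drive/AIMBert/output_gpu",
    per_device_train_batch_size=32,
    ignore_data_skip=True,  # do not iterate over the already-seen batches on resume
    # ... other arguments as before ...
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=processed_train_dataset,
    eval_dataset=processed_valid_dataset,
)

# In recent versions the checkpoint path is passed as resume_from_checkpoint.
trainer.train(resume_from_checkpoint="/content/drive/My Drive/AIMBert/output_gpu/checkpoint-48000")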