I’m pre-training a DistilBERT model from scratch and saving a checkpoint every 300 steps. When I load a checkpoint to continue training, the Trainer reports that it is skipping the already-trained steps, but it just starts from 0 and doesn’t log or save anything until it has passed the number of skipped steps.
The progress bar starts at 0, not at the saved number of steps.
Which version of transformers are you using?
I’m using version 3.3.1.
For example, I had trained the model until it reached step 48,000, which took around 5 hours. When I loaded this checkpoint as in the snippet above, it printed this output:
***** Running training *****
Num examples = 66687128
Num Epochs = 10
Instantaneous batch size per device = 32
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 20839730
Continuing training from checkpoint, will skip to saved global_step
Continuing training from epoch 0
Continuing training from global step 48000
Continuing training from 0 non-embedding floating-point operations
Will skip the first 48000 steps in the first epoch
But the progress bar started at 0, and it took another 5 hours to reach step 48,000 again; only then did it start logging and saving.
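For context, resuming with the Trainer in transformers 3.x generally looks something like the sketch below. The model setup, dummy dataset, and paths are placeholders (this is not the original snippet); in 3.x the checkpoint directory is passed to `train()` as `model_path`, while newer releases use `resume_from_checkpoint` instead.

```python
import torch
from torch.utils.data import Dataset
from transformers import (
    DistilBertConfig,
    DistilBertForMaskedLM,
    Trainer,
    TrainingArguments,
)

class DummyMLMDataset(Dataset):
    """Tiny stand-in for the real pre-training corpus (hypothetical)."""
    def __len__(self):
        return 64
    def __getitem__(self, idx):
        ids = torch.randint(0, 30522, (128,))
        return {"input_ids": ids, "labels": ids.clone()}

model = DistilBertForMaskedLM(DistilBertConfig())

training_args = TrainingArguments(
    output_dir="./out",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    save_steps=300,  # checkpoint every 300 steps, as in the original post
)

trainer = Trainer(model=model, args=training_args, train_dataset=DummyMLMDataset())

# transformers 3.x: the checkpoint directory goes in `model_path`;
# newer releases use trainer.train(resume_from_checkpoint=...) instead.
trainer.train(model_path="./out/checkpoint-48000")
```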
It’s normal that the progress bar starts at 0 again and goes through the first 48,000 steps while doing nothing; this is done to get to the same point in your data as you were at when the checkpoint was saved.
If it takes 5 hours to get there, the cause is very likely that your data loading is too slow, because this is the only thing the Trainer does for those steps (no model update, no evaluation, no logging, no saving, just going through the batches).
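In other words, the skip phase is just a loop that pulls batches from the dataloader and throws them away. A simplified sketch of the idea (not the actual Trainer source), with a toy dataset standing in for the real one:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real pre-training dataloader.
dataset = TensorDataset(torch.randn(10_000, 8))
train_dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

steps_trained_in_current_epoch = 200  # value restored from the checkpoint

for step, (batch,) in enumerate(train_dataloader):
    if steps_trained_in_current_epoch > 0:
        steps_trained_in_current_epoch -= 1
        continue  # batch is fetched, then discarded: no forward/backward pass,
                  # no optimizer step, no logging, no saving
    # a real training step would only run from here on
    break
```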
It feels odd that I have to iterate over the past steps while using a map-style dataset. Couldn’t the batch sampler just fast-forward to the desired step?
I am loading a huge dataset for language modelling. The operation is I/O-bound, and it takes a few hours to go over the skipped steps.
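Purely to illustrate the fast-forward idea mentioned above: skipping at the level of batch indices with a seeded sampler would avoid reading any samples from disk. This is only a conceptual sketch, not what the Trainer does today:

```python
import itertools
import torch
from torch.utils.data import BatchSampler, RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(100_000).unsqueeze(1))

# Seeding the generator makes the shuffle order reproducible across runs.
generator = torch.Generator().manual_seed(42)
sampler = RandomSampler(dataset, generator=generator)
batch_sampler = BatchSampler(sampler, batch_size=32, drop_last=True)

steps_to_skip = 1_000  # hypothetical resume point within the current epoch

# Drop the first `steps_to_skip` batches of *indices* only; the samples
# themselves are never touched, so the fast-forward costs almost nothing.
remaining = itertools.islice(iter(batch_sampler), steps_to_skip, None)

for batch_indices in remaining:
    batch = [dataset[i] for i in batch_indices]  # data is read only from here on
    break
```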
I load the dataset from disk and continue pre-training from the checkpoint, but it looks like the Trainer goes through 20k steps while doing nothing (the process took more than 16 hours). Is there any way to skip this, i.e. use only the weights, optimizer and scheduler from the checkpoint and not the position in the data? Thanks.
There is no way to be at the exact same place in the dataloaders (which have randomness from the shuffling) without going through the first epochs and then the batches.
Very unfortunate; I still wonder why the random state of the dataloaders can’t simply be saved along with the checkpoint.
If you don’t care about a slight difference in the data you train on, you can use the ignore_data_skip flag.
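For reference, `ignore_data_skip` is a `TrainingArguments` flag (in transformers releases newer than the 3.3.1 discussed above); a minimal sketch with placeholder values:

```python
from transformers import TrainingArguments

# With ignore_data_skip=True, resuming from a checkpoint restores the model
# weights, optimizer and scheduler but does NOT replay the already-seen
# batches, so training starts again immediately -- at the cost of not landing
# on exactly the same data position as the original run.
training_args = TrainingArguments(
    output_dir="./out",  # placeholder path
    ignore_data_skip=True,
)
```

Resuming then works as usual, e.g. `trainer.train(resume_from_checkpoint="./out/checkpoint-48000")` in recent versions, with the caveat that the run will not see exactly the same sequence of batches as the original one.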