I want to use the data batch iteration number to save the model and print logs, but I can't get the total iteration number across multi-GPU devices. It seems `current_iter` only counts the iterations on a single GPU. My code is:
```python
for epoch in range(start_epoch, total_epoch + 1):
    if resume_state and epoch == start_epoch and current_iter is not None:
        # We need to skip steps until we reach the resumed step
        active_dataloader = accelerator.skip_first_batches(train_loader, current_iter)
    else:
        active_dataloader = train_loader
    for idx, train_data in enumerate(active_dataloader):
        data_timer.record()
        current_iter += 1  # intended to track the total iteration number
        model.feed_data(train_data)
        model.optimize_parameters()
        ...
        if current_iter % opt['logger']['print_freq'] == 0:
            log_vars = {'epoch': epoch, 'iter': current_iter}
            log_vars.update({'lrs': model.get_current_learning_rate()})
            log_vars.update({'time': iter_timer.get_avg_time(),
                             'data_time': data_timer.get_avg_time()})
            log_vars.update(model.get_current_log())
            msg_logger(log_vars)
        # save models and training states
        if current_iter % opt['logger']['save_checkpoint_freq'] == 0:
            logger.info('Saving models and training states.')
            model.save_state(epoch, current_iter)
```
I am not sure whether `current_iter += 1` actually gives me the total iteration number across all GPUs, or only the number of batches seen by this one process. Is this the right way to count?
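For reference, this is a minimal sanity check I could add to compare the counters across processes (a sketch only; it assumes `accelerator` is the same `Accelerator` instance used above and that `train_loader` was sharded via `accelerator.prepare`):

```python
import torch

# Gather the per-process counter onto every process. If the dataloader is
# sharded by accelerator.prepare(), every process should report the same
# current_iter, and the global number of batches consumed this step is
# current_iter * accelerator.num_processes.
iter_tensor = torch.tensor([current_iter], device=accelerator.device)
all_iters = accelerator.gather(iter_tensor)  # shape: (num_processes,)
if accelerator.is_main_process:
    print(f"per-process iters: {all_iters.tolist()}, "
          f"global batches seen: {int(all_iters.sum())}")
```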