With the “codegen-2B-multi” model, using DeepSpeed and gradient checkpointing, the Trainer loop reports the number of trainable parameters as 0. This is where the logging happens:
# Train!
logger.info("***** Running training *****")
logger.info(f" Num examples = {num_examples}")
logger.info(f" Num Epochs = {num_train_epochs}")
logger.info(f" Instantaneous batch size per device = {args.per_device_train_batch_size}")
logger.info(f" Total train batch size (w. parallel, distributed & accumulation) = {total_train_batch_size}")
logger.info(f" Gradient Accumulation steps = {args.gradient_accumulation_steps}")
logger.info(f" Total optimization steps = {max_steps}")
logger.info(
    f" Number of trainable parameters = {sum(p.numel() for p in model.parameters() if p.requires_grad)}"
)
self.state.epoch = 0
start_time = time.time()
epochs_trained = 0
steps_trained_in_current_epoch = 0
steps_trained_progress_bar = None
# Check if continuing training from a checkpoint
if resume_from_checkpoint is not None and os.path.isfile(
When I compute the same count outside the Trainer loop (without DeepSpeed etc.), it correctly reports roughly 2B parameters. I am unable to debug this issue; can anyone help?
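For reference, a minimal sketch of the standalone count described above, assuming the checkpoint is loaded with transformers' AutoModelForCausalLM (the original post does not show the exact loading code, so the model ID and loading call here are assumptions):

from transformers import AutoModelForCausalLM

# Load the checkpoint without DeepSpeed; plain torch parameters report their
# real local sizes, so this prints roughly 2B for codegen-2B-multi.
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-2B-multi")
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Number of trainable parameters = {trainable}")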
I am running into the same problem! May I ask whether you have solved it?
czsun (March 22, 2023, 7:42am):
logger.info(f" Number of trainable parameters = {sum(p.numel() + p.ds_numel for p in model.parameters() if p.requires_grad)}"
)
but I got an error p does not have attribute ds_numel.
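A guess at what is going on (not confirmed anywhere in this thread): under DeepSpeed ZeRO stage 3 the parameters are partitioned across ranks, the local tensor can be empty so p.numel() returns 0, and DeepSpeed stores the full size in a ds_numel attribute on the parameters it manages; parameters it has not wrapped carry no such attribute, which would explain the AttributeError above. A hedged sketch of the same logging line (a drop-in for the line inside the Trainer loop, using the surrounding logger and model) that guards the attribute access and avoids double-counting, since numel() + ds_numel would count unpartitioned parameters twice:

# Read ds_numel only when DeepSpeed has attached it, falling back to the
# plain torch numel() otherwise.
logger.info(
    " Number of trainable parameters = "
    f"{sum(p.ds_numel if hasattr(p, 'ds_numel') else p.numel() for p in model.parameters() if p.requires_grad)}"
)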
Same problem here. Is there any solution?