Hi, I am using AWS ParallelCluster to run NVIDIA's BioNeMo framework (GitHub - aws-samples/awsome-distributed-training: Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS). When I try to pretrain the ESM-1nv model, training runs for only one epoch and then stops, no matter what training parameters we provide. I attach the trainer configuration and the output for reference:
Output:
Trainer configuration in the output file:
0: trainer:
0: devices: 1
0: num_nodes: 1
0: accelerator: gpu
0: precision: 16-mixed
0: logger: false
0: enable_checkpointing: false
0: use_distributed_sampler: false
0: max_epochs: 10
0: max_steps: 100
0: log_every_n_steps: 10
0: val_check_interval: 50
0: limit_val_batches: 50
0: limit_test_batches: 500
0: accumulate_grad_batches: 1
0: gradient_clip_val: 1.0
0: benchmark: false
0: min_epochs: 2
Trainer configuration in base_config.yaml:
trainer:
devices: 1 # number of GPUs or CPUs
num_nodes: 1
accelerator: gpu # gpu or cpu
precision: 16-mixed # 16-mixed, bf16-mixed or 32
logger: False # logger is provided by NeMo exp_manager
enable_checkpointing: False # checkpointing is done by NeMo exp_manager
use_distributed_sampler: False # use NeMo Megatron samplers
max_epochs: 10 # use max_steps instead with NeMo Megatron model
max_steps: 1000 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
log_every_n_steps: 10 # number of iterations between logging
val_check_interval: 1500
limit_val_batches: 50 # number of batches in validation step, use fraction for fraction of data, 0 to disable
limit_test_batches: 500 # number of batches in test step, use fraction for fraction of data, 0 to disable
accumulate_grad_batches: 1
gradient_clip_val: 1.0
benchmark: False
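For context, here is a minimal PyTorch Lightning sketch (not BioNeMo code; the model and dataset are invented purely for illustration) showing the standard interaction between max_steps and max_epochs that the trainer block above is configured with: the Trainer stops as soon as global_step reaches max_steps, even if max_epochs has not been exhausted. Whether this is what is actually stopping my run is an assumption on my part.

# Illustrative sketch only: a tiny Lightning run that stops at max_steps
# before the first epoch finishes, mirroring max_epochs=10 / max_steps=100
# from the trainer config above.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

data = TensorDataset(torch.randn(512, 8), torch.randn(512, 1))
loader = DataLoader(data, batch_size=4)  # 128 optimizer steps per epoch

# max_epochs=10 but max_steps=100: training halts at global_step 100,
# i.e. partway through the first epoch.
trainer = pl.Trainer(max_epochs=10, max_steps=100,
                     logger=False, enable_checkpointing=False)
trainer.fit(TinyModel(), loader)
print(trainer.global_step)  # 100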