ESM-1nv model does not train for more than one epoch

Hi, I am using AWS ParallelCluster to run NVIDIA's BioNeMo framework (GitHub - aws-samples/awsome-distributed-training: Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.). When I try to pretrain the ESM-1nv model, training runs for only one epoch and then stops, no matter what training parameters we provide. I attach the trainer configuration and the output for reference:
Output:

Trainer configuration as printed in the output file:
0: trainer:
0:   devices: 1
0:   num_nodes: 1
0:   accelerator: gpu
0:   precision: 16-mixed
0:   logger: false
0:   enable_checkpointing: false
0:   use_distributed_sampler: false
0:   max_epochs: 10
0:   max_steps: 100
0:   log_every_n_steps: 10
0:   val_check_interval: 50
0:   limit_val_batches: 50
0:   limit_test_batches: 500
0:   accumulate_grad_batches: 1
0:   gradient_clip_val: 1.0
0:   benchmark: false
0:   min_epochs: 2
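My understanding is that the Lightning Trainer stops at whichever of max_steps or max_epochs it hits first, so I wonder whether max_steps: 100 is what ends the run after a single epoch. A minimal sketch outside BioNeMo (the ToyModel module and the dataset sizes below are made up for illustration, not part of the framework) that reproduces this stopping behavior:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

# 1000 samples with batch size 10 -> 100 optimizer steps per epoch,
# so max_steps=100 is exhausted after exactly one epoch even though
# max_epochs=10.
data = TensorDataset(torch.randn(1000, 4), torch.randn(1000, 1))
trainer = pl.Trainer(max_epochs=10, max_steps=100,
                     logger=False, enable_checkpointing=False)
trainer.fit(ToyModel(), DataLoader(data, batch_size=10))
print(trainer.global_step)  # 100 -> training stopped by max_steps, not max_epochs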

Trainer configuration in base_config.yaml:

trainer:
  devices: 1 # number of GPUs or CPUs
  num_nodes: 1
  accelerator: gpu # gpu or cpu
  precision: 16-mixed # 16-mixed, bf16-mixed or 32
  logger: False # logger is provided by NeMo exp_manager
  enable_checkpointing: False # checkpointing is done by NeMo exp_manager
  use_distributed_sampler: False # use NeMo Megatron samplers
  max_epochs: 10 # use max_steps instead with NeMo Megatron model
  max_steps: 1000 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
  log_every_n_steps: 10 # number of iterations between logging
  val_check_interval: 1500
  limit_val_batches: 50 # number of batches in validation step, use fraction for fraction of data, 0 to disable
  limit_test_batches: 500 # number of batches in test step, use fraction for fraction of data, 0 to disable
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  benchmark: False
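As a sanity check on the consumed_samples comment next to max_steps, here is that arithmetic worked through with hypothetical values (micro_batch_size and data_parallel_size do not appear in this config excerpt, so the numbers below are assumptions for illustration only):

# consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
global_step = 1000          # max_steps from base_config.yaml
micro_batch_size = 8        # assumed; not shown in the config above
data_parallel_size = 1      # assumed; devices * num_nodes = 1 here
accumulate_grad_batches = 1 # from the config
consumed_samples = (global_step * micro_batch_size
                    * data_parallel_size * accumulate_grad_batches)
print(consumed_samples)  # 8000 samples consumed by the time max_steps is reached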