Don't understand the progress bar when launching fine-tuning jobs (Sagemaker)

Background

I am fine-tuning a Mistral-7B-Instruct-v0.1 model using the same workflow as outlined in these two blog posts (using SageMaker):

Everything seems to work well, and the fine-tuned model produces results that look very good. However, I'm curious about the progress bar.

When I run the fine-tuning on a small dataset containing 100 observations with the following settings:

  • num_train_epochs = 2
  • per_device_train_batch_size = 1
  • gradient_accumulation_steps = 4
  • SM_LOG_LEVEL=20
  • logging_steps = 25

This is the attained progress bar during the fine-tuning:

0%|          | 0/4 [00:00<?, ?it/s]
25%|██▌       | 1/4 [00:24<01:14, 24.76s/it]
50%|█████     | 2/4 [00:49<00:49, 24.51s/it]
75%|███████▌  | 3/4 [01:13<00:24, 24.47s/it]
100%|██████████| 4/4 [01:37<00:00, 24.42s/it]
{'train_runtime': 97.8848, 'train_samples_per_second': 0.184, 'train_steps_per_second': 0.041, 'train_loss': 1.038140892982483, 'epoch': 1.78}
100%|██████████| 4/4 [01:37<00:00, 24.42s/it]
100%|██████████| 4/4 [01:37<00:00, 24.47s/it]

When I instead run the fine-tuning on a dataset that contains 10,000 observations the progress bar looks like this (just showing final iterations here):

100%|█████████▉| 491/492 [3:19:46<00:24, 24.41s/it]
100%|██████████| 492/492 [3:20:10<00:00, 24.40s/it]
{'train_runtime': 12010.6264, 'train_samples_per_second': 0.164, 'train_steps_per_second': 0.041, 'train_loss': 0.5181044475819038, 'epoch': 2.0}
100%|██████████| 492/492 [3:20:10<00:00, 24.40s/it]
100%|██████████| 492/492 [3:20:10<00:00, 24.41s/it]
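
The two summary lines above can be cross-checked against the progress bar. A minimal sketch, assuming `train_samples_per_second` and `train_steps_per_second` are simply the number of rows and optimizer steps the trainer counted, divided by `train_runtime` (the names `SMALL_RUN`, `LARGE_RUN` and `implied_counts` are made up for this illustration):

```python
import ast

# Summary lines copied verbatim from the two runs above.
SMALL_RUN = "{'train_runtime': 97.8848, 'train_samples_per_second': 0.184, 'train_steps_per_second': 0.041, 'train_loss': 1.038140892982483, 'epoch': 1.78}"
LARGE_RUN = "{'train_runtime': 12010.6264, 'train_samples_per_second': 0.164, 'train_steps_per_second': 0.041, 'train_loss': 0.5181044475819038, 'epoch': 2.0}"


def implied_counts(summary_line):
    """Return (samples, steps) implied by rate * runtime for one summary line."""
    metrics = ast.literal_eval(summary_line)  # the line is a plain Python dict literal
    runtime = metrics["train_runtime"]
    samples = runtime * metrics["train_samples_per_second"]
    steps = runtime * metrics["train_steps_per_second"]
    return samples, steps


for label, line in [("100 observations", SMALL_RUN), ("10,000 observations", LARGE_RUN)]:
    samples, steps = implied_counts(line)
    print(f"{label}: ~{samples:.0f} samples, ~{steps:.0f} steps over the whole run")
```

Because the logged rates are rounded to three decimals, these are estimates only. They come out at roughly 18 samples and 4 steps for the small run, and roughly 1,970 samples and 492 steps for the large one. Under the stated assumption, the trainer counted far fewer rows than 2 x 100 or 2 x 10,000.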

The run time for the last job is of the same order of magnitude as the run time of the fine-tuning job in this blog post (Fine-Tune & Evaluate LLMs in 2024 with Amazon SageMaker): 1.8 hours there vs. 3.3 hours for me. I use managed spot training, have rather large inputs on average, and save the model every 100 steps, so I'm not too worried that my training time is longer, even though Philip used one more epoch during training. My fine-tuned model also produces much better results than the raw model for my use case, so the fine-tuning seems to work.

Question
I don't understand the iteration counts in the progress bar.

With only 100 observations, two epochs, a batch size of 1, and gradient_accumulation_steps of 4, the number of steps should be 200 / 4 = 50.

Analogously, with 10,000 observations, the number of steps should be 20,000 / 4 = 5,000.

Why does the progress bar show 4 and 492 steps instead?
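
The arithmetic in the question can be written down and also run in reverse. The sketch below is not taken from the training script. It assumes the progress bar ticks once per optimizer update, that leftover batches which do not fill an accumulation window add no extra update, and that a single GPU is used (ml.g5.4xlarge has one). Both function names are hypothetical.

```python
import math


def expected_optimizer_steps(num_rows, per_device_batch_size, grad_accum_steps,
                             num_epochs, num_devices=1):
    """Progress-bar total, assuming one tick per optimizer update."""
    batches_per_epoch = math.ceil(num_rows / (per_device_batch_size * num_devices))
    # Assumption: a partial accumulation window at the end of an epoch
    # does not count as its own update (floor division).
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(num_epochs * updates_per_epoch)


def implied_rows_per_epoch(total_steps, per_device_batch_size, grad_accum_steps,
                           num_epochs, num_devices=1):
    """(min, max) row counts that would yield `total_steps` under the same assumptions.

    Only meaningful for an integer num_epochs that divides total_steps.
    """
    updates_per_epoch = total_steps // num_epochs
    min_batches = updates_per_epoch * grad_accum_steps
    max_batches = min_batches + grad_accum_steps - 1
    rows_per_batch = per_device_batch_size * num_devices
    return (min_batches - 1) * rows_per_batch + 1, max_batches * rows_per_batch


# Forward: what the question expects from the raw observation counts.
print(expected_optimizer_steps(100, 1, 4, 2))     # 50
print(expected_optimizer_steps(10_000, 1, 4, 2))  # 5000

# Backward: how many rows per epoch the observed totals would imply.
print(implied_rows_per_epoch(4, 1, 4, 2))    # (8, 11)
print(implied_rows_per_epoch(492, 1, 4, 2))  # (984, 987)
```

Under these assumptions, totals of 4 and 492 would correspond to roughly 8 to 11 and 984 to 987 rows per epoch, not 100 and 10,000. One possible explanation, not confirmed here, is that the training script packs several observations into each sequence of up to max_seq_len tokens (TRL's SFTTrainer has a `packing` option that does this), so the trainer iterates over packed sequences instead of individual observations.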

Code

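import boto3
import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.huggingface import HuggingFace

job_name = 'mistralinstruct-7b-hf-mini'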

hyperparameters = {
  'dataset_path': '/opt/ml/input/data/training/train_dataset.json',
  'model_id': "mistralai/Mistral-7B-Instruct-v0.1",
  'max_seq_len': 3872,
  'use_qlora': True,
  'num_train_epochs': 2,
  'per_device_train_batch_size': 1,
  'gradient_accumulation_steps': 4,
  'gradient_checkpointing': True,
  'optim': "adamw_torch_fused",
  'logging_steps': 25,
  'save_strategy': "steps",
  'save_steps' : 100,
  'learning_rate': 2e-4,
  'bf16': True,
  'tf32': True, 
  'max_grad_norm': 1.0,
  'warmup_ratio': 0.03,
  'lr_scheduler_type': "constant",
  'report_to': "tensorboard",
  'output_dir': "/opt/ml/checkpoints",
  'merge_adapters': True,
}


sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket='com.ravenpack.dsteam.research.testing'
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()
print(sagemaker_session_bucket)

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='SageMaker-ds-research-testing')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)


tensorboard_output_config = TensorBoardOutputConfig(
    container_local_output_path='/opt/ml/output/tensorboard',
    s3_output_path = f's3://{sess.default_bucket()}/...{my_path}...',
)


# Raw strings avoid the invalid "\s" escape warning; the patterns match the same text as before
metric_definitions = [
    {'Name': 'loss', 'Regex': r"'loss':\s*([0-9.]+)"},
    {'Name': 'grad_norm', 'Regex': r"'grad_norm':\s*([0-9.]+)"},
    {'Name': 'learning_rate', 'Regex': r"'learning_rate':\s*([0-9.]+)"},
    {'Name': 'epoch', 'Regex': r"'epoch':\s*([0-9.]+)"}
]


# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_sft.py',    # train script (used Philip's from https://github.com/philschmid/llm-sagemaker-sample/blob/main/scripts/trl/run_sft.py)
    source_dir           = '...{my_path}...', 
    instance_type        = 'ml.g5.4xlarge',
    instance_count       = 1,             
    max_run              = 1*24*60*60,
    max_wait             = 2*24*60*60,       
    use_spot_instances   = True,
    base_job_name        = job_name,         
    role                 = role,
    volume_size          = 300,
    transformers_version = '4.36',
    pytorch_version      = '2.1',
    py_version           = 'py310',
    hyperparameters      =  hyperparameters,
    disable_output_compression = True,
    environment          = {
                            "HUGGINGFACE_HUB_CACHE": "/tmp/.cache",
                            },
    metric_definitions   = metric_definitions,
    tensorboard_output_config = tensorboard_output_config,
    
    checkpoint_s3_uri = f's3://{sess.default_bucket()}/...{my_path}...',
)   


training_input_path = f's3://{sess.default_bucket()}/...{my_path}...'


data = {'training': training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)
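
As a side check on the metric_definitions above, the patterns can be tried locally against the summary line from the 100-observation run. This sketch only approximates what SageMaker does with the log stream by using Python's `re.search`. The helper name `extract_metrics` is made up for this illustration.

```python
import re

# Same patterns as in metric_definitions above, written as raw strings.
metric_definitions = [
    {"Name": "loss", "Regex": r"'loss':\s*([0-9.]+)"},
    {"Name": "grad_norm", "Regex": r"'grad_norm':\s*([0-9.]+)"},
    {"Name": "learning_rate", "Regex": r"'learning_rate':\s*([0-9.]+)"},
    {"Name": "epoch", "Regex": r"'epoch':\s*([0-9.]+)"},
]

# Summary line copied from the 100-observation run.
summary_line = (
    "{'train_runtime': 97.8848, 'train_samples_per_second': 0.184, "
    "'train_steps_per_second': 0.041, 'train_loss': 1.038140892982483, 'epoch': 1.78}"
)


def extract_metrics(line, definitions):
    """Return {metric name: value} for every pattern that matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


print(extract_metrics(summary_line, metric_definitions))
```

On this line only 'epoch' is captured. The 'loss' pattern requires a quote directly before the word loss, so it does not pick up 'train_loss'. With logging_steps = 25 and only 4 steps in the small run, the output shown contains no intermediate loss lines for the other patterns to match.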