Only one SageMaker TFEvent?

rosenjcb · October 15, 2021, 6:44pm

When I run a SageMaker training job with the settings below, I get only one tfevent file in the TensorBoard output bucket. Am I doing something wrong? As I understand it, a new tfevent file should be generated every 500 steps by default. At least that’s what I think the intended behavior is. Right now, when I open TensorBoard for that run, I see only one data point. Here’s the code for the job:

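from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.huggingface import HuggingFace
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    "s3://.../train.csv", content_type="csv"
)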
test_input = TrainingInput(
    "s3://.../test.csv", content_type="csv"
)

# configure git settings
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.6.1'}

tb_config = TensorBoardOutputConfig('s3://...')

# hyperparameters, which are passed into the training job
# (defined first, because the estimator below references them)
hyperparameters={'model_name_or_path': 'distilbert-base-uncased',
                 'task_name': 'cola',
                 'max_seq_length': 512,
                 'do_train': True,
                 'do_eval': True,
                 'do_predict': True,
                 'per_device_train_batch_size': 16,
                 'per_device_eval_batch_size': 16,
                 'output_dir': '/opt/ml/model',
                 'learning_rate': 2e-5,
                 'max_steps': 1500,
                 'evaluation_strategy': 'steps'}

# create the Estimator
# (`role` is the SageMaker execution role, defined elsewhere)
huggingface_estimator = HuggingFace(
        entry_point='run_glue.py',
        source_dir='examples/pytorch/text-classification',
        git_config=git_config,
        instance_type='ml.p3.2xlarge',
        instance_count=1,
        role=role,
        transformers_version='4.6',
        pytorch_version='1.7',
        py_version='py36',
        tensorboard_output_config=tb_config,
        hyperparameters=hyperparameters
)

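For reference, here is a minimal sketch of the logging cadence these settings would imply, assuming the Trainer logs whenever the global step is a multiple of `logging_steps` (500 by default) and that `eval_steps` falls back to the same value. The helper `expected_log_steps` and the `logging_overrides` dict are hypothetical names used only for illustration. The `/opt/ml/output/tensorboard` path is also an assumption about where `TensorBoardOutputConfig` looks inside the container by default, so it should be checked against the SDK version in use. As far as I understand, the Trainer's TensorBoard callback appends points to one event file per run rather than creating a new file every 500 steps, so the number of data points is probably the more telling signal than the number of files.

```python
# Hypothetical helper: lists the global steps at which scalars would be logged,
# assuming logging happens whenever global_step is a multiple of logging_steps.
def expected_log_steps(max_steps, logging_steps=500):
    return [s for s in range(1, max_steps + 1) if s % logging_steps == 0]


# Extra hyperparameters that make the cadence and log location explicit.
# Assumption: /opt/ml/output/tensorboard is the container directory that
# TensorBoardOutputConfig syncs to S3 by default; verify for your SDK version.
logging_overrides = {
    "logging_steps": 500,
    "eval_steps": 500,
    "logging_dir": "/opt/ml/output/tensorboard",
}

# With max_steps=1500 this prints [500, 1000, 1500], i.e. three points.
print(expected_log_steps(1500))
```

If that assumption holds, a run with `max_steps` of 1500 should show three points per scalar, not one. The overrides could be merged into the job's settings with `hyperparameters.update(logging_overrides)` before the estimator is created.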