When I run a SageMaker training job with the settings below, I only get one tfevents file in the TensorBoard output bucket. Am I doing something wrong? I expected a new tfevents file to be written every 500 steps by default; at least, that’s what I understand the intended behavior to be. Right now, when I open TensorBoard for that run, I only see one data point. Here’s the code for the job:
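from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.huggingface import HuggingFace
from sagemaker.inputs import TrainingInput

train_input = TrainingInput("s3://.../train.csv", content_type="csv")
test_input = TrainingInput("s3://.../test.csv", content_type="csv")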
# configure git settings
git_config = {
    'repo': 'https://github.com/huggingface/transformers.git',
    'branch': 'v4.6.1',
}
tb_config = TensorBoardOutputConfig('s3://...')
# hyperparameters, which are passed into the training job
# (defined before the Estimator, which references them)
hyperparameters = {
    'model_name_or_path': 'distilbert-base-uncased',
    'task_name': 'cola',
    'max_seq_length': 512,
    'do_train': True,
    'do_eval': True,
    'do_predict': True,
    'per_device_train_batch_size': 16,
    'per_device_eval_batch_size': 16,
    'output_dir': '/opt/ml/model',
    'learning_rate': 2e-5,
    'max_steps': 1500,
    'evaluation_strategy': 'steps',
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='run_glue.py',
    source_dir='examples/pytorch/text-classification',
    git_config=git_config,
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,  # SageMaker execution role, defined elsewhere
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    tensorboard_output_config=tb_config,
    hyperparameters=hyperparameters,
)
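
To make my expectation concrete: assuming something gets logged once every 500 steps (my reading of the default, which may be wrong), a run with max_steps of 1500 should give three points in TensorBoard, not one. A quick sketch of that arithmetic; `expected_log_steps` is just a helper name I made up for illustration, not part of SageMaker or transformers:

```python
def expected_log_steps(max_steps, logging_steps=500):
    """Steps at which a point would be logged, assuming one every `logging_steps`.

    The 500-step interval is an assumption about the default, not a verified fact.
    """
    return list(range(logging_steps, max_steps + 1, logging_steps))


# With max_steps=1500 as in the job above:
print(expected_log_steps(1500))  # [500, 1000, 1500]
```

So I would expect to see points at steps 500, 1000 and 1500, but I only get one.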