How to specify the S3 bucket for training debug output

Hello,

I’m working on a project where we’re using AWS SageMaker to run training jobs. We specified the default_bucket parameter in the SageMaker Session, but when we run the code, a new S3 bucket named sagemaker-{region}-{aws-account-id} is created, and a folder with the prefix huggingface-pytorch-training- is created for each run. Each of those folders contains three subfolders: debug-output/, profiler-output/, and source/.

Looking at the SageMaker Session documentation, it says this bucket is only created if the default_bucket parameter is not specified:

default_bucket (str) – The default Amazon S3 bucket to be used by this session. This will be created the next time an Amazon S3 bucket is needed (by calling default_bucket()). If not provided, a default bucket will be created based on the following format: “sagemaker-{region}-{aws-account-id}”. Example: “sagemaker-my-custom-bucket”.
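For concreteness, the format quoted from the docs would produce a name like the following (plain-Python illustration; the region and account ID below are made-up placeholders, not our real values):

```python
# Illustration of the default bucket name format described in the docs.
# Placeholder values only -- not our actual region or account ID.
region = "us-east-1"
aws_account_id = "123456789012"

default_bucket_name = f"sagemaker-{region}-{aws_account_id}"
print(default_bucket_name)  # sagemaker-us-east-1-123456789012
```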

The predictions and checkpoint data do seem to be saved correctly to the S3 bucket specified in the SageMaker Session, but these extra items still appear. Can someone help me understand how to prevent the sagemaker-{region}-{aws-account-id} bucket from being created, and instead direct the outputs in the huggingface-pytorch-training- folders to an S3 bucket that we specify?

Here’s a snippet of our code:

    session = sagemaker.Session(default_bucket=DESTINATION_S3_BUCKET)
    role = sagemaker.get_execution_role(sagemaker_session=session)

    hf_estimator = HuggingFace(
        role=role,
        entry_point=TRAINER_FILE,
        instance_type=instance_type,
        instance_count=INSTANCE_COUNT,
        transformers_version=TRANSFORMERS_VERSION,
        pytorch_version=PYTORCH_VERSION,
        py_version=PY_VERSION,
        checkpoint_s3_uri=CHECKPOINT_S3_URI,
        hyperparameters=HYPERPARAMETERS,
    )
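One thing we were considering is passing our session (and an explicit output path) to the estimator, since we suspect it otherwise constructs its own default Session and hence its own default bucket. Below is a self-contained sketch of the two keyword arguments we think we may be missing; both parameter names (sagemaker_session, output_path) are assumptions on our part based on the base SageMaker estimator parameters, and the bucket name is a placeholder. Is this the right approach?

```python
# Sketch only (no AWS calls): the two estimator keyword arguments we are
# considering adding. Parameter names are assumed from the base sagemaker
# Estimator -- please correct us if the HuggingFace estimator differs.
DESTINATION_S3_BUCKET = "our-destination-bucket"  # placeholder

candidate_kwargs = {
    # Reuse the Session we created with default_bucket set, so the
    # estimator does not build its own default Session (and bucket).
    "sagemaker_session": None,  # would be our `session` object
    # Point the job's output artifacts at our bucket explicitly.
    "output_path": f"s3://{DESTINATION_S3_BUCKET}/huggingface-training",
}

print(candidate_kwargs["output_path"])
```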

and TRAINER_FILE contains the following:

    training_args = TrainingArguments(
        output_dir=args["OUTPUT_DIR"],
        optim=args["OPTIMIZER"],
        per_device_train_batch_size=16,
        num_train_epochs=args["TRAIN_EPOCHS"],
        learning_rate=args["LEARNING_RATE"],
        weight_decay=args["WEIGHT_DECAY"],
        warmup_ratio=args["WARMUP_RATIO"],
        per_device_eval_batch_size=16,
        save_strategy="epoch",
        logging_strategy="epoch",
        remove_unused_columns=False,
        fp16=True,
        push_to_hub=False,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_data,
        eval_dataset=validation_data,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )