Fine-tuning a text summarization model: can I change the pretrained model?

hi,

I am following tutorial 08 on distributed training, and I am working on text summarization in a different language (here, Chinese). Is there any guidance on how to change the pretrained model for multilingual use?

best,
jackie

You can adjust the model that is used for training in the hyperparameters. Just replace it with the model you want to use.


# hyperparameters, which are passed into the training job
hyperparameters = {'per_device_train_batch_size': 4,
                   'per_device_eval_batch_size': 4,
                   'model_name_or_path': 'facebook/bart-large-cnn',  # model used for training
                   }
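
To make this concrete for a non-English dataset, here is a minimal sketch of how that hyperparameter plugs into the Hugging Face estimator from the tutorial. The multilingual checkpoint (google/mt5-small), script name, paths, and version numbers are assumptions on my side; adjust them to your setup.

# minimal sketch: swap in a multilingual checkpoint via model_name_or_path
from sagemaker.huggingface import HuggingFace

hyperparameters = {'per_device_train_batch_size': 4,
                   'per_device_eval_batch_size': 4,
                   'model_name_or_path': 'google/mt5-small',  # example multilingual model; any Hub checkpoint works
                   }

huggingface_estimator = HuggingFace(
    entry_point='run_summarization.py',   # assumption: the summarization script used in the tutorial
    source_dir='./scripts',               # assumption: local folder containing the training script
    instance_type='ml.p3.16xlarge',
    instance_count=2,
    role=role,                            # your SageMaker execution role, e.g. sagemaker.get_execution_role()
    transformers_version='4.6.1',         # assumption: pick versions matching your setup
    pytorch_version='1.7.1',
    py_version='py36',
    distribution={'smdistributed': {'dataparallel': {'enabled': True}}},  # data parallelism as in the tutorial
    hyperparameters=hyperparameters,
)

Any summarization-capable checkpoint from the Hub that covers your language should slot in here; the rest of the training setup stays the same.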

Cool! I have one further question: it turns out the training very easily runs into a CUDA OOM. I tried a p3.16x and it works well during training with batch size = 1, but it failed when uploading the model file to S3: ClientError: Artifact upload failed: Error 5: Received a failed archive status. Any suggestion on this other than simply adding instance volume (since 16x is already very large…)?

Yes, depending on the dataset size and model size you use, it can quickly run out of memory.
About your error ClientError: Artifact upload failed: Error 5: Received a failed archive status: is it appearing at the end of the training?
Do you have any more insights than this?

Actually, I looked at the CloudWatch logs, and they print out the training loss, eval metrics, etc., with no error information. The SageMaker training job dashboard just shows the error mentioned above. It's quite confusing to me, since I normally only encounter OOM problems during training, not after training.

Hey @jackieliu930,

Could you share your CloudWatch logs? Have you configured the permissions correctly? Could it be that it fails because you cannot upload to S3?

Hi! Sorry for the late response. It seems the model file was too large, which caused the failure. I changed the instance type and set a smaller batch size, which solved the problem.
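
For anyone landing here with the same symptom: the change described above amounts to lowering the per-device batch size (optionally compensating with gradient accumulation) and picking an instance type with enough GPU memory and local disk for the model artifact. A small sketch with illustrative values, not the exact ones used in this thread:

# sketch: reduce the memory footprint; values are illustrative
hyperparameters = {'per_device_train_batch_size': 1,   # smaller batch size to stay within GPU memory
                   'per_device_eval_batch_size': 1,
                   'gradient_accumulation_steps': 4,   # assumption: keeps the effective batch size comparable
                   'model_name_or_path': 'facebook/bart-large-cnn',
                   }
# pass these to the estimator as before, with instance_type set to a machine that has
# enough GPU memory and local disk to package and upload the final model artifact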
