Fine-tuning a text summarization model: can I change the pretrained model?

hi,

I am following tutorial 08 on distributed training, and I am working on text summarization in a different language (here, Chinese). Is there any guidance on how to change the pretrained model for multilingual use?

best,
jackie

You can adjust the model that is used for training in the hyperparameters. Just replace it with the model you want to use.


# hyperparameters, which are passed into the training job
hyperparameters = {'per_device_train_batch_size': 4,
                   'per_device_eval_batch_size': 4,
                   'model_name_or_path': 'facebook/bart-large-cnn',  # model used for training
                   }
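
To make this concrete for a non-English dataset, here is a minimal sketch of how that hyperparameter plugs into the Hugging Face estimator from the tutorial. The multilingual checkpoint (google/mt5-small), script name, paths, and version numbers are assumptions on my side; adjust them to your setup.

# minimal sketch: swap in a multilingual checkpoint via model_name_or_path
from sagemaker.huggingface import HuggingFace

hyperparameters = {'per_device_train_batch_size': 4,
                   'per_device_eval_batch_size': 4,
                   'model_name_or_path': 'google/mt5-small',  # example multilingual model; any Hub checkpoint works
                   }

huggingface_estimator = HuggingFace(
    entry_point='run_summarization.py',   # assumption: the summarization script used in the tutorial
    source_dir='./scripts',               # assumption: local folder containing the training script
    instance_type='ml.p3.16xlarge',
    instance_count=2,
    role=role,                            # your SageMaker execution role, e.g. sagemaker.get_execution_role()
    transformers_version='4.6.1',         # assumption: pick versions matching your setup
    pytorch_version='1.7.1',
    py_version='py36',
    distribution={'smdistributed': {'dataparallel': {'enabled': True}}},  # data parallelism as in the tutorial
    hyperparameters=hyperparameters,
)

Any summarization-capable checkpoint from the Hub that covers your language should slot in here; the rest of the training setup stays the same.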

Cool! I have one further question: it turns out the training very easily runs into a CUDA OOM. I tried a p3.16x and it works well during training with batch size = 1, but it failed when uploading the model file to S3: ClientError: Artifact upload failed: Error 5: Received a failed archive status. Any suggestion on this other than simply adding instance volume (since 16x is already very large…)?

Yes, depending on the dataset size and model size you use, it can quickly run out of memory.
About your error ClientError: Artifact upload failed: Error 5: Received a failed archive status: is it appearing at the end of the training?
Do you have any more insights than this?

Actually, I looked at the CloudWatch logs, and they print out the training loss, eval metrics, etc., with no error information. The SageMaker training job dashboard just shows the error mentioned above. It's quite confusing to me, since I normally only encounter OOM problems during training, not after training.

Hey @jackieliu930,

Could you share your CloudWatch logs? Have you configured the permissions correctly? Could it be that it fails because you cannot upload to S3?

Hi! Sorry for the late response. It seems the model file was too large, which caused the failure. I changed the instance type and set a smaller batch size, which solved the problem.
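
For anyone landing here with the same symptom: the change described above amounts to lowering the per-device batch size (optionally compensating with gradient accumulation) and picking an instance type with enough GPU memory and local disk for the model artifact. A small sketch with illustrative values, not the exact ones used in this thread:

# sketch: reduce the memory footprint; values are illustrative
hyperparameters = {'per_device_train_batch_size': 1,   # smaller batch size to stay within GPU memory
                   'per_device_eval_batch_size': 1,
                   'gradient_accumulation_steps': 4,   # assumption: keeps the effective batch size comparable
                   'model_name_or_path': 'facebook/bart-large-cnn',
                   }
# pass these to the estimator as before, with instance_type set to a machine that has
# enough GPU memory and local disk to package and upload the final model artifact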
