Running out of memory with all except the basic GPT-2 and GPT-Neo models on SageMaker

How do I set the max_split_size_mb, gradient_checkpointing, or fp16 parameters in the HuggingFace estimator constructor?
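For reference, when training locally I would set these roughly as in the sketch below (my assumptions: PyTorch reads the allocator option from the PYTORCH_CUDA_ALLOC_CONF environment variable, and gradient_checkpointing / fp16 are ordinary TrainingArguments). What I can't work out is where the equivalents go when the job is launched through the SageMaker estimator:

import os

# the allocator option is read when the CUDA caching allocator starts up,
# so it has to be in the environment before any CUDA work happens
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:512'

from transformers import TrainingArguments

# gradient_checkpointing and fp16 are Trainer/script arguments,
# not options of any constructor as far as I can tell
training_args = TrainingArguments(
    output_dir='out',
    gradient_checkpointing=True,
    fp16=True,
)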

My code is as follows:

import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

hyper_params = {
    'model_name_or_path': gpt_model,
    'output_dir': '/opt/ml/model',
    'do_train': True,
    'train_file': '/opt/ml/input/data/train/{}'.format(training_file_name),
    'num_train_epochs': 5,
    'per_device_train_batch_size': 10,
}

git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.17.0'}

huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.g4dn.2xlarge',
    env={'max_split_size_mb': 512},
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyper_params,
    gradient_checkpointing=True,
    fp16=True,
)

huggingface_estimator.fit({'train': s3_training_data}, wait=True)
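From reading the SageMaker SDK and Transformers docs, my best guess is that gradient_checkpointing and fp16 are not constructor arguments at all but script flags that belong inside hyperparameters (run_clm.py parses them as TrainingArguments), and that max_split_size_mb has to travel as the PYTORCH_CUDA_ALLOC_CONF environment variable, presumably through the estimator's environment argument rather than env. A sketch of what I think that would look like, unverified:

huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.g4dn.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    # script arguments for run_clm.py, forwarded as --gradient_checkpointing / --fp16
    hyperparameters={
        **hyper_params,
        'gradient_checkpointing': True,
        'fp16': True,
    },
    # allocator setting reaches PyTorch as an environment variable in the container
    environment={'PYTORCH_CUDA_ALLOC_CONF': 'max_split_size_mb:512'},
)

Is that the right way to do it, or is there a dedicated place for these settings in the HuggingFace constructor?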