OutOfMemoryError: CUDA out of memory while trying to replicate this notebook on SageMaker: https://github.com/huggingface/notebooks/blob/main/sagemaker/24_train_bloom_peft_lora/sagemaker-notebook.ipynb

Training setup:

import time
# define the training job name
job_name = f'huggingface-peft-chat-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters = {
  'model_id': model_id,                                # pre-trained model id
  'dataset_path': '/opt/ml/input/data/training',       # path where SageMaker places the training dataset
  'epochs': 5,                                         # number of training epochs
  'per_device_train_batch_size': 1,                    # batch size for training
  'lr': 2e-4,                                          # learning rate used during training
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.16xlarge',  # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used by the training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.26',            # the transformers version used in the training job
    pytorch_version      = '1.13',            # the pytorch version used in the training job
    py_version           = 'py39',            # the python version used in the training job
    hyperparameters      = hyperparameters
)
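
For completeness, the job is then launched by calling fit() on the estimator, as in the notebook. A minimal sketch, assuming training_input_path holds the s3:// URI of the processed dataset (that variable is not shown above):

# start the SageMaker training job; the "training" channel is mounted at
# /opt/ml/input/data/training inside the container, matching dataset_path above
# (training_input_path is assumed to be the s3:// URI of the uploaded dataset)
data = {'training': training_input_path}
huggingface_estimator.fit(data, wait=True)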

Requirements file:

git+https://github.com/huggingface/peft.git
transformers==4.27.1
accelerate==0.17.1
bitsandbytes==0.37.1

Training error:

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
Using provided s3_resource
INFO:sagemaker:Creating training-job with name: huggingface-peft-chat-2023-05-10-14-10--2023-05-10-14-10-27-991
2023-05-10 14:10:28 Starting - Starting the training job...
2023-05-10 14:10:49 Starting - Preparing the instances for training......
2023-05-10 14:11:56 Downloading - Downloading input data...
2023-05-10 14:12:11 Training - Downloading the training image...............
2023-05-10 14:14:52 Training - Training image download completed. Training in progress......bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-05-10 14:15:37,684 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-05-10 14:15:37,699 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-05-10 14:15:37,709 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-05-10 14:15:37,711 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-05-10 14:15:37,926 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/conda/bin/python3.9 -m pip install -r requirements.txt
Collecting git+https://github.com/huggingface/peft.git (from -r requirements.txt (line 1))
Cloning https://github.com/huggingface/peft.git to /tmp/pip-req-build-dkjr7e3k
Running command git clone --filter=blob:none --quiet https://github.com/huggingface/peft.git /tmp/pip-req-build-dkjr7e3k
Resolved https://github.com/huggingface/peft.git to commit 4fd374e80d670781c0d82c96ce94d1215ff23306
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'done'
Preparing metadata (pyproject.toml): started
Preparing metadata (pyproject.toml): finished with status 'done'
Collecting transformers==4.27.1
Downloading transformers-4.27.1-py3-none-any.whl (6.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.7/6.7 MB 92.5 MB/s eta 0:00:00
Collecting accelerate==0.17.1
Downloading accelerate-0.17.1-py3-none-any.whl (212 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 212.8/212.8 kB 50.8 MB/s eta 0:00:00
Collecting bitsandbytes==0.37.1
Downloading bitsandbytes-0.37.1-py3-none-any.whl (76.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76.3/76.3 MB 34.7 MB/s eta 0:00:00
Requirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (2022.10.31)
Requirement already satisfied: requests in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (2.28.2)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (0.13.2)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (23.0)
Requirement already satisfied: huggingface-hub<1.0,>=0.11.0 in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (0.12.0)
Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (5.4.1)
Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (4.64.1)
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (1.23.5)
Requirement already satisfied: filelock in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (3.9.0)
Requirement already satisfied: torch>=1.4.0 in /opt/conda/lib/python3.9/site-packages (from accelerate==0.17.1->-r requirements.txt (line 3)) (1.13.1+cu117)
Requirement already satisfied: psutil in /opt/conda/lib/python3.9/site-packages (from accelerate==0.17.1->-r requirements.txt (line 3)) (5.9.4)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.11.0->transformers==4.27.1->-r requirements.txt (line 2)) (4.4.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.9/site-packages (from requests->transformers==4.27.1->-r requirements.txt (line 2)) (2.1.1)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.9/site-packages (from requests->transformers==4.27.1->-r requirements.txt (line 2)) (1.26.14)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.9/site-packages (from requests->transformers==4.27.1->-r requirements.txt (line 2)) (3.4)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.9/site-packages (from requests->transformers==4.27.1->-r requirements.txt (line 2)) (2022.12.7)
Building wheels for collected packages: peft
Building wheel for peft (pyproject.toml): started
Building wheel for peft (pyproject.toml): finished with status 'done'
Created wheel for peft: filename=peft-0.4.0.dev0-py3-none-any.whl size=56306 sha256=73c1e5f8f4d7e5b949f205fc7586cf396cf810571a1d36e3df2f650cc8c9f205
Stored in directory: /tmp/pip-ephem-wheel-cache-l1th6vic/wheels/2d/60/1b/0edd9dc0f0c489738b1166bc1b0b560ee368f7721f89d06e3a
Successfully built peft
Installing collected packages: bitsandbytes, accelerate, transformers, peft
Attempting uninstall: accelerate
Found existing installation: accelerate 0.16.0
Uninstalling accelerate-0.16.0:
Successfully uninstalled accelerate-0.16.0
Attempting uninstall: transformers
Found existing installation: transformers 4.26.0
Uninstalling transformers-4.26.0:
Successfully uninstalled transformers-4.26.0
Successfully installed accelerate-0.17.1 bitsandbytes-0.37.1 peft-0.4.0.dev0 transformers-4.27.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 23.0 -> 23.1.2
[notice] To update, run: pip install --upgrade pip
2023-05-10 14:15:51,416 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2023-05-10 14:15:51,416 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 0 from exiting process.
2023-05-10 14:15:51,434 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-05-10 14:15:51,462 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-05-10 14:15:51,490 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-05-10 14:15:51,501 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "current_instance_group": "homogeneousCluster",
    "current_instance_group_hosts": [
        "algo-1"
    ],
    "current_instance_type": "ml.g5.16xlarge",
    "distribution_hosts": [],
    "distribution_instance_groups": [],
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "dataset_path": "/opt/ml/input/data/training",
        "epochs": 5,
        "lr": 0.0002,
        "model_id": "bigscience/bloomz-7b1",
        "per_device_train_batch_size": 1
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "instance_groups": [
        "homogeneousCluster"
    ],
    "instance_groups_dict": {
        "homogeneousCluster": {
            "instance_group_name": "homogeneousCluster",
            "instance_type": "ml.g5.16xlarge",
            "hosts": [
                "algo-1"
            ]
        }
    },
    "is_hetero": false,
    "is_master": true,
    "is_modelparallel_enabled": null,
    "is_smddpmprun_installed": true,
    "job_name": "huggingface-peft-chat-2023-05-10-14-10--2023-05-10-14-10-27-991",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-east-1-197614225699/huggingface-peft-chat-2023-05-10-14-10--2023-05-10-14-10-27-991/source/sourcedir.tar.gz",
    "module_name": "run_clm",
    "network_interface_name": "eth0",
    "num_cpus": 64,
    "num_gpus": 1,
    "num_neurons": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.g5.16xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.g5.16xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "run_clm.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"dataset_path":"/opt/ml/input/data/training","epochs":5,"lr":0.0002,"model_id":"bigscience/bloomz-7b1","per_device_train_batch_size":1}
SM_USER_ENTRY_POINT=run_clm.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g5.16xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.16xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_CURRENT_INSTANCE_TYPE=ml.g5.16xlarge
SM_CURRENT_INSTANCE_GROUP=homogeneousCluster
SM_CURRENT_INSTANCE_GROUP_HOSTS=["algo-1"]
SM_INSTANCE_GROUPS=["homogeneousCluster"]
SM_INSTANCE_GROUPS_DICT={"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.16xlarge"}}
SM_DISTRIBUTION_INSTANCE_GROUPS=[]
SM_IS_HETERO=false
SM_MODULE_NAME=run_clm
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=64
SM_NUM_GPUS=1
SM_NUM_NEURONS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-east-1-197614225699/huggingface-peft-chat-2023-05-10-14-10--2023-05-10-14-10-27-991/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","current_instance_group":"homogeneousCluster","current_instance_group_hosts":["algo-1"],"current_instance_type":"ml.g5.16xlarge","distribution_hosts":[],"distribution_instance_groups":[],"framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"dataset_path":"/opt/ml/input/data/training","epochs":5,"lr":0.0002,"model_id":"bigscience/bloomz-7b1","per_device_train_batch_size":1},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","instance_groups":["homogeneousCluster"],"instance_groups_dict":{"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.16xlarge"}},"is_hetero":false,"is_master":true,"is_modelparallel_enabled":null,"is_smddpmprun_installed":true,"job_name":"huggingface-peft-chat-2023-05-10-14-10--2023-05-10-14-10-27-991","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-1-197614225699/huggingface-peft-chat-2023-05-10-14-10--2023-05-10-14-10-27-991/source/sourcedir.tar.gz","module_name":"run_clm","network_interface_name":"eth0","num_cpus":64,"num_gpus":1,"num_neurons":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g5.16xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.16xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"run_clm.py"}
SM_USER_ARGS=["--dataset_path","/opt/ml/input/data/training","--epochs","5","--lr","0.0002","--model_id","bigscience/bloomz-7b1","--per_device_train_batch_size","1"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_DATASET_PATH=/opt/ml/input/data/training
SM_HP_EPOCHS=5
SM_HP_LR=0.0002
SM_HP_MODEL_ID=bigscience/bloomz-7b1
SM_HP_PER_DEVICE_TRAIN_BATCH_SIZE=1
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python39.zip:/opt/conda/lib/python3.9:/opt/conda/lib/python3.9/lib-dynload:/opt/conda/lib/python3.9/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.9 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 5 --lr 0.0002 --model_id bigscience/bloomz-7b1 --per_device_train_batch_size 1
[2023-05-10 14:15:53.081: W smdistributed/modelparallel/torch/nn/predefined_hooks.py:78] Found unsupported HuggingFace version 4.27.1 for automated tensor parallelism. HuggingFace modules will not be automatically distributed. You can use smp.tp_register_with_module API to register desired modules for tensor parallelism, or directly instantiate an smp.nn.DistributedModule. Supported HuggingFace transformers versions for automated tensor parallelism: ['4.17.0', '4.20.1', '4.21.0']
2023-05-10 14:15:53,084 root         INFO     Using NamedTuple = typing._NamedTuple instead.
2023-05-10 14:15:53,105 sagemaker-training-toolkit INFO     Exceptions not imported for SageMaker TF as Tensorflow is not installed.
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')}
  warn(msg)
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/lib:/opt/amazon/openmpi/lib/:/opt/amazon/efa/lib/:/opt/conda/lib:/usr/local/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/lib did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('arn'), PosixPath('aws'), PosixPath('197614225699'), PosixPath('training-job/huggingface-peft-chat-2023-05-10-14-10--2023-05-10-14-10-27-991'), PosixPath('sagemaker'), PosixPath('us-east-1')}
Downloading (…)"pytorch_model.bin";:  99%|█████████▉| 14.1G/14.1G [00:32<00:00, 475MB/s]
Downloading (…)"pytorch_model.bin";: 100%|█████████▉| 14.1G/14.1G [00:32<00:00, 450MB/s]
Downloading (…)"pytorch_model.bin";: 100%|██████████| 14.1G/14.1G [00:32<00:00, 435MB/s]
trainable params: 3932160 || all params: 7072948224 || trainable%: 0.055594355783029126
0%|          | 0/355 [00:00<?, ?it/s]
[2023-05-10 14:16:51.948: W smdistributed/modelparallel/torch/nn/predefined_hooks.py:78] Found unsupported HuggingFace version 4.27.1 for automated tensor parallelism. HuggingFace modules will not be automatically distributed. You can use smp.tp_register_with_module API to register desired modules for tensor parallelism, or directly instantiate an smp.nn.DistributedModule. Supported HuggingFace transformers versions for automated tensor parallelism: ['4.17.0', '4.20.1', '4.21.0']
INFO:root:Using NamedTuple = typing._NamedTuple instead.
[2023-05-10 14:16:51.975 algo-1:142 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-05-10 14:16:52.000 algo-1:142 INFO profiler_config_parser.py:111] User has disabled profiler.
[2023-05-10 14:16:52.001 algo-1:142 INFO json_config.py:92] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2023-05-10 14:16:52.001 algo-1:142 INFO hook.py:206] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2023-05-10 14:16:52.002 algo-1:142 INFO hook.py:259] Saving to /opt/ml/output/tensors
[2023-05-10 14:16:52.002 algo-1:142 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
/opt/conda/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py:298: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
/opt/conda/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py:298: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
0%|          | 1/355 [00:10<1:01:34, 10.44s/it]
1%|          | 2/355 [00:19<57:33,  9.78s/it]
1%|          | 3/355 [00:29<56:12,  9.58s/it]
1%|          | 4/355 [00:38<55:30,  9.49s/it]
1%|▏         | 5/355 [00:47<55:03,  9.44s/it]
2%|▏         | 6/355 [00:57<54:42,  9.41s/it]
2%|▏         | 7/355 [01:06<54:25,  9.38s/it]
2%|▏         | 8/355 [01:15<54:11,  9.37s/it]
3%|▎         | 9/355 [01:25<53:59,  9.36s/it]
3%|▎         | 10/355 [01:34<53:48,  9.36s/it]
{'loss': 2.6462, 'learning_rate': 0.00019436619718309861, 'epoch': 0.14}
3%|▎         | 10/355 [01:34<53:48,  9.36s/it]
3%|▎         | 11/355 [01:43<53:37,  9.35s/it]
3%|▎         | 12/355 [01:53<53:27,  9.35s/it]
4%|▎         | 13/355 [02:02<53:17,  9.35s/it]
4%|▍         | 14/355 [02:11<53:08,  9.35s/it]
4%|▍         | 15/355 [02:21<52:59,  9.35s/it]
5%|▍         | 16/355 [02:30<52:50,  9.35s/it]
5%|▍         | 17/355 [02:39<52:40,  9.35s/it]
5%|▌         | 18/355 [02:49<52:31,  9.35s/it]
5%|▌         | 19/355 [02:58<52:22,  9.35s/it]
6%|▌         | 20/355 [03:08<52:13,  9.35s/it]
6%|▌         | 20/355 [03:08<52:13,  9.35s/it]
{'loss': 2.2786, 'learning_rate': 0.0001887323943661972, 'epoch': 0.28}
6%|▌         | 21/355 [03:17<52:03,  9.35s/it]
6%|▌         | 22/355 [03:26<51:55,  9.35s/it]
6%|▋         | 23/355 [03:36<51:45,  9.36s/it]
7%|▋         | 24/355 [03:45<51:36,  9.36s/it]
7%|▋         | 25/355 [03:54<51:27,  9.36s/it]
7%|▋         | 26/355 [04:04<51:18,  9.36s/it]
8%|▊         | 27/355 [04:13<51:09,  9.36s/it]
8%|▊         | 28/355 [04:22<51:00,  9.36s/it]
8%|▊         | 29/355 [04:32<50:51,  9.36s/it]
8%|▊         | 30/355 [04:41<50:42,  9.36s/it]
8%|▊         | 30/355 [04:41<50:42,  9.36s/it]
{'loss': 2.1658, 'learning_rate': 0.0001830985915492958, 'epoch': 0.42}
9%|▊         | 31/355 [04:50<50:33,  9.36s/it]
9%|▉         | 32/355 [05:00<50:24,  9.36s/it]
9%|▉         | 33/355 [05:09<50:14,  9.36s/it]
10%|▉         | 34/355 [05:19<50:04,  9.36s/it]
10%|▉         | 35/355 [05:28<49:54,  9.36s/it]
10%|█         | 36/355 [05:37<49:45,  9.36s/it]
10%|█         | 37/355 [05:47<49:35,  9.36s/it]
11%|█         | 38/355 [05:56<49:26,  9.36s/it]
11%|█         | 39/355 [06:05<49:16,  9.36s/it]
11%|█▏        | 40/355 [06:15<49:07,  9.36s/it]
11%|█▏        | 40/355 [06:15<49:07,  9.36s/it]
{'loss': 2.1375, 'learning_rate': 0.00017746478873239437, 'epoch': 0.56}
12%|█▏        | 41/355 [06:24<48:58,  9.36s/it]
12%|█▏        | 42/355 [06:33<48:49,  9.36s/it]
12%|█▏        | 43/355 [06:43<48:39,  9.36s/it]
12%|█▏        | 44/355 [06:52<48:30,  9.36s/it]
13%|█▎        | 45/355 [07:01<48:20,  9.36s/it]
13%|█▎        | 46/355 [07:11<48:11,  9.36s/it]
13%|█▎        | 47/355 [07:20<48:01,  9.36s/it]
14%|█▎        | 48/355 [07:30<47:52,  9.36s/it]
14%|█▍        | 49/355 [07:39<47:43,  9.36s/it]
14%|█▍        | 50/355 [07:48<47:33,  9.36s/it]
14%|█▍        | 50/355 [07:48<47:33,  9.36s/it]
{'loss': 2.0312, 'learning_rate': 0.00017183098591549295, 'epoch': 0.7}
14%|█▍        | 51/355 [07:58<47:24,  9.36s/it]
15%|█▍        | 52/355 [08:07<47:14,  9.35s/it]
15%|█▍        | 53/355 [08:16<47:04,  9.35s/it]
15%|█▌        | 54/355 [08:26<46:55,  9.35s/it]
15%|█▌        | 55/355 [08:35<46:45,  9.35s/it]
16%|█▌        | 56/355 [08:44<46:36,  9.35s/it]
16%|█▌        | 57/355 [08:54<46:27,  9.35s/it]
16%|█▋        | 58/355 [09:03<46:17,  9.35s/it]
17%|█▋        | 59/355 [09:12<46:09,  9.36s/it]
17%|█▋        | 60/355 [09:22<46:00,  9.36s/it]
17%|█▋        | 60/355 [09:22<46:00,  9.36s/it]
{'loss': 2.0451, 'learning_rate': 0.00016619718309859155, 'epoch': 0.84}
17%|█▋        | 61/355 [09:31<45:50,  9.36s/it]

UnexpectedStatusException: Error for Training job huggingface-peft-chat-2023-05-10-14-10--2023-05-10-14-10-27-991: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "OutOfMemoryError: CUDA out of memory. Tried to allocate 1.91 GiB (GPU 0; 22.19
 GiB total capacity; 15.39 GiB already allocated; 1.56 GiB free; 19.44 GiB
 reserved in total by PyTorch) If reserved memory is >> allocated memory try
 setting max_split_size_mb to avoid fragmentation.  See documentation for Memory
 Management and PYTORCH_CUDA_ALLOC_CONF
 17%|█▋        | 61/355 [09:38<46:26,  9.48s/it]"
Command "/opt/conda/bin/python3.9 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 5 --lr 0.0002 --model_id bigscience/bloomz-7b1 --per_device_train_batch_size 1", exit code: 1

Can you try decreasing the batch size?

Thank you, Phillip.

Restarting the compute instance fixed this problem for me.

Hi guys, I am facing the same issue. I have set 'per_device_train_batch_size': 1 and restarted the notebook instance, but the problem still appears. Any suggestions?
Thanks