Training setup:
import time
from sagemaker.huggingface import HuggingFace

# define the training job name
job_name = f'huggingface-peft-chat-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# hyperparameters, which are passed into the training job
hyperparameters = {
    'model_id': model_id,                           # pre-trained model
    'dataset_path': '/opt/ml/input/data/training',  # path where SageMaker mounts the training dataset
    'epochs': 5,                                    # number of training epochs
    'per_device_train_batch_size': 1,               # batch size for training
    'lr': 2e-4,                                     # learning rate used during training
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',       # training script
    source_dir           = 'scripts',           # directory with all files needed for training
    instance_type        = 'ml.g5.16xlarge',    # instance type used for the training job
    instance_count       = 1,                   # number of instances used for training
    base_job_name        = job_name,            # name of the training job
    role                 = role,                # IAM role used by the training job to access AWS resources, e.g. S3
    volume_size          = 300,                 # size of the EBS volume in GB
    transformers_version = '4.26',              # transformers version used in the training job
    pytorch_version      = '1.13',              # pytorch version used in the training job
    py_version           = 'py39',              # python version used in the training job
    hyperparameters      = hyperparameters,
)
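The job is then launched with fit(), pointing the training channel at the S3 prefix holding the prepared dataset. A minimal sketch; training_input_path is a placeholder name for wherever the tokenized dataset was uploaded, since that cell isn't reproduced here:

# launch the training job; the 'training' channel becomes SM_CHANNEL_TRAINING / dataset_path in the container
# training_input_path (assumed name) is the S3 URI of the uploaded dataset
huggingface_estimator.fit({'training': training_input_path}, wait=True)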
Requirements file:
git+https://github.com/huggingface/peft.git
transformers==4.27.1
accelerate==0.17.1
bitsandbytes==0.37.1
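run_clm.py itself isn't included here, but as the invocation command in the log below shows, the hyperparameters arrive as command-line flags. A minimal sketch of the argument parsing this implies; the flag names come from the log, everything else (defaults, helper name) is an assumption:

# run_clm.py -- sketch of the argument parsing only, not the actual training script
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, help="Hugging Face Hub model id, e.g. bigscience/bloomz-7b1")
    parser.add_argument("--dataset_path", type=str, default="/opt/ml/input/data/training")
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--per_device_train_batch_size", type=int, default=1)
    parser.add_argument("--lr", type=float, default=2e-4)
    return parser.parse_args()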
Training error:
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
Using provided s3_resource
INFO:sagemaker:Creating training-job with name: huggingface-peft-chat-2023-05-10-14-10--2023-05-10-14-10-27-991
2023-05-10 14:10:28 Starting - Starting the training job...
2023-05-10 14:10:49 Starting - Preparing the instances for training......
2023-05-10 14:11:56 Downloading - Downloading input data...
2023-05-10 14:12:11 Training - Downloading the training image...............
2023-05-10 14:14:52 Training - Training image download completed. Training in progress......bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-05-10 14:15:37,684 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2023-05-10 14:15:37,699 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2023-05-10 14:15:37,709 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2023-05-10 14:15:37,711 sagemaker_pytorch_container.training INFO Invoking user training script.
2023-05-10 14:15:37,926 sagemaker-training-toolkit INFO Installing dependencies from requirements.txt:
/opt/conda/bin/python3.9 -m pip install -r requirements.txt
Collecting git+https://github.com/huggingface/peft.git (from -r requirements.txt (line 1))
Cloning https://github.com/huggingface/peft.git to /tmp/pip-req-build-dkjr7e3k
Running command git clone --filter=blob:none --quiet https://github.com/huggingface/peft.git /tmp/pip-req-build-dkjr7e3k
Resolved https://github.com/huggingface/peft.git to commit 4fd374e80d670781c0d82c96ce94d1215ff23306
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'done'
Preparing metadata (pyproject.toml): started
Preparing metadata (pyproject.toml): finished with status 'done'
Collecting transformers==4.27.1
Downloading transformers-4.27.1-py3-none-any.whl (6.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.7/6.7 MB 92.5 MB/s eta 0:00:00
Collecting accelerate==0.17.1
Downloading accelerate-0.17.1-py3-none-any.whl (212 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 212.8/212.8 kB 50.8 MB/s eta 0:00:00
Collecting bitsandbytes==0.37.1
Downloading bitsandbytes-0.37.1-py3-none-any.whl (76.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76.3/76.3 MB 34.7 MB/s eta 0:00:00
Requirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (2022.10.31)
Requirement already satisfied: requests in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (2.28.2)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (0.13.2)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (23.0)
Requirement already satisfied: huggingface-hub<1.0,>=0.11.0 in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (0.12.0)
Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (5.4.1)
Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (4.64.1)
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (1.23.5)
Requirement already satisfied: filelock in /opt/conda/lib/python3.9/site-packages (from transformers==4.27.1->-r requirements.txt (line 2)) (3.9.0)
Requirement already satisfied: torch>=1.4.0 in /opt/conda/lib/python3.9/site-packages (from accelerate==0.17.1->-r requirements.txt (line 3)) (1.13.1+cu117)
Requirement already satisfied: psutil in /opt/conda/lib/python3.9/site-packages (from accelerate==0.17.1->-r requirements.txt (line 3)) (5.9.4)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/conda/lib/python3.9/site-packages (from huggingface-hub<1.0,>=0.11.0->transformers==4.27.1->-r requirements.txt (line 2)) (4.4.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.9/site-packages (from requests->transformers==4.27.1->-r requirements.txt (line 2)) (2.1.1)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.9/site-packages (from requests->transformers==4.27.1->-r requirements.txt (line 2)) (1.26.14)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.9/site-packages (from requests->transformers==4.27.1->-r requirements.txt (line 2)) (3.4)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.9/site-packages (from requests->transformers==4.27.1->-r requirements.txt (line 2)) (2022.12.7)
Building wheels for collected packages: peft
Building wheel for peft (pyproject.toml): started
Building wheel for peft (pyproject.toml): finished with status 'done'
Created wheel for peft: filename=peft-0.4.0.dev0-py3-none-any.whl size=56306 sha256=73c1e5f8f4d7e5b949f205fc7586cf396cf810571a1d36e3df2f650cc8c9f205
Stored in directory: /tmp/pip-ephem-wheel-cache-l1th6vic/wheels/2d/60/1b/0edd9dc0f0c489738b1166bc1b0b560ee368f7721f89d06e3a
Successfully built peft
Installing collected packages: bitsandbytes, accelerate, transformers, peft
Attempting uninstall: accelerate
Found existing installation: accelerate 0.16.0
Uninstalling accelerate-0.16.0:
Successfully uninstalled accelerate-0.16.0
Attempting uninstall: transformers
Found existing installation: transformers 4.26.0
Uninstalling transformers-4.26.0:
Successfully uninstalled transformers-4.26.0
Successfully installed accelerate-0.17.1 bitsandbytes-0.37.1 peft-0.4.0.dev0 transformers-4.27.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 23.0 -> 23.1.2
[notice] To update, run: pip install --upgrade pip
2023-05-10 14:15:51,416 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2023-05-10 14:15:51,416 sagemaker-training-toolkit INFO Done waiting for a return code. Received 0 from exiting process.
2023-05-10 14:15:51,434 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2023-05-10 14:15:51,462 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2023-05-10 14:15:51,490 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2023-05-10 14:15:51,501 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
"additional_framework_parameters": {},
"channel_input_dirs": {
"training": "/opt/ml/input/data/training"
},
"current_host": "algo-1",
"current_instance_group": "homogeneousCluster",
"current_instance_group_hosts": [
"algo-1"
],
"current_instance_type": "ml.g5.16xlarge",
"distribution_hosts": [],
"distribution_instance_groups": [],
"framework_module": "sagemaker_pytorch_container.training:main",
"hosts": [
"algo-1"
],
"hyperparameters": {
"dataset_path": "/opt/ml/input/data/training",
"epochs": 5,
"lr": 0.0002,
"model_id": "bigscience/bloomz-7b1",
"per_device_train_batch_size": 1
},
"input_config_dir": "/opt/ml/input/config",
"input_data_config": {
"training": {
"TrainingInputMode": "File",
"S3DistributionType": "FullyReplicated",
"RecordWrapperType": "None"
}
},
"input_dir": "/opt/ml/input",
"instance_groups": [
"homogeneousCluster"
],
"instance_groups_dict": {
"homogeneousCluster": {
"instance_group_name": "homogeneousCluster",
"instance_type": "ml.g5.16xlarge",
"hosts": [
"algo-1"
]
}
},
"is_hetero": false,
"is_master": true,
"is_modelparallel_enabled": null,
"is_smddpmprun_installed": true,
"job_name": "huggingface-peft-chat-2023-05-10-14-10--2023-05-10-14-10-27-991",
"log_level": 20,
"master_hostname": "algo-1",
"model_dir": "/opt/ml/model",
"module_dir": "s3://sagemaker-us-east-1-197614225699/huggingface-peft-chat-2023-05-10-14-10--2023-05-10-14-10-27-991/source/sourcedir.tar.gz",
"module_name": "run_clm",
"network_interface_name": "eth0",
"num_cpus": 64,
"num_gpus": 1,
"num_neurons": 0,
"output_data_dir": "/opt/ml/output/data",
"output_dir": "/opt/ml/output",
"output_intermediate_dir": "/opt/ml/output/intermediate",
"resource_config": {
"current_host": "algo-1",
"current_instance_type": "ml.g5.16xlarge",
"current_group_name": "homogeneousCluster",
"hosts": [
"algo-1"
],
"instance_groups": [
{
"instance_group_name": "homogeneousCluster",
"instance_type": "ml.g5.16xlarge",
"hosts": [
"algo-1"
]
}
],
"network_interface_name": "eth0"
},
"user_entry_point": "run_clm.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"dataset_path":"/opt/ml/input/data/training","epochs":5,"lr":0.0002,"model_id":"bigscience/bloomz-7b1","per_device_train_batch_size":1}
SM_USER_ENTRY_POINT=run_clm.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g5.16xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.16xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_CURRENT_INSTANCE_TYPE=ml.g5.16xlarge
SM_CURRENT_INSTANCE_GROUP=homogeneousCluster
SM_CURRENT_INSTANCE_GROUP_HOSTS=["algo-1"]
SM_INSTANCE_GROUPS=["homogeneousCluster"]
SM_INSTANCE_GROUPS_DICT={"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.16xlarge"}}
SM_DISTRIBUTION_INSTANCE_GROUPS=[]
SM_IS_HETERO=false
SM_MODULE_NAME=run_clm
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=64
SM_NUM_GPUS=1
SM_NUM_NEURONS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-east-1-197614225699/huggingface-peft-chat-2023-05-10-14-10--2023-05-10-14-10-27-991/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","current_instance_group":"homogeneousCluster","current_instance_group_hosts":["algo-1"],"current_instance_type":"ml.g5.16xlarge","distribution_hosts":[],"distribution_instance_groups":[],"framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"dataset_path":"/opt/ml/input/data/training","epochs":5,"lr":0.0002,"model_id":"bigscience/bloomz-7b1","per_device_train_batch_size":1},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","instance_groups":["homogeneousCluster"],"instance_groups_dict":{"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.16xlarge"}},"is_hetero":false,"is_master":true,"is_modelparallel_enabled":null,"is_smddpmprun_installed":true,"job_name":"huggingface-peft-chat-2023-05-10-14-10--2023-05-10-14-10-27-991","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-1-197614225699/huggingface-peft-chat-2023-05-10-14-10--2023-05-10-14-10-27-991/source/sourcedir.tar.gz","module_name":"run_clm","network_interface_name":"eth0","num_cpus":64,"num_gpus":1,"num_neurons":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g5.16xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.16xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"run_clm.py"}
SM_USER_ARGS=["--dataset_path","/opt/ml/input/data/training","--epochs","5","--lr","0.0002","--model_id","bigscience/bloomz-7b1","--per_device_train_batch_size","1"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_DATASET_PATH=/opt/ml/input/data/training
SM_HP_EPOCHS=5
SM_HP_LR=0.0002
SM_HP_MODEL_ID=bigscience/bloomz-7b1
SM_HP_PER_DEVICE_TRAIN_BATCH_SIZE=1
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python39.zip:/opt/conda/lib/python3.9:/opt/conda/lib/python3.9/lib-dynload:/opt/conda/lib/python3.9/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.9 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 5 --lr 0.0002 --model_id bigscience/bloomz-7b1 --per_device_train_batch_size 1
[2023-05-10 14:15:53.081: W smdistributed/modelparallel/torch/nn/predefined_hooks.py:78] Found unsupported HuggingFace version 4.27.1 for automated tensor parallelism. HuggingFace modules will not be automatically distributed. You can use smp.tp_register_with_module API to register desired modules for tensor parallelism, or directly instantiate an smp.nn.DistributedModule. Supported HuggingFace transformers versions for automated tensor parallelism: ['4.17.0', '4.20.1', '4.21.0']
2023-05-10 14:15:53,084 root INFO Using NamedTuple = typing._NamedTuple instead.
2023-05-10 14:15:53,105 sagemaker-training-toolkit INFO Exceptions not imported for SageMaker TF as Tensorflow is not installed.
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')}
warn(msg)
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /opt/conda/lib/python3.9/site-packages/smdistributed/dataparallel/lib:/opt/amazon/openmpi/lib/:/opt/amazon/efa/lib/:/opt/conda/lib:/usr/local/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/lib did not contain libcudart.so as expected! Searching further paths...
warn(msg)
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('arn'), PosixPath('aws'), PosixPath('197614225699'), PosixPath('training-job/huggingface-peft-chat-2023-05-10-14-10--2023-05-10-14-10-27-991'), PosixPath('sagemaker'), PosixPath('us-east-1')}
Downloading (…)"pytorch_model.bin";: 99%|█████████▉| 14.1G/14.1G [00:32<00:00, 475MB/s]
Downloading (…)"pytorch_model.bin";: 100%|█████████▉| 14.1G/14.1G [00:32<00:00, 450MB/s]
Downloading (…)"pytorch_model.bin";: 100%|██████████| 14.1G/14.1G [00:32<00:00, 435MB/s]
trainable params: 3932160 || all params: 7072948224 || trainable%: 0.055594355783029126
0%| | 0/355 [00:00<?, ?it/s]
[2023-05-10 14:16:51.948: W smdistributed/modelparallel/torch/nn/predefined_hooks.py:78] Found unsupported HuggingFace version 4.27.1 for automated tensor parallelism. HuggingFace modules will not be automatically distributed. You can use smp.tp_register_with_module API to register desired modules for tensor parallelism, or directly instantiate an smp.nn.DistributedModule. Supported HuggingFace transformers versions for automated tensor parallelism: ['4.17.0', '4.20.1', '4.21.0']
INFO:root:Using NamedTuple = typing._NamedTuple instead.
[2023-05-10 14:16:51.975 algo-1:142 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-05-10 14:16:52.000 algo-1:142 INFO profiler_config_parser.py:111] User has disabled profiler.
[2023-05-10 14:16:52.001 algo-1:142 INFO json_config.py:92] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2023-05-10 14:16:52.001 algo-1:142 INFO hook.py:206] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2023-05-10 14:16:52.002 algo-1:142 INFO hook.py:259] Saving to /opt/ml/output/tensors
[2023-05-10 14:16:52.002 algo-1:142 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
/opt/conda/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py:298: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
/opt/conda/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py:298: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization
warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
0%| | 1/355 [00:10<1:01:34, 10.44s/it]
1%| | 2/355 [00:19<57:33, 9.78s/it]
1%| | 3/355 [00:29<56:12, 9.58s/it]
1%| | 4/355 [00:38<55:30, 9.49s/it]
1%|▏ | 5/355 [00:47<55:03, 9.44s/it]
2%|▏ | 6/355 [00:57<54:42, 9.41s/it]
2%|▏ | 7/355 [01:06<54:25, 9.38s/it]
2%|▏ | 8/355 [01:15<54:11, 9.37s/it]
3%|▎ | 9/355 [01:25<53:59, 9.36s/it]
3%|▎ | 10/355 [01:34<53:48, 9.36s/it]
{'loss': 2.6462, 'learning_rate': 0.00019436619718309861, 'epoch': 0.14}
3%|▎ | 11/355 [01:43<53:37, 9.35s/it]
3%|▎ | 12/355 [01:53<53:27, 9.35s/it]
4%|▎ | 13/355 [02:02<53:17, 9.35s/it]
4%|▍ | 14/355 [02:11<53:08, 9.35s/it]
4%|▍ | 15/355 [02:21<52:59, 9.35s/it]
5%|▍ | 16/355 [02:30<52:50, 9.35s/it]
5%|▍ | 17/355 [02:39<52:40, 9.35s/it]
5%|▌ | 18/355 [02:49<52:31, 9.35s/it]
5%|▌ | 19/355 [02:58<52:22, 9.35s/it]
6%|▌ | 20/355 [03:08<52:13, 9.35s/it]
{'loss': 2.2786, 'learning_rate': 0.0001887323943661972, 'epoch': 0.28}
6%|▌ | 21/355 [03:17<52:03, 9.35s/it]
6%|▌ | 22/355 [03:26<51:55, 9.35s/it]
6%|▋ | 23/355 [03:36<51:45, 9.36s/it]
7%|▋ | 24/355 [03:45<51:36, 9.36s/it]
7%|▋ | 25/355 [03:54<51:27, 9.36s/it]
7%|▋ | 26/355 [04:04<51:18, 9.36s/it]
8%|▊ | 27/355 [04:13<51:09, 9.36s/it]
8%|▊ | 28/355 [04:22<51:00, 9.36s/it]
8%|▊ | 29/355 [04:32<50:51, 9.36s/it]
8%|▊ | 30/355 [04:41<50:42, 9.36s/it]
{'loss': 2.1658, 'learning_rate': 0.0001830985915492958, 'epoch': 0.42}
9%|▊ | 31/355 [04:50<50:33, 9.36s/it]
9%|▉ | 32/355 [05:00<50:24, 9.36s/it]
9%|▉ | 33/355 [05:09<50:14, 9.36s/it]
10%|▉ | 34/355 [05:19<50:04, 9.36s/it]
10%|▉ | 35/355 [05:28<49:54, 9.36s/it]
10%|█ | 36/355 [05:37<49:45, 9.36s/it]
10%|█ | 37/355 [05:47<49:35, 9.36s/it]
11%|█ | 38/355 [05:56<49:26, 9.36s/it]
11%|█ | 39/355 [06:05<49:16, 9.36s/it]
11%|█▏ | 40/355 [06:15<49:07, 9.36s/it]
{'loss': 2.1375, 'learning_rate': 0.00017746478873239437, 'epoch': 0.56}
12%|█▏ | 41/355 [06:24<48:58, 9.36s/it]
12%|█▏ | 42/355 [06:33<48:49, 9.36s/it]
12%|█▏ | 43/355 [06:43<48:39, 9.36s/it]
12%|█▏ | 44/355 [06:52<48:30, 9.36s/it]
13%|█▎ | 45/355 [07:01<48:20, 9.36s/it]
13%|█▎ | 46/355 [07:11<48:11, 9.36s/it]
13%|█▎ | 47/355 [07:20<48:01, 9.36s/it]
14%|█▎ | 48/355 [07:30<47:52, 9.36s/it]
14%|█▍ | 49/355 [07:39<47:43, 9.36s/it]
14%|█▍ | 50/355 [07:48<47:33, 9.36s/it]
{'loss': 2.0312, 'learning_rate': 0.00017183098591549295, 'epoch': 0.7}
14%|█▍ | 51/355 [07:58<47:24, 9.36s/it]
15%|█▍ | 52/355 [08:07<47:14, 9.35s/it]
15%|█▍ | 53/355 [08:16<47:04, 9.35s/it]
15%|█▌ | 54/355 [08:26<46:55, 9.35s/it]
15%|█▌ | 55/355 [08:35<46:45, 9.35s/it]
16%|█▌ | 56/355 [08:44<46:36, 9.35s/it]
16%|█▌ | 57/355 [08:54<46:27, 9.35s/it]
16%|█▋ | 58/355 [09:03<46:17, 9.35s/it]
17%|█▋ | 59/355 [09:12<46:09, 9.36s/it]
17%|█▋ | 60/355 [09:22<46:00, 9.36s/it]
{'loss': 2.0451, 'learning_rate': 0.00016619718309859155, 'epoch': 0.84}
17%|█▋ | 61/355 [09:31<45:50, 9.36s/it]
Traceback (most recent call last):
UnexpectedStatusException: Error for Training job huggingface-peft-chat-2023-05-10-14-10--2023-05-10-14-10-27-991: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "OutOfMemoryError: CUDA out of memory. Tried to allocate 1.91 GiB (GPU 0; 22.19
GiB total capacity; 15.39 GiB already allocated; 1.56 GiB free; 19.44 GiB
reserved in total by PyTorch) If reserved memory is >> allocated memory try
setting max_split_size_mb to avoid fragmentation. See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF
17%|█▋ | 61/355 [09:38<46:26, 9.48s/it]"
Command "/opt/conda/bin/python3.9 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 5 --lr 0.0002 --model_id bigscience/bloomz-7b1 --per_device_train_batch_size 1", exit code: 1