I’m running my training script on a SageMaker estimator, but oddly I’m running into an issue where bitsandbytes is not found: ErrorMessage "ModuleNotFoundError: No module named 'bitsandbytes'"
Pretty sure I’m just missing something stupid, but I’m not able to spot anything. Am I doing anything wrong here? This is how I’m launching the job:
import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define Training Job Name
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

model_id = 'meta-llama/Llama-2-7b-hf'

# hyperparameters, which are passed into the training job
hyperparameters = {
    'model_id': model_id,                          # pre-trained model
    'dataset_path': '/opt/ml/input/data/training', # path where sagemaker will save the training dataset
    'epochs': 3,                                   # number of training epochs
    'per_device_train_batch_size': 2,              # batch size for training
    'lr': 2e-4,                                    # learning rate used during training
    'hf_token': HfFolder.get_token(),              # huggingface token to access llama 2
    'merge_weights': True,                         # whether to merge LoRA into the model (needs more memory)
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'train.py',       # training script
    source_dir           = 'scripts',        # directory which includes all the files needed for training
    instance_type        = 'ml.g5.8xlarge',  # instance type used for the training job
    instance_count       = 1,                # the number of instances used for training
    base_job_name        = job_name,         # the name of the training job
    role                 = role,             # IAM role used by the training job to access AWS resources, e.g. S3
    volume_size          = 300,              # the size of the EBS volume in GB
    transformers_version = '4.28',           # the transformers version used in the training job
    pytorch_version      = '2.0',            # the pytorch version used in the training job
    py_version           = 'py310',          # the python version used in the training job
    hyperparameters      = hyperparameters,  # the hyperparameters passed to the training job
    environment          = {"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},  # set env variable to cache models in /tmp
)

data = {'training': training_input_path}

# starting the train job with our uploaded dataset as input
huggingface_estimator.fit(data, wait=True)
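One thing I’m not sure about: does the Hugging Face DLC for transformers 4.28 / PyTorch 2.0 already ship bitsandbytes, or do I need to put a requirements.txt inside scripts/ so the training toolkit pip-installs extras before invoking train.py? If it’s the latter, I’d guess something like this is what’s expected (hypothetical file; the package list just mirrors the imports in my train.py, and the versions would still need pinning):

# scripts/requirements.txt (hypothetical sketch; pin versions as needed)
bitsandbytes
peft

Here’s the full log from the failed job: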
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
Using provided s3_resource
INFO:sagemaker:Creating training-job with name: huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897
2023-09-04 21:07:15 Starting - Starting the training job...
2023-09-04 21:07:41 Starting - Preparing the instances for training......
2023-09-04 21:08:28 Downloading - Downloading input data...
2023-09-04 21:08:48 Training - Downloading the training image..................
2023-09-04 21:12:00 Training - Training image download completed. Training in progress.......bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-09-04 21:13:06,200 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2023-09-04 21:13:06,214 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2023-09-04 21:13:06,223 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2023-09-04 21:13:06,225 sagemaker_pytorch_container.training INFO Invoking user training script.
2023-09-04 21:13:07,563 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2023-09-04 21:13:07,798 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2023-09-04 21:13:07,821 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2023-09-04 21:13:07,832 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
"additional_framework_parameters": {},
"channel_input_dirs": {
"training": "/opt/ml/input/data/training"
},
"current_host": "algo-1",
"current_instance_group": "homogeneousCluster",
"current_instance_group_hosts": [
"algo-1"
],
"current_instance_type": "ml.g5.8xlarge",
"distribution_hosts": [],
"distribution_instance_groups": [],
"framework_module": "sagemaker_pytorch_container.training:main",
"hosts": [
"algo-1"
],
"hyperparameters": {
"dataset_path": "/opt/ml/input/data/training",
"epochs": 3,
"hf_token": "***********",
"lr": 0.0002,
"merge_weights": true,
"model_id": "meta-llama/Llama-2-7b-hf",
"per_device_train_batch_size": 2
},
"input_config_dir": "/opt/ml/input/config",
"input_data_config": {
"training": {
"TrainingInputMode": "File",
"S3DistributionType": "FullyReplicated",
"RecordWrapperType": "None"
}
},
"input_dir": "/opt/ml/input",
"instance_groups": [
"homogeneousCluster"
],
"instance_groups_dict": {
"homogeneousCluster": {
"instance_group_name": "homogeneousCluster",
"instance_type": "ml.g5.8xlarge",
"hosts": [
"algo-1"
]
}
},
"is_hetero": false,
"is_master": true,
"is_modelparallel_enabled": null,
"is_smddpmprun_installed": true,
"job_name": "huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897",
"log_level": 20,
"master_hostname": "algo-1",
"model_dir": "/opt/ml/model",
"module_dir": "s3://sagemaker-us-east-1-535772764458/huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897/source/sourcedir.tar.gz",
"module_name": "train",
"network_interface_name": "eth0",
"num_cpus": 32,
"num_gpus": 1,
"num_neurons": 0,
"output_data_dir": "/opt/ml/output/data",
"output_dir": "/opt/ml/output",
"output_intermediate_dir": "/opt/ml/output/intermediate",
"resource_config": {
"current_host": "algo-1",
"current_instance_type": "ml.g5.8xlarge",
"current_group_name": "homogeneousCluster",
"hosts": [
"algo-1"
],
"instance_groups": [
{
"instance_group_name": "homogeneousCluster",
"instance_type": "ml.g5.8xlarge",
"hosts": [
"algo-1"
]
}
],
"network_interface_name": "eth0"
},
"user_entry_point": "train.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"dataset_path":"/opt/ml/input/data/training","epochs":3,"hf_token":"*************","lr":0.0002,"merge_weights":true,"model_id":"meta-llama/Llama-2-7b-hf","per_device_train_batch_size":2}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g5.8xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.8xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_CURRENT_INSTANCE_TYPE=ml.g5.8xlarge
SM_CURRENT_INSTANCE_GROUP=homogeneousCluster
SM_CURRENT_INSTANCE_GROUP_HOSTS=["algo-1"]
SM_INSTANCE_GROUPS=["homogeneousCluster"]
SM_INSTANCE_GROUPS_DICT={"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.8xlarge"}}
SM_DISTRIBUTION_INSTANCE_GROUPS=[]
SM_IS_HETERO=false
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=32
SM_NUM_GPUS=1
SM_NUM_NEURONS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-east-1-535772764458/huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","current_instance_group":"homogeneousCluster","current_instance_group_hosts":["algo-1"],"current_instance_type":"ml.g5.8xlarge","distribution_hosts":[],"distribution_instance_groups":[],"framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"dataset_path":"/opt/ml/input/data/training","epochs":3,"hf_token":"******************","lr":0.0002,"merge_weights":true,"model_id":"meta-llama/Llama-2-7b-hf","per_device_train_batch_size":2},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","instance_groups":["homogeneousCluster"],"instance_groups_dict":{"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.8xlarge"}},"is_hetero":false,"is_master":true,"is_modelparallel_enabled":null,"is_smddpmprun_installed":true,"job_name":"huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-1-535772764458/huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":32,"num_gpus":1,"num_neurons":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g5.8xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.8xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--dataset_path","/opt/ml/input/data/training","--epochs","3","--hf_token","*****************","--lr","0.0002","--merge_weights","True","--model_id","meta-llama/Llama-2-7b-hf","--per_device_train_batch_size","2"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_DATASET_PATH=/opt/ml/input/data/training
SM_HP_EPOCHS=3
SM_HP_HF_TOKEN=******************
SM_HP_LR=0.0002
SM_HP_MERGE_WEIGHTS=true
SM_HP_MODEL_ID=meta-llama/Llama-2-7b-hf
SM_HP_PER_DEVICE_TRAIN_BATCH_SIZE=2
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python310.zip:/opt/conda/lib/python3.10:/opt/conda/lib/python3.10/lib-dynload:/opt/conda/lib/python3.10/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.10 train.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token *************** --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-7b-hf --per_device_train_batch_size 2
2023-09-04 21:13:07,858 sagemaker-training-toolkit INFO Exceptions not imported for SageMaker TF as Tensorflow is not installed.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/ml/code/train.py:5 in <module> │
│ │
│ 2 from functools import partial │
│ 3 import torch │
│ 4 from transformers import (AutoModelForCausalLM, AutoTokenizer,set_seed │
│ ❱ 5 import bitsandbytes as bnb │
│ 6 from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_tr │
│ 7 │
│ 8 │
╰──────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'bitsandbytes'
2023-09-04 21:13:10,983 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2023-09-04 21:13:10,984 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
2023-09-04 21:13:10,984 sagemaker-training-toolkit ERROR Reporting training FAILURE
2023-09-04 21:13:10,984 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage "ModuleNotFoundError: No module named 'bitsandbytes'"
Command "/opt/conda/bin/python3.10 train.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token ******************* --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-7b-hf --per_device_train_batch_size 2"
2023-09-04 21:13:10,984 sagemaker-training-toolkit ERROR Encountered exit_code 1
2023-09-04 21:13:32 Uploading - Uploading generated training model
2023-09-04 21:13:32 Failed - Training job failed
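Happy to share more details if needed. For debugging I could also dump the packages visible to the container’s interpreter at the top of train.py; a minimal sketch using only the standard library (nothing assumed beyond the Python 3.10 in the container):

import importlib.metadata

# List every distribution visible to the training container's Python,
# to confirm whether bitsandbytes/peft were ever installed.
for dist in sorted(importlib.metadata.distributions(),
                   key=lambda d: (d.metadata['Name'] or '').lower()):
    print(dist.metadata['Name'], dist.version)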