Failing to run training script on SageMaker with HuggingFace estimator

I’m running my training script on a SageMaker HuggingFace estimator, but oddly I’m running into an issue where bitsandbytes is not found: ErrorMessage "ModuleNotFoundError: No module named 'bitsandbytes'"

Pretty sure I’m just missing something stupid, but I’m not able to spot anything. Am I doing anything wrong here?

import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define Training Job Name
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
model_id = 'meta-llama/Llama-2-7b-hf'
# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                             # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',    # path where sagemaker will save training dataset
  'epochs': 3,                                      # number of training epochs
  'per_device_train_batch_size': 2,                 # batch size for training
  'lr': 2e-4,                                       # learning rate used during training
  'hf_token': HfFolder.get_token(),                 # huggingface token to access llama 2
  'merge_weights': True,                            # whether to merge LoRA into the model (needs more memory)
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'train.py',        # training script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.8xlarge',   # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used by the training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the PyTorch version used in the training job
    py_version           = 'py310',           # the Python version used in the training job
    hyperparameters      = hyperparameters,   # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
)

data = {'training': training_input_path}  # S3 URI of the uploaded dataset (defined earlier in the notebook)

# starting the training job with our uploaded dataset as input
huggingface_estimator.fit(data, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
Using provided s3_resource
INFO:sagemaker:Creating training-job with name: huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897
2023-09-04 21:07:15 Starting - Starting the training job...
2023-09-04 21:07:41 Starting - Preparing the instances for training......
2023-09-04 21:08:28 Downloading - Downloading input data...
2023-09-04 21:08:48 Training - Downloading the training image..................
2023-09-04 21:12:00 Training - Training image download completed. Training in progress.......bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-09-04 21:13:06,200 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-09-04 21:13:06,214 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-04 21:13:06,223 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-09-04 21:13:06,225 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-09-04 21:13:07,563 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-04 21:13:07,798 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-04 21:13:07,821 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-04 21:13:07,832 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "current_instance_group": "homogeneousCluster",
    "current_instance_group_hosts": [
        "algo-1"
    ],
    "current_instance_type": "ml.g5.8xlarge",
    "distribution_hosts": [],
    "distribution_instance_groups": [],
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "dataset_path": "/opt/ml/input/data/training",
        "epochs": 3,
        "hf_token": "***********",
        "lr": 0.0002,
        "merge_weights": true,
        "model_id": "meta-llama/Llama-2-7b-hf",
        "per_device_train_batch_size": 2
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "instance_groups": [
        "homogeneousCluster"
    ],
    "instance_groups_dict": {
        "homogeneousCluster": {
            "instance_group_name": "homogeneousCluster",
            "instance_type": "ml.g5.8xlarge",
            "hosts": [
                "algo-1"
            ]
        }
    },
    "is_hetero": false,
    "is_master": true,
    "is_modelparallel_enabled": null,
    "is_smddpmprun_installed": true,
    "job_name": "huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-east-1-535772764458/huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897/source/sourcedir.tar.gz",
    "module_name": "train",
    "network_interface_name": "eth0",
    "num_cpus": 32,
    "num_gpus": 1,
    "num_neurons": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.g5.8xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.g5.8xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "train.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"dataset_path":"/opt/ml/input/data/training","epochs":3,"hf_token":"*************","lr":0.0002,"merge_weights":true,"model_id":"meta-llama/Llama-2-7b-hf","per_device_train_batch_size":2}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g5.8xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.8xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_CURRENT_INSTANCE_TYPE=ml.g5.8xlarge
SM_CURRENT_INSTANCE_GROUP=homogeneousCluster
SM_CURRENT_INSTANCE_GROUP_HOSTS=["algo-1"]
SM_INSTANCE_GROUPS=["homogeneousCluster"]
SM_INSTANCE_GROUPS_DICT={"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.8xlarge"}}
SM_DISTRIBUTION_INSTANCE_GROUPS=[]
SM_IS_HETERO=false
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=32
SM_NUM_GPUS=1
SM_NUM_NEURONS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-east-1-535772764458/huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","current_instance_group":"homogeneousCluster","current_instance_group_hosts":["algo-1"],"current_instance_type":"ml.g5.8xlarge","distribution_hosts":[],"distribution_instance_groups":[],"framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"dataset_path":"/opt/ml/input/data/training","epochs":3,"hf_token":"******************","lr":0.0002,"merge_weights":true,"model_id":"meta-llama/Llama-2-7b-hf","per_device_train_batch_size":2},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","instance_groups":["homogeneousCluster"],"instance_groups_dict":{"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.8xlarge"}},"is_hetero":false,"is_master":true,"is_modelparallel_enabled":null,"is_smddpmprun_installed":true,"job_name":"huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-1-535772764458/huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":32,"num_gpus":1,"num_neurons":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g5.8xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.8xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--dataset_path","/opt/ml/input/data/training","--epochs","3","--hf_token","*****************","--lr","0.0002","--merge_weights","True","--model_id","meta-llama/Llama-2-7b-hf","--per_device_train_batch_size","2"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_DATASET_PATH=/opt/ml/input/data/training
SM_HP_EPOCHS=3
SM_HP_HF_TOKEN=******************
SM_HP_LR=0.0002
SM_HP_MERGE_WEIGHTS=true
SM_HP_MODEL_ID=meta-llama/Llama-2-7b-hf
SM_HP_PER_DEVICE_TRAIN_BATCH_SIZE=2
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python310.zip:/opt/conda/lib/python3.10:/opt/conda/lib/python3.10/lib-dynload:/opt/conda/lib/python3.10/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.10 train.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token *************** --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-7b-hf --per_device_train_batch_size 2
2023-09-04 21:13:07,858 sagemaker-training-toolkit INFO     Exceptions not imported for SageMaker TF as Tensorflow is not installed.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/ml/code/train.py:5 in <module>                                          │
│                                                                              │
│     2 from functools import partial                                          │
│     3 import torch                                                           │
│     4 from transformers import (AutoModelForCausalLM, AutoTokenizer,set_seed │
│ ❱   5 import bitsandbytes as bnb                                             │
│     6 from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_tr │
│     7                                                                        │
│     8                                                                        │
╰──────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'bitsandbytes'
2023-09-04 21:13:10,983 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2023-09-04 21:13:10,984 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 1 from exiting process.
2023-09-04 21:13:10,984 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2023-09-04 21:13:10,984 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
ExitCode 1
ErrorMessage "ModuleNotFoundError: No module named 'bitsandbytes'"
Command "/opt/conda/bin/python3.10 train.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token ******************* --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-7b-hf --per_device_train_batch_size 2"
2023-09-04 21:13:10,984 sagemaker-training-toolkit ERROR    Encountered exit_code 1

2023-09-04 21:13:32 Uploading - Uploading generated training model
2023-09-04 21:13:32 Failed - Training job failed

Just realized I’m missing a requirements.txt file in my scripts directory. That should fix it.
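
In case anyone hits the same thing: the SageMaker training toolkit installs a requirements.txt found in source_dir before invoking the entry point, so the missing dependencies just need to be listed there. A minimal sketch of what scripts/requirements.txt might contain for this setup (the version pins below are illustrative assumptions, not tested values — pick ones compatible with the transformers 4.28 / PyTorch 2.0 container):

# installed automatically by the SageMaker training toolkit before train.py runs
# version pins are illustrative; match them to your container's torch/transformers versions
bitsandbytes==0.40.2
peft==0.4.0
accelerate==0.21.0

After adding the file, re-running estimator.fit() should show the pip install step in the job logs before the script is invoked.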
