Failing to run training script on SageMaker with HuggingFace estimator

I’m running my training script on a SageMaker HuggingFace estimator, but oddly I’m running into an issue where bitsandbytes is not found: ErrorMessage "ModuleNotFoundError: No module named 'bitsandbytes'"

Pretty sure I’m just missing something stupid, but I’m not able to spot anything. Am I doing anything wrong here?

import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define Training Job Name
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
model_id = 'meta-llama/Llama-2-7b-hf'
# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                             # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',    # path where sagemaker will save training dataset
  'epochs': 3,                                      # number of training epochs
  'per_device_train_batch_size': 2,                 # batch size for training
  'lr': 2e-4,                                       # learning rate used during training
  'hf_token': HfFolder.get_token(),                 # huggingface token to access llama 2
  'merge_weights': True,                            # whether to merge LoRA into the model (needs more memory)
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'train.py',        # training script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.8xlarge',   # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used by the training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the PyTorch version used in the training job
    py_version           = 'py310',           # the Python version used in the training job
    hyperparameters      = hyperparameters,   # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
)

data = {'training': training_input_path}  # S3 URI of the uploaded dataset (defined earlier in the notebook)

# starting the training job with our uploaded dataset as input
huggingface_estimator.fit(data, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
Using provided s3_resource
INFO:sagemaker:Creating training-job with name: huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897
2023-09-04 21:07:15 Starting - Starting the training job...
2023-09-04 21:07:41 Starting - Preparing the instances for training......
2023-09-04 21:08:28 Downloading - Downloading input data...
2023-09-04 21:08:48 Training - Downloading the training image..................
2023-09-04 21:12:00 Training - Training image download completed. Training in progress.......bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-09-04 21:13:06,200 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-09-04 21:13:06,214 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-04 21:13:06,223 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-09-04 21:13:06,225 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-09-04 21:13:07,563 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-04 21:13:07,798 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-04 21:13:07,821 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-04 21:13:07,832 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "current_instance_group": "homogeneousCluster",
    "current_instance_group_hosts": [
        "algo-1"
    ],
    "current_instance_type": "ml.g5.8xlarge",
    "distribution_hosts": [],
    "distribution_instance_groups": [],
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "dataset_path": "/opt/ml/input/data/training",
        "epochs": 3,
        "hf_token": "***********",
        "lr": 0.0002,
        "merge_weights": true,
        "model_id": "meta-llama/Llama-2-7b-hf",
        "per_device_train_batch_size": 2
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "instance_groups": [
        "homogeneousCluster"
    ],
    "instance_groups_dict": {
        "homogeneousCluster": {
            "instance_group_name": "homogeneousCluster",
            "instance_type": "ml.g5.8xlarge",
            "hosts": [
                "algo-1"
            ]
        }
    },
    "is_hetero": false,
    "is_master": true,
    "is_modelparallel_enabled": null,
    "is_smddpmprun_installed": true,
    "job_name": "huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-east-1-535772764458/huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897/source/sourcedir.tar.gz",
    "module_name": "train",
    "network_interface_name": "eth0",
    "num_cpus": 32,
    "num_gpus": 1,
    "num_neurons": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.g5.8xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.g5.8xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "train.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"dataset_path":"/opt/ml/input/data/training","epochs":3,"hf_token":"*************","lr":0.0002,"merge_weights":true,"model_id":"meta-llama/Llama-2-7b-hf","per_device_train_batch_size":2}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g5.8xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.8xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_CURRENT_INSTANCE_TYPE=ml.g5.8xlarge
SM_CURRENT_INSTANCE_GROUP=homogeneousCluster
SM_CURRENT_INSTANCE_GROUP_HOSTS=["algo-1"]
SM_INSTANCE_GROUPS=["homogeneousCluster"]
SM_INSTANCE_GROUPS_DICT={"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.8xlarge"}}
SM_DISTRIBUTION_INSTANCE_GROUPS=[]
SM_IS_HETERO=false
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=32
SM_NUM_GPUS=1
SM_NUM_NEURONS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-east-1-535772764458/huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","current_instance_group":"homogeneousCluster","current_instance_group_hosts":["algo-1"],"current_instance_type":"ml.g5.8xlarge","distribution_hosts":[],"distribution_instance_groups":[],"framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"dataset_path":"/opt/ml/input/data/training","epochs":3,"hf_token":"******************","lr":0.0002,"merge_weights":true,"model_id":"meta-llama/Llama-2-7b-hf","per_device_train_batch_size":2},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","instance_groups":["homogeneousCluster"],"instance_groups_dict":{"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.8xlarge"}},"is_hetero":false,"is_master":true,"is_modelparallel_enabled":null,"is_smddpmprun_installed":true,"job_name":"huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-1-535772764458/huggingface-qlora-2023-09-04-20-49-46-2023-09-04-21-07-14-897/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":32,"num_gpus":1,"num_neurons":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g5.8xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.8xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--dataset_path","/opt/ml/input/data/training","--epochs","3","--hf_token","*****************","--lr","0.0002","--merge_weights","True","--model_id","meta-llama/Llama-2-7b-hf","--per_device_train_batch_size","2"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_DATASET_PATH=/opt/ml/input/data/training
SM_HP_EPOCHS=3
SM_HP_HF_TOKEN=******************
SM_HP_LR=0.0002
SM_HP_MERGE_WEIGHTS=true
SM_HP_MODEL_ID=meta-llama/Llama-2-7b-hf
SM_HP_PER_DEVICE_TRAIN_BATCH_SIZE=2
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python310.zip:/opt/conda/lib/python3.10:/opt/conda/lib/python3.10/lib-dynload:/opt/conda/lib/python3.10/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.10 train.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token *************** --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-7b-hf --per_device_train_batch_size 2
2023-09-04 21:13:07,858 sagemaker-training-toolkit INFO     Exceptions not imported for SageMaker TF as Tensorflow is not installed.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/ml/code/train.py:5 in <module>                                          │
│                                                                              │
│     2 from functools import partial                                          │
│     3 import torch                                                           │
│     4 from transformers import (AutoModelForCausalLM, AutoTokenizer,set_seed │
│ ❱   5 import bitsandbytes as bnb                                             │
│     6 from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_tr │
│     7                                                                        │
│     8                                                                        │
╰──────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'bitsandbytes'
2023-09-04 21:13:10,983 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2023-09-04 21:13:10,984 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 1 from exiting process.
2023-09-04 21:13:10,984 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2023-09-04 21:13:10,984 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
ExitCode 1
ErrorMessage "ModuleNotFoundError: No module named 'bitsandbytes'"
Command "/opt/conda/bin/python3.10 train.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token ******************* --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-7b-hf --per_device_train_batch_size 2"
2023-09-04 21:13:10,984 sagemaker-training-toolkit ERROR    Encountered exit_code 1

2023-09-04 21:13:32 Uploading - Uploading generated training model
2023-09-04 21:13:32 Failed - Training job failed

Just realized I’m missing a requirements.txt file in my scripts directory. That should fix it.
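
In case anyone hits the same thing: the SageMaker training toolkit installs a requirements.txt found in source_dir before invoking the entry point, so the missing dependencies just need to be listed there. A minimal sketch of what scripts/requirements.txt might contain for this setup (the version pins below are illustrative assumptions, not tested values — pick ones compatible with the transformers 4.28 / PyTorch 2.0 container):

# installed automatically by the SageMaker training toolkit before train.py runs
# version pins are illustrative; match them to your container's torch/transformers versions
bitsandbytes==0.40.2
peft==0.4.0
accelerate==0.21.0

After adding the file, re-running estimator.fit() should show the pip install step in the job logs before the script is invoked.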
