Unable to deploy Falcon 40b OASST1 model into SageMaker TGI container

Hello, I am looking for help troubleshooting a deployment failure of the Falcon 40B model fine-tuned on the OpenAssistant dataset (OpenAssistant/falcon-40b-sft-top1-560).

I am following the instructions in "Securely deploy LLMs inside VPCs with Hugging Face and Amazon SageMaker", and I managed to convert the model's .bin files to safetensors while preserving all existing config files. I did this by loading the model locally with model = AutoModelForCausalLM.from_pretrained(...) and then calling model.save_pretrained(..., safe_serialization=True), roughly as in the sketch below.
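A minimal sketch of that conversion (the local folder names are placeholders; I believe trust_remote_code=True is needed because this Falcon checkpoint ships custom modeling code):

```python
from transformers import AutoModelForCausalLM

# Load the fine-tuned checkpoint from its original .bin files
# ("./falcon-40b-sft-top1-560" is a placeholder for my local copy).
model = AutoModelForCausalLM.from_pretrained(
    "./falcon-40b-sft-top1-560",
    trust_remote_code=True,
)

# Re-serialize the weights as .safetensors, keeping the config files as-is.
model.save_pretrained("./falcon-40b-safetensors", safe_serialization=True)
```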

I assume the folder containing all the files (without the .bin files) is correct, because I can load the model with AutoModelForCausalLM.from_pretrained(..., use_safetensors=True) and run inference:
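Concretely, the sanity check looked roughly like this (prompt and generation settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./falcon-40b-safetensors")
model = AutoModelForCausalLM.from_pretrained(
    "./falcon-40b-safetensors",
    use_safetensors=True,
    trust_remote_code=True,
)

# Quick generation to confirm the safetensors weights load correctly.
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```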

I packed everything into a model.tar.gz file, uploaded it to S3, and then requested the deployment, but it fails with the error shown below. I wonder what I am doing wrong.
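For completeness, this is roughly how I built and uploaded the archive (bucket name and paths are placeholders):

```python
import tarfile
from pathlib import Path

from sagemaker.s3 import S3Uploader

# Pack the model files (configs + .safetensors, no .bin files) at the
# root of the archive, as the container expects.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    for f in Path("./falcon-40b-safetensors").iterdir():
        tar.add(f, arcname=f.name)

# Upload the archive; "s3://my-bucket/falcon-40b-oasst" is a placeholder.
s3_model_uri = S3Uploader.upload(
    local_path="model.tar.gz",
    desired_s3_uri="s3://my-bucket/falcon-40b-oasst",
)
```

And here is the error: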

> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 209, in get_model
    return FlashRWSharded(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 161, in __init__
    model=model.to(device),
  File "/usr/src/transformers/src/transformers/modeling_utils.py", line 1903, in to
    return super().to(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
	>
NotImplementedError: Cannot copy out of meta tensor; no data!

Deployment code snippet:

```python
import json

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Placeholders for my actual values.
role = "arn:aws:iam::123456789012:role/my-sagemaker-role"
number_of_gpu = 4                 # e.g. ml.g5.12xlarge has 4 GPUs
instance_type = "ml.g5.12xlarge"
health_check_timeout = 600        # seconds
s3_model_uri = "s3://my-bucket/falcon-40b-oasst/model.tar.gz"

# TGI container configuration; the model is loaded from the unpacked
# model.tar.gz at /opt/ml/model instead of being pulled from the Hub.
config = {
    'HF_MODEL_ID': "/opt/ml/model",
    'HF_TASK': 'text-generation',
    'SM_NUM_GPUS': json.dumps(number_of_gpu),
    'MAX_INPUT_LENGTH': json.dumps(1024),
    'MAX_TOTAL_TOKENS': json.dumps(2048),
}

# Hugging Face LLM (TGI) container image
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="0.8.2",
)

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    model_data=s3_model_uri,
    env=config,
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)
```