Compile on t3 for Inf2 and prediction

I’ve just tried to use Inf2 with the Hugging Face API.
Everything works fine if I use a precompiled LLM from aws-neuron.

The problem appeared when I tried to compile my own model.

  1. Because of a somewhat odd corporate AWS setup, I can only create resources in us-west, so I created a t3.medium notebook there. However, I can deploy to us-east-2, so I can still use Inf2. The other implication is that I have no access to the logs in us-east-2, so all I can see is what is available in the notebook.

  2. Every example shows different package requirements. These are the packages I found necessary to make the neuron export type available for optimum-cli:

!python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
!pip install -U transformers_neuronx optimum.neuron optimum[neuronx] optimum-neuron[neuronx]

Is this correct?
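
For what it’s worth, a quick way to check the install is to ask the exporter itself; my assumption (not something from the docs) is that if the neuron target is registered, this prints its options instead of an “invalid choice” error:

# Sanity check: should print the neuron exporter's options if the
# packages above registered the "neuron" export target correctly.
!optimum-cli export neuron --help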

  3. Model export complains about ‘No neuron device available’. Different tutorials give conflicting information on whether an Inferentia2 host is needed for the model export. The model compiles, but it cannot validate on the t3 instance. I guess that’s OK (if I don’t need an Inf2 instance for the export):
2024-04-04T19:23:20Z Compiler status PASS
[Compilation Time] 65.04 seconds.
[Total compilation Time] 65.04 seconds.
Validating distilbert-base-uncased-distilled-squad model...
2024-Apr-04 19:23:25.796921 19073:19073 ERROR  TDRV:tdrv_get_dev_info                       No neuron device available
...

The export also seems a little random: sometimes it fails with a SIGTERM, sometimes it crashes my notebook.
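
For completeness, I believe the Python-side equivalent of the CLI export looks roughly like this (sketched from my reading of the optimum-neuron docs, not verified end to end; shapes match the CLI call further down):

# Sketch: exporting with the optimum-neuron Python API instead of the CLI.
from optimum.neuron import NeuronModelForQuestionAnswering
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-distilled-squad"
input_shapes = {"batch_size": 1, "sequence_length": 16}  # must match what you deploy with

# export=True triggers the neuronx-cc compilation; as with the CLI,
# the post-compile validation may still complain without a Neuron device.
neuron_model = NeuronModelForQuestionAnswering.from_pretrained(
    model_id, export=True, **input_shapes
)

save_dir = "distilbert_base_uncased_squad_neuron/"
neuron_model.save_pretrained(save_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(save_dir)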

  4. After creating a tar.gz of the model and uploading it to S3, I managed to deploy:
from sagemaker.huggingface.model import HuggingFaceModel

env = {
#   'HF_MODEL_ID':'distilbert-base-uncased-distilled-squad', # model_id from hf.co/models
  'HF_TASK':'question-answering' # NLP task you want to use for predictions
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   env=env,
   model_data=s3_model_uri,        # path to your model and script
   role=role,                      # iam role with permissions to create an Endpoint
   transformers_version="4.28.1",  # transformers version used
   pytorch_version="1.13.0",       # pytorch version used
   py_version='py38',              # python version used
   model_server_workers=2,         # number of workers for the model server
)

# Let SageMaker know that we've already compiled the model
huggingface_model._is_compiled_model = True

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,      # number of instances
    instance_type="ml.inf2.xlarge" # AWS Inferentia Instance
)

It prints out the “------------!” progress output, but seemingly it doesn’t work:

data = {
  "input": {
      "question": "How many apples?",
      "context": "Joe got 5 apples"
  }
}

# request
predictor.predict(**data)

TypeError: Predictor.predict() got an unexpected keyword argument 'input'

My main questions:

  • Can I export a model without an Inferentia2 device (e.g., on a t3 node)?
  • What is the exact list of packages I need?
  • Is there anything wrong with the code I’ve shared above?

A tested, working notebook for the same or a similar setup would be amazing!

Thanks

I figured it out, so I guess I’ll answer my own question:

  1. Yes, you can compile on a non-Inf2 instance/notebook (I used an ml.t3.medium).
  2. Packages required (or at least this is what worked for me):
!python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
!pip install mkl
!pip install -U transformers_neuronx optimum.neuron optimum[neuronx] optimum-neuron[neuronx]
  3. My compile call was correct; don’t worry about the validation error at the end:
!optimum-cli export neuron --model distilbert-base-uncased-distilled-squad --batch_size 1 --sequence_length 16 distilbert_base_uncased_squad_neuron/
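
The missing link between the compile call above and the deploy call below is the “tar.gz and S3 upload” step from my original post. A rough sketch of that step (the archive name and S3 key prefix are arbitrary choices, adjust them to your setup):

import tarfile
import sagemaker

sess = sagemaker.Session()

# SageMaker expects the model files at the root of model.tar.gz,
# so add the directory contents rather than the directory itself
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("distilbert_base_uncased_squad_neuron/", arcname=".")

# returns the s3:// URI that is passed as model_data / s3_model_uri below
s3_model_uri = sess.upload_data("model.tar.gz", key_prefix="neuron-distilbert-qa")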
  4. You have to define the batch size and sequence length for the model. This is working:
from sagemaker.huggingface.model import HuggingFaceModel

config = {
    "HF_MODEL_ID": "distilbert-base-uncased-distilled-squad", # model_id from hf.co/models
    "HF_TASK": "question-answering", # NLP task you want to use for predictions
    "HF_BATCH_SIZE": "1", # batch size used to compile the model
    "MAX_BATCH_SIZE": "1", # max batch size for the model
    "HF_SEQUENCE_LENGTH": "16", # length used to compile the model
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   env=config,
   model_data=s3_model_uri,        # path to your model and script
   role=role,                      # iam role with permissions to create an Endpoint
   transformers_version="4.28.1",  # transformers version used
   pytorch_version="1.13.0",       # pytorch version used
   py_version='py38',              # python version used
   model_server_workers=2,         # number of workers for the model server
)

# Let SageMaker know that we've already compiled the model
huggingface_model._is_compiled_model = True

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,      # number of instances
    instance_type="ml.inf2.xlarge" # AWS Inferentia Instance
)
  5. The predictor data format is slightly different from what I tried above; this worked for me:
data = {
      "question": "How many apples?",
      "context": "Joe got 5 apples"
}
# request
predictor.predict(data)

{'score': 0.621329128742218, 'start': 8, 'end': 9, 'answer': '5'}
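
And once you are done experimenting, tear things down so the ml.inf2.xlarge endpoint stops billing (standard SageMaker SDK calls):

# clean up the model and the endpoint created above
predictor.delete_model()
predictor.delete_endpoint()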
